Selecting documents for human-reading • wateReview

This vignette explains how documents are selected for the human-reading. Unfortunately, this vignette manipulates actual .pdf articles which we cannot distribute with this R package. First, let’s load the package and define out_dir output directory.

library(wateReview)
root_dir <- get_rootdir()
out_dir <- "exploitation/out/run79"
out_dir <- file.path(root_dir, out_dir)

Then, we have to indicate where the functions are expected to find the document .pdf files. Here, the files are located in a *.Data directory corresponding to files automagically retrieved by EndNote. The .pdf files names are then read and formatted by fix_names.

pdf_dir <- "data/latin_america/corpus_pdf/english/english.Data/"
pdf_dir <- file.path(root_dir, pdf_dir)
l <- get_pdf_files(file.path(root_dir, pdf_dir))
full_names <- unname(l$full_names)

Alternatively, full_names can be accessed from an already derived database.

# 1. read csv databases
l <- fix_names("english", full_names)
# load(file = file.path(root_dir, out_dir, paste0("language_dfs_updated_2", ".rda")))
names <- l$names
full_names <- l$full_names

non_duplicate_index <- get_non_duplicate_pdfs(names)
select_ind <- article_selection(full_names, non_duplicate_index, ratio = 0.2)

number_of_readers <- 3
assign_articles_to_readers(select_ind, number_of_readers, paste0(out_dir, "/", lang))