select_human_reading.Rmd
This vignette explains how documents are selected for the human-reading. Unfortunately, this vignette manipulates actual .pdf
articles which we cannot distribute with this R
package. First, let’s load the package and define out_dir
output directory.
library(wateReview)
root_dir <- get_rootdir()
out_dir <- "exploitation/out/run79"
out_dir <- file.path(root_dir, out_dir)
Then, we have to indicate where the functions are expected to find the document .pdf
files. Here, the files are located in a *.Data
directory corresponding to files automagically retrieved by EndNote. The .pdf
files names are then read and formatted by fix_names
.
pdf_dir <- "data/latin_america/corpus_pdf/english/english.Data/"
pdf_dir <- file.path(root_dir, pdf_dir)
l <- get_pdf_files(file.path(root_dir, pdf_dir))
full_names <- unname(l$full_names)
Alternatively, full_names
can be accessed from an already derived database.
# 1. read csv databases
l <- fix_names("english", full_names)
# load(file = file.path(root_dir, out_dir, paste0("language_dfs_updated_2", ".rda")))
names <- l$names
full_names <- l$full_names
non_duplicate_index <- get_non_duplicate_pdfs(names)
select_ind <- article_selection(full_names, non_duplicate_index, ratio = 0.2)
number_of_readers <- 3
assign_articles_to_readers(select_ind, number_of_readers, paste0(out_dir, "/", lang))