Purpose

This vignette showcases the workflow to process the .csv files resulting from the online query. For each language, the files are processed and checked for unique hits and titles.

Processing

languages <- c("english", "spanish", "portuguese")
for (language in languages){
    csv.dir <- "data/latin_america/corpus_csv/"
    csv.dir <- file.path(csv.dir, language)
    csv.files <- get_csv_files(csv.dir)
    citation_dataframe <- get_citation_dataframe(csv.files)
    citation_dataframe <- check_duplicate_row(citation_dataframe)
    citation_dataframe <- check_duplicate_title(citation_dataframe)
    write_citation_dataframe(csv.dir)
}

Resulting corpi

Language Number of non-duplicate query returns Number of documents automatically collected Number of document manually collected Total Corpus Size
English 29,365 21,197 (72%) 0 21,197 (72%)
Spanish 1,411 122 (8.6%) 875 997 (71%)
Portuguese 777 300 (39%) 261 561 (72%)