Purpose

This vignette clarifies the calculation of the citation network. Unfortunately, this vignette requires a large file containing the entire data of the corpus which cannot be distributed with the package. For that reason, the code of this vignette is not executed.

Bi-partite citation networks

The citation network was extracted from article metadata which stores identifiers for the cited articles and citing article (i.e. the article to which the metadata are attached to). A total of 29,900 citations were found between 4,603 unique articles of the English corpus. The resulting bibliographic network is defined by its \(4,603\times4,603\)-adjacency matrix \(B\) and is filtered to remove edges between node external to the collected corpus. The citation network between countries, \(B_C\) is obtained by:

\[ B_C = C^T \times B \times C \]

where \(C\) is a \(4,603\times23\)-matrix containing the output of the machine learning probabilistic predictions of the location of study between \(23\) possible country of study. In consequence, \(B_C\) is a weighted directed adjacency matrix. Similar reasoning leads to the citation network between specific research topics, method topics and water budget topics by changing the matrix \(C\) in the previous equation for the LDA probabilistic predictions of topics.

library(wateReview)

Data loading

englishCorpus_file <- "F:/hguillon/research/exploitation/R/latin_america/data/english_corpus.Rds"
englishCorpus <- readRDS(englishCorpus_file)
in_corpus <- readRDS(system.file("extdata", "in_corpus.Rds", package = "wateReview"))
source_ids <- readRDS(system.file("extdata", "source_ids.Rds", package = "wateReview")) # aligned with corpus
citation_network <- readRDS(system.file("extdata", "citingDf.Rds", package = "wateReview"))

Cross-walk between topic model and EndNote databases

The following code chunks perform a cross-walk between the topic model and EndNote databases using document id: EndNoteIdcorpus and EndNoteIdLDA. This ensures that the two databases can communicate and are aligned.

Get document IDs

EndNoteIdcorpus <- get_EndNoteIdcorpus(in_corpus)
EndNoteIdLDA <- get_EndNoteIdLDA(englishCorpus)
QA.EndNoteIdCorpusLDA(EndNoteIdLDA, EndNoteIdcorpus)

Align databases

in_corpus <- align_dataWithEndNoteIdcorpus(in_corpus, EndNoteIdcorpus, EndNoteIdLDA)
source_ids <- align_dataWithEndNoteIdcorpus(source_ids, EndNoteIdcorpus, EndNoteIdLDA)
englishCorpus <- align_dataWithEndNoteIdLDA(englishCorpus, EndNoteIdLDA, EndNoteIdcorpus)
EndNoteIdLDA <- align_dataWithEndNoteIdLDA(EndNoteIdLDA, EndNoteIdLDA, EndNoteIdcorpus)
EndNoteIdcorpus <- align_dataWithEndNoteIdcorpus(EndNoteIdcorpus, EndNoteIdcorpus, EndNoteIdLDA)
QA_EndNoteIdCorpusLDA(EndNoteIdLDA, EndNoteIdcorpus)

Order databases according to topic model database

source_ids <- order_data(source_ids, EndNoteIdLDA, EndNoteIdcorpus)
in_corpus <- order_data(in_corpus, EndNoteIdLDA, EndNoteIdcorpus)
predCountryMembership <- readRDS(system.file("extdata", "predCountryMembership.Rds", package = "wateReview"))
colnames(predCountryMembership) <- gsub("prob.", "", colnames(predCountryMembership))

Make and save the citation networks

For countries, the citation network is obtained using make_countryNetwork() For topics, citation network are performed with make_topicNetwork() for six categories of topics: methods, NSF_general, NSF_specific, spatial scale, theme, and water budget.

countryNetwork <- make_countryNetwork(source_ids, citation_network)
type_list <- c("methods", "NSF_general", "NSF_specific", "spatial scale", "theme", "water budget")
for (tt in type_list) `make_topicNetwork`(source_ids, citation_network, type = tt, percentile.threshold = 0.90)