`normality.Rmd`

In this vignette, we derive the *normality* of five categories of topics. Normality is derived from the Jensen-Shannon distance, \(d_{JS}\) (Endres and Schindelin 2003), defined as the square root of the Jensen-Shannon divergence, \(D_{JS}\) (Lin 1991):

\[ D_{JS}(N,P) = \frac{1}{2} \lbrack D_{KL}(N,R) + D_{KL}(P,R) \rbrack \]

where \(N\) and \(P\) are probability distributions, \(R = \frac{1}{2} (N+P)\) is their midpoint distribution, and \(D_{KL}\) is the Kullback-Leibler divergence (Kullback and Leibler 1951). Here, we take \(N = \mathcal{N}(0,1)\), the standard normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\). Since \(d_{JS} = 0\) when \(P = \mathcal{N}(0,1)\), a distance is counter-intuitive as a measure of normality: the most normal distribution would score lowest. We therefore define normality as:

\[ \textrm{Normality} = 1 - d_{JS} \]

so that normality is closer to 1 when \(P\) is closer to the standard normal distribution.
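
As an illustration of the definitions above, the following minimal sketch computes normality for densities discretised on a common grid. It is not the implementation used by `wateReview`; the helper names (`kl_divergence`, `js_distance`, `discretise`, `normality`) and the grid are hypothetical. It also assumes base-2 logarithms, which the text above does not specify; with that choice \(d_{JS}\), and hence normality, lies between 0 and 1.

```r
# Sketch only: normality = 1 - d_JS between a discretised density P and N(0, 1).
# Assumption: base-2 logarithms, so that d_JS is bounded by 1.

kl_divergence <- function(p, q) {
  # Discrete Kullback-Leibler divergence; terms with p == 0 contribute 0
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

js_distance <- function(n, p) {
  r <- (n + p) / 2                                  # midpoint distribution R
  d_js <- (kl_divergence(n, r) + kl_divergence(p, r)) / 2
  sqrt(max(d_js, 0))                                # guard against rounding below 0
}

# Hypothetical grid on which all densities are evaluated
grid <- seq(-10, 10, length.out = 2001)
discretise <- function(density_values) density_values / sum(density_values)

n_ref   <- discretise(dnorm(grid))                  # reference N(0, 1)
p_same  <- discretise(dnorm(grid))                  # identical to the reference
p_shift <- discretise(dnorm(grid, mean = 3))        # normal shifted by 3
p_unif  <- discretise(dunif(grid, -10, 10))         # flat distribution

normality <- function(p) 1 - js_distance(n_ref, p)

normality(p_same)    # equals 1: P coincides with the reference
normality(p_shift)   # below 1
normality(p_unif)    # below 1
```

Restricting the sum in `kl_divergence()` to `p > 0` is safe here because the midpoint \(R\) is strictly positive wherever either input is.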

For a given topic, normality is defined for two types of distributions \(P\), which form the two axes of the subsequent graphs:

- the probability distribution of the percentage of research devoted to the topic across countries;
- the probability distribution of topic probabilities across documents.
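
To make the notion of "a distribution \(P\) for a topic" concrete, here is a hedged sketch of one way a vector of observed values (one value per document, or one per country) could be turned into a distribution comparable with \(\mathcal{N}(0,1)\). Everything in it is an assumption made for illustration: the simulated data, the z-score standardisation, the kernel density estimate, the base-2 logarithm and the function name `empirical_normality` are not taken from `wateReview`, whose helpers may proceed differently.

```r
# Sketch only: estimate normality of a sample against N(0, 1).
# Assumptions: z-score standardisation, kernel density estimate, log base 2.

set.seed(1)
# Hypothetical data for a single topic
x_bell   <- rnorm(500, mean = 0.2, sd = 0.05)   # roughly bell-shaped values
x_skewed <- rexp(500, rate = 10)                # strongly right-skewed values

empirical_normality <- function(x, grid = seq(-10, 10, length.out = 1024)) {
  z <- (x - mean(x)) / sd(x)                    # centre and scale the sample
  dens <- density(z, from = min(grid), to = max(grid), n = length(grid))
  p <- dens$y / sum(dens$y)                     # discretised estimate of P
  n <- dnorm(grid) / sum(dnorm(grid))           # discretised N(0, 1)
  r <- (n + p) / 2                              # midpoint distribution R
  kl <- function(a, b) {
    keep <- a > 0                               # 0 * log(0) is treated as 0
    sum(a[keep] * log2(a[keep] / b[keep]))
  }
  d_js <- (kl(n, r) + kl(p, r)) / 2
  1 - sqrt(max(d_js, 0))
}

empirical_normality(x_bell)     # closer to 1
empirical_normality(x_skewed)   # further from 1
```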

We access the consolidated results stored in `extdata` using `system.file()`.

    general <- readRDS(system.file(
      "extdata", "consolidated_results_NSF_general.Rds",
      package = "wateReview"
    ))
    specific <- readRDS(system.file(
      "extdata", "consolidated_results_NSF_specific.Rds",
      package = "wateReview"
    ))
    methods <- readRDS(system.file(
      "extdata", "consolidated_results_methods.Rds",
      package = "wateReview"
    ))
    budget <- readRDS(system.file(
      "extdata", "consolidated_results_water budget.Rds",
      package = "wateReview"
    ))
    theme <- readRDS(system.file(
      "extdata", "consolidated_results_theme.Rds",
      package = "wateReview"
    ))

First, we select the countries with at least 30 documents to keep the statistics tidy.

    library(wateReview)

Next, we calculate the normality across countries and topics for the five categories with:

    topic_categories <- list(general, specific, methods, budget, theme)
    distances <- lapply(topic_categories, function(category) {
      probs <- reduce_docs_for_JSd(category)
      country_distance <- get_country_distance(probs)
      topic_distance <- get_topic_distance(probs)
      return(merge(country_distance, topic_distance, by = "topic"))
    })
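
The final `merge()` call pairs, for each topic, the normality across countries with the normality across documents. The toy example below sketches that step with made-up values. The two-column data frames are an assumption about the shape of the helper outputs, not something documented here; only the column names `topic`, `country_distance` and `topic_distance` are taken from the code in this vignette.

```r
# Sketch only: joining two per-topic tables on the shared "topic" column.
# The values and the data-frame layout are hypothetical.

country_distance <- data.frame(
  topic = c("a", "b", "c"),
  country_distance = c(0.8, 0.6, 0.7)
)
topic_distance <- data.frame(
  topic = c("c", "a", "b"),            # row order need not match
  topic_distance = c(0.5, 0.9, 0.4)
)

# One row per topic present in both tables, with both normality columns
merged <- merge(country_distance, topic_distance, by = "topic")
merged
```

By default, `merge()` keeps only the topics present in both tables, so a topic missing from either input would be dropped from the result.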

Let’s load some plotting libraries and create the base plots.

    library(ggplot2)
    library(ggpubr)
    library(plotly)
    normality_plots <- lapply(distances, function(category_distance) {
      ggplot(category_distance, aes(topic_distance, country_distance, label = topic)) +
        geom_point() +
        theme_pubr() +
        coord_fixed() +
        labs(x = "Normality across documents", y = "Normality across countries")
    })

Now, we adjust the base plots according to each category.

Endres, Dominik Maria, and Johannes E Schindelin. 2003. “A New Metric for Probability Distributions.” *IEEE Transactions on Information Theory*. https://doi.org/10.1109/TIT.2003.813506.

Kullback, Solomon, and Richard A Leibler. 1951. “On Information and Sufficiency.” *The Annals of Mathematical Statistics* 22 (1): 79–86. https://doi.org/10.1214/aoms/1177729694.

Lin, Jianhua. 1991. “Divergence Measures Based on the Shannon Entropy.” *IEEE Transactions on Information Theory* 37 (1): 145–51. https://doi.org/10.1109/18.61115.