normality.Rmd
In this vignette, we derive the normality of five categories of topics. Normality is derived from the Jensen-Shannon distance, \(d_{JS}\) (Endres and Schindelin 2003), defined as the square root of the Jensen-Shannon divergence, \(D_{JS}\) (Lin 1991):
\[ D_{JS}(N,P) = \frac{1}{2} \lbrack D_{KL}(N,R) + D_{KL}(P,R) \rbrack \]
where \(N\) and \(P\) are probability distributions, \(R = \frac{1}{2} (N+P)\) is the midpoint distribution, and \(D_{KL}\) is the Kullback-Leibler divergence (Kullback and Leibler 1951). Here, we take \(N = \mathcal{N}(0,1)\), the standard normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\). Because \(d_{JS} = 0\) when \(P = \mathcal{N}(0,1)\), and to better match the intuition of normality, we define normality as:
\[ \textrm{Normality} = 1 - d_{JS} \]
so that normality is closer to 1 when \(P\) is closer to the standard normal distribution.
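The definition above can be illustrated with a minimal base R sketch. It is not the wateReview implementation: the helper names (`kl_divergence`, `js_distance`, `normality`), the discretization grid, and the choice of base-2 logarithms (which bounds \(d_{JS}\) between 0 and 1) are all assumptions made for illustration, and how the package builds \(P\) from topic probabilities is not shown here.

```r
# Kullback-Leibler divergence between two discrete distributions (log base 2).
# Terms with p == 0 contribute zero by convention.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon distance: square root of the Jensen-Shannon divergence.
js_distance <- function(n, p) {
  r <- 0.5 * (n + p)  # midpoint distribution R
  d_js <- 0.5 * (kl_divergence(n, r) + kl_divergence(p, r))
  sqrt(max(d_js, 0))  # guard against tiny negative rounding errors
}

# Normality as defined above: 1 - d_JS.
normality <- function(n, p) 1 - js_distance(n, p)

# Discretize densities on a common grid and normalize them to sum to 1.
# The grid is an illustrative assumption.
grid <- seq(-6, 6, length.out = 241)
discretize <- function(density) density / sum(density)

std_normal <- discretize(dnorm(grid))            # N = N(0, 1)
shifted    <- discretize(dnorm(grid, mean = 2))  # a shifted normal
uniform    <- discretize(rep(1, length(grid)))   # a flat distribution

normality(std_normal, std_normal)  # identical distributions
normality(std_normal, shifted)     # shifted away from N(0, 1)
normality(std_normal, uniform)     # far from bell-shaped
```

With these choices, a distribution identical to \(\mathcal{N}(0,1)\) has a normality of 1, and the value decreases as \(P\) departs from it.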
For a given topic, normality is defined for two types of distributions \(P\), forming the two axes of the subsequent graphs: the distribution across documents and the distribution across countries.
We access the consolidated results stored in extdata using system.file().
general <- readRDS(system.file(
  "extdata", "consolidated_results_NSF_general.Rds",
  package = "wateReview"
))
specific <- readRDS(system.file(
  "extdata", "consolidated_results_NSF_specific.Rds",
  package = "wateReview"
))
methods <- readRDS(system.file(
  "extdata", "consolidated_results_methods.Rds",
  package = "wateReview"
))
budget <- readRDS(system.file(
  "extdata", "consolidated_results_water budget.Rds",
  package = "wateReview"
))
theme <- readRDS(system.file(
  "extdata", "consolidated_results_theme.Rds",
  package = "wateReview"
))
First, we select the countries with at least 30 documents to keep the statistics tidy.
library(wateReview)
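The selection code itself is not shown in this vignette. As a minimal sketch of filtering countries by document count, assuming a hypothetical data frame `docs` with one row per document and a `country` column (neither name, nor the example data, comes from wateReview):

```r
# Hypothetical input: one row per document, tagged with its country.
docs <- data.frame(
  doc_id  = seq_len(75),
  country = rep(c("Brazil", "Chile", "Peru"), times = c(40, 30, 5)),
  stringsAsFactors = FALSE
)

min_docs <- 30  # threshold stated in the text

# Count documents per country and keep countries meeting the threshold.
docs_per_country <- table(docs$country)
kept_countries <- names(docs_per_country)[docs_per_country >= min_docs]

# Restrict the documents to the retained countries.
docs_filtered <- docs[docs$country %in% kept_countries, ]
kept_countries
```

In this toy data, the country with only 5 documents is dropped, while the country with exactly 30 documents is retained because the threshold is inclusive.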
Next, we calculate the normality across countries and topics for the five categories with:
topic_categories <- list(general, specific, methods, budget, theme)
distances <- lapply(topic_categories, function(category){
  probs <- reduce_docs_for_JSd(category)
  country_distance <- get_country_distance(probs)
  topic_distance <- get_topic_distance(probs)
  return(merge(country_distance, topic_distance, by = "topic"))
})
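The final `merge()` call joins the two normality tables on their shared `topic` column, so each topic ends up with one value per axis. The sketch below illustrates this with made-up data; the topic names and values are hypothetical, and the column names are assumed from the plotting aesthetics used later in this vignette rather than from the actual output of `get_country_distance()` or `get_topic_distance()`.

```r
# Toy per-topic normality tables; values are made up for illustration.
country_distance <- data.frame(
  topic = c("irrigation", "groundwater", "floods"),
  country_distance = c(0.62, 0.48, 0.71)
)
topic_distance <- data.frame(
  topic = c("floods", "irrigation", "groundwater"),
  topic_distance = c(0.55, 0.40, 0.66)
)

# Inner join on the shared "topic" key; the row order of the inputs does not matter.
merged <- merge(country_distance, topic_distance, by = "topic")
merged
```

Because `merge()` performs an inner join by default, a topic missing from either table would be dropped from the result.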
Let’s load some plotting libraries and create the base plots.
library(ggplot2)
library(ggpubr)
library(plotly)
normality_plots <- lapply(distances, function(category_distance){
  ggplot(category_distance, aes(topic_distance, country_distance, label = topic)) +
    geom_point() +
    theme_pubr() +
    coord_fixed() +
    labs(x = "Normality across documents", y = "Normality across countries")
})
Now, we adjust the base plots according to each category.
Endres, Dominik Maria, and Johannes E Schindelin. 2003. “A New Metric for Probability Distributions.” IEEE Transactions on Information Theory. https://doi.org/10.1109/TIT.2003.813506.
Kullback, Solomon, and Richard A Leibler. 1951. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22 (1). JSTOR: 79–86. https://doi.org/10.1214/aoms/1177729694.
Lin, Jianhua. 1991. “Divergence Measures Based on the Shannon Entropy.” IEEE Transactions on Information Theory 37 (1). IEEE: 145–51. https://doi.org/10.1109/18.61115.