Purpose

In this vignette, we derive the normality of five categories of topics. Normality from the Jensen-Shannon distance, \(d_{JS}\), (Endres and Schindelin 2003), defined as the square-root of the Jensen-Shannon divergence, \(D_{JS}\), (Lin 1991):

\[ D_{JS}(N,Q) = \frac{1}{2} \lbrack D_{KL}(N,R) + D_{KL}(P,R) \rbrack \]

with \(N,P\), probability distributions, \(R = \frac{1}{2} (N+P)\) the midpoint probability and \(D_{KL}\), the Kullback-Leibler divergence (Kullback and Leibler 1951). Here, we take \(N = \mathcal{N}(0,1)\), the standard normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\). As \(d_{JS} = 0\) when \(P = \mathcal{N}(0,1)\), to better meet the intuition of normality, we define normality as:

\[ \textrm{Normality} = 1 - d_{JS} \]

so that normality is closer to 1 when the \(P\) is closer to the standard normal distribution.

For a given topic, normality is defined for two types of distributions \(P\), forming the two axis of subsequent graphs:

  1. the probability distribution of the percentage of research devoted to the topic across countries;
  2. the probability distribution of topic probabilities across documents.

Data loading

We access the consolidated results stored in extdata using system.file().

general <- readRDS(system.file(
    "extdata",
    "consolidated_results_NSF_general.Rds",
    package = "wateReview")
)
specific <- readRDS(system.file(
    "extdata",
    "consolidated_results_NSF_specific.Rds",
    package = "wateReview")
)
methods <- readRDS(system.file(
    "extdata",
    "consolidated_results_methods.Rds",
    package = "wateReview")
)
budget <- readRDS(system.file(
    "extdata",
    "consolidated_results_water budget.Rds",
    package = "wateReview")
)
theme <- readRDS(system.file(
    "extdata",
    "consolidated_results_theme.Rds",
    package = "wateReview")
)

Deriving normality

First, we select the countries with at least 30 documents to keep the stastitics tidy.

library(wateReview)

Next, we calculate the normality across countries and topics for the five categories with:

topic_categories <- list(general, specific, methods, budget, theme)
distances <- lapply(topic_categories, function(category){
  probs <- reduce_docs_for_JSd(category)
  country_distance <- get_country_distance(probs)
  topic_distance <- get_topic_distance(probs)
  return(merge(country_distance, topic_distance, by = "topic"))
})

Visualizations

Let’s load some plotting libraries and create the base plots.

library(ggplot2)
library(ggpubr)
library(plotly)
normality_plots <- lapply(distances, function(category_distance){
  ggplot(category_distance, aes(topic_distance, country_distance, label = topic)) +
  geom_point() +
  theme_pubr() +
  coord_fixed() +
  labs(x = "Normality across documents",y = "Normality across countries")
})

Now, we adjust the base plots according to each category.

General

ggplotly(
  normality_plots[[1]] + labs(title = "General topics"),
  tooltip = c('topic')
  )

Specific

ggplotly(
  normality_plots[[2]] + labs(title = "Specific  topics"),
  tooltip = c('topic')
  )

Methods

ggplotly(
  normality_plots[[3]] + ylim(0.5,0.93) + xlim(0.26,0.67) + labs(title = "Research methods"),
  tooltip = c('topic')
  )

Budget

ggplotly(
  normality_plots[[4]] + ylim(0.5,0.93) + xlim(0.26,0.67) + labs(title = "Components of water budget"),
  tooltip = c('topic')
  )

Themes

ggplotly(
  normality_plots[[5]] + labs(title = "Themes"),
  tooltip = c('topic')
  )

References

Endres, Dominik Maria, and Johannes E Schindelin. 2003. “A New Metric for Probability Distributions.” IEEE Transactions on Information Theory. https://doi.org/10.1109/TIT.2003.813506.

Kullback, Solomon, and Richard A Leibler. 1951. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22 (1). JSTOR: 79–86. https://doi.org/10.1214/aoms/1177729694.

Lin, Jianhua. 1991. “Divergence Measures Based on the Shannon Entropy.” IEEE Transactions on Information Theory 37 (1). IEEE: 145–51. https://doi.org/10.1109/18.61115.