All functions

EDA_trainingData()

Performs a simple visualization of multilabel training data using the mldr package

PerfVisMultilabel()

Make a performance plot for multilabel classification

QA.AuthKeywords()

Quality analysis on the author keywords retrieved

QA_EndNoteIdCorpusLDA()

Performs a quick QA on both sources for document ID

QA_alignedData()

Perform QA/QC on aligned data

QA_oldXnewPredictions()

Performs a quick quality analysis and computes the difference with historical predictions

VizSpots()

Produces a chord diagram visualization with country cluster colors and topic categories

add_abstractsToCorpus()

Add retrieved abstracts to the corpus

align_dataWithEndNoteIdLDA()

Align databases based on shared ID

align_dataWithEndNoteIdcorpus()

Align databases based on shared ID

align_englishCorpus()

Align englishCorpus with matching records in both databases and assign web-scraped abstracts

align_humanReadingTopicModel()

Identifies the subset of papers with validation data and aligns the databases

article_selection()

Randomly select articles for human-reading

assign_articles_to_players()

Create a .csv file with the information needed to download the documents

assign_articles_to_readers()

Chunk the selected documents between a given number of human readers and copy them into a given directory

check_duplicate_row()

Return the unique rows of a data.frame.

check_duplicate_title()

Return a data.frame with unique Titles.
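
Both de-duplication helpers above can be mimicked with base R's duplicated(); a minimal sketch of the idiom, not the package's exact implementation (the Title column name is an assumption):

```r
# duplicated() flags rows already seen, so negating it keeps only the
# first occurrence. The "Title" column name is hypothetical.
df <- data.frame(Title = c("A", "B", "A"), Year = c(2001, 2002, 2001))
unique_rows   <- df[!duplicated(df), ]        # drops the repeated row
unique_titles <- df[!duplicated(df$Title), ]  # first row per Title
```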

clStab()

Evaluates the cluster stability

consolidate_LDA_results()

Consolidates LDA results by adding year and country predictions to the topicDocs

count_nas()

Count the number of NAs
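
In base R, counting NAs reduces to is.na() plus a sum; a minimal sketch of what such a helper typically does (not necessarily this package's implementation):

```r
# is.na() returns a logical matrix over the data.frame; summing it counts
# missing values, either per column or overall.
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))
na_per_column <- colSums(is.na(df))  # x = 1, y = 2
na_total      <- sum(is.na(df))      # 3
```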

diversity_LAC()

Calculates a diversity over the entire LAC region

diversity_country()

Calculates the diversity per country

filter_by_country()

Filter rows of a data.frame by country

filter_columns()

Filter columns of a data.frame

filter_dfm()

Filter the complete document-feature matrix to retain all features with occurrence higher than the lowest occurrence of country tokens. This function mainly serves to limit the size of the document-feature matrix.

fix_names()

Fix the format of the document names from an existing database

format_data4get_ps()

Convert the data into the format expected by get_ps

generate_label_df()

Generate a data.frame of labels for a violin plot

get_DocTermMatrix()

Read the document-term matrix created by the text-mining code

get_EndNoteIdLDA()

Get the document IDs from the LDA corpus database

get_EndNoteIdcorpus()

Get the document IDs from the EndNote query corpus database

get_JSd_corpus()

Calculates the Jensen-Shannon distance for the corpus

get_JSd_country()

Calculates the Jensen-Shannon distance for countries
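
The Jensen-Shannon distance used by both get_JSd_* functions is the square root of the Jensen-Shannon divergence; a generic base R sketch over two probability vectors (the package applies it to topic distributions):

```r
# Jensen-Shannon distance with base-2 logs, so the result lies in [0, 1].
# m is the midpoint distribution; kl() is the Kullback-Leibler divergence,
# using the convention 0 * log(0) = 0.
js_distance <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  sqrt((kl(p, m) + kl(q, m)) / 2)
}
js_distance(c(0.5, 0.5), c(0.5, 0.5))  # 0: identical distributions
js_distance(c(1, 0), c(0, 1))          # 1: disjoint distributions
```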

get_MLDR()

Internal function to get an MLDR object

get_allAuthKeywords()

Extracts all author keywords from the metadata results

get_allMetadata()

Extract metadata from Scopus or Web of Science identifiers

get_binaryRelevanceLearners()

Convenience legacy function to create binary relevance wrappers from MLR

get_boolean_AuthKeywords()

Transform the author keywords into a multilabel dataset

get_chainingOrder()

Get chaining order from MLDR

get_citation_dataframe()

Read .csv files to create a citation data.frame

get_countries()

Extract the list of possible countries

get_country_distance()

Get the Jensen-Shannon distance across countries

get_csv_files()

List .csv files in a directory
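
Listing files by extension is a one-liner in base R; a sketch of the idiom such a helper wraps (the demo directory is created only for illustration):

```r
# Anchoring the pattern at the end of the name excludes e.g. "data.csv.bak";
# full.names = TRUE would return full paths instead of bare file names.
d <- file.path(tempdir(), "csv_demo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("a.csv", "b.txt")))
list.files(d, pattern = "\\.csv$")  # "a.csv"
```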

get_dfm()

Retrieve or create a document-feature matrix (dfm) from hard-coded options relevant to the current project

get_dtm()

Get document-term matrix from a document-feature matrix and a list of tokens

get_endnote_titles()

Retrieve the titles of the documents in the English corpus

get_endnote_xml()

Parse the .xml database from EndNote

get_ind_hasCountryTag()

Identifies whether a document has a country tag

get_language()

Interactive prompt to select language

get_language_dfs()

Read the citation data frames and store them in a named list

get_mail()

Extract email addresses from PDF documents. You probably want to execute this code on a Linux server to avoid issues with special-character handling on Windows.

get_meta_df()

Binds the separate language data.frames into a meta data.frame

get_n_players()

Prompt to get the number of players

get_network()

Creates the adjacency matrix of a bipartite network: countries to topics

get_non_duplicate_pdfs()

Get the indices of duplicated documents

get_optimk()

Display the optimum number of clusters for a given clustering method

get_pdf_files()

Get the list of files in a given directory

get_ps()

Get the parameter set to tune for a given learner

get_relevantCountries()

Extract the country names to be searched for in the author keywords

get_rootdir()

Convenience function to allow for interoperability between different systems

get_samples()

Select the documents for downloading while correcting for bias in terms of year and sources

get_scopusAbstract()

Parse the HTML page from Scopus to retrieve the abstract of a document

get_short_long_term_pred()

Legacy function to assess the performance of a learner trained at the fine temporal scale against the aggregated temporal scale

get_titleDocs()

Read topic model file titles

get_titleInd()

Performs the cross-walk between the human-reading databases and the topic model databases using document titles

get_topColors()

Get the dominant colors from country flags. The country flags are extracted from http://hjnilsson.github.io/country-flags/

get_topicDocs()

Read topic model data

get_topic_distance()

Get the Jensen-Shannon distance across topics

get_tuningPar()

Retrieve the most frequent best-tuned hyper-parameters. Ties are broken by taking the minimum.

get_validationHumanReading()

Read the human-reading database

get_webscrapped_trainingLabels()

Read the web-scraped training labels

get_webscrapped_validationDTM()

Read the training data (document-term matrix corresponding to the web-scraped labels)

get_wosAbstract()

Use Elsevier API to retrieve the abstract of a document

get_wosAuthKeywords()

Use Elsevier API to retrieve the author keywords of a document

get_wosFullResult()

Use DOI to extract metadata

get_wrappedLearnersList()

Create a list of wrapped learners

join_database_shapefile()

Assign the data from the country database to the country shapefile

make_AUCPlot()

Create a comparison plot of AUC

make_corrmatrix()

Make a correlation matrix

make_countryNetwork()

Creates a weighted adjacency matrix of probability of citation between countries

make_country_tokens()

Tokenize labels, here a list of relevant countries

make_dendrogram()

Make a dendrogram

make_df_docs()

Subsets the LDA topics data.frame into a theme data.frame

make_hist()

Makes a density plot of a selected attribute

make_humanReadingTrainingLabels()

Create training labels from aligned human reading database

make_map()

Makes a map for a given attribute

make_predictions()

Make randomForest predictions

make_pretty_str()

Takes care of formatting issues that appeared while aligning the databases coming from the query and from EndNote. The function removes special and non-alphanumeric characters, trims whitespace, and concatenates the results.

make_targetData()

Create the target data to predict from

make_task()

Make an MLR task

make_topicNetwork()

Creates a weighted adjacency matrix of probability of citation between topics

make_trainingData()

Create training data for a multilabel classification

make_trainingDataMulticlass()

Create training data for a multiclass classification

make_webscrapped_trainingData()

Create a data.frame of multilabels using the web-scraped author keywords

melt_df_country()

Manually melt a data.frame

multiclassBenchmark()

Perform a benchmark between non-tuned algorithm adaptation methods, multilabel wrappers, and binary relevance wrappers

multilabelBenchmark()

Perform a benchmark between algorithm adaptation methods, multilabel wrappers, and binary relevance wrappers

normalize_adj_matrix()

Normalize an adjacency matrix by its rows (i.e., "from")
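
Row-normalization divides each row by its sum so the outgoing ("from") weights become probabilities; a minimal base R sketch of the operation described above (not the package's exact code):

```r
# sweep() divides every row of A by its row sum; all-zero rows are left
# unchanged to avoid division by zero.
normalize_rows <- function(A) {
  rs <- rowSums(A)
  rs[rs == 0] <- 1
  sweep(A, 1, rs, "/")
}
A <- matrix(c(1, 3,
              0, 2), nrow = 2, byrow = TRUE)
rowSums(normalize_rows(A))  # each row now sums to 1
```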

order_data()

Order data to LDA order

plot_df_country()

Plot each variable of a melted data.frame in a faceted ggplot

plot_estimates()

Plot a time-estimate matrix for different numbers of downloaders

print_estimate()

Print the time estimate for manual downloading

query_QA_plots()

Make QA/QC plots for a given data.frame by comparing the corpus present in the query with the corpus actually collected

read_citation_dataframe()

Read the citation data.frame exported by the query processing

read_countries_database()

Read the country database file

read_countries_shapefile()

Read the country shapefile

read_survey_df()

Read the survey data

reduce_docs_for_JSd()

Reduce the documents before calculating the Jensen-Shannon distance

remove_country()

Remove country column

remove_irrelevant()

Removes "Irrelevant" entries from the country column of a data.frame

remove_year()

Remove year column

remove_year_country()

Remove year and country columns

rgb2hex()

Convert rgb to hex colors
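
Base R's grDevices::rgb() already performs this conversion; a hedged sketch of what rgb2hex() likely wraps (the actual signature may differ):

```r
# With maxColorValue = 255, rgb() accepts 0-255 channel values and returns
# a "#RRGGBB" hex string.
rgb2hex_sketch <- function(r, g, b) grDevices::rgb(r, g, b, maxColorValue = 255)
rgb2hex_sketch(255, 0, 0)     # "#FF0000"
rgb2hex_sketch(70, 130, 180)  # "#4682B4" (steel blue)
```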

select_data()

Select the type of data to display

select_list()

Convenience function for the shinyApp

source_plot()

Produce a barplot of the different sources present in the query and in the collected corpus

transform_DTM()

Formats the DTM to be used by the ML models. In particular, it merges together terms corresponding to one country, e.g. "costa" and "rica".

transform_data()

Transform the data with centering, scaling and Box-Cox transformations

update_database()

Update the database by marking the manually downloaded articles as in corpus

write_citation_dataframe()

Save a data.frame.