|
EDA_trainingData()
|
Performs a simple visualization of multilabel training data using the mldr package |
|
PerfVisMultilabel()
|
Make a performance plot for multilabel classification |
|
QA.AuthKeywords()
|
Quality analysis of the retrieved author keywords |
|
QA_EndNoteIdCorpusLDA()
|
Performs a quick QA on both sources for document ID |
|
QA_alignedData()
|
Perform QA/QC on aligned data |
|
QA_oldXnewPredictions()
|
Performs a quick quality analysis and comparison against historical predictions |
|
VizSpots()
|
Produces a chord diagram visualization with country cluster colors and topic categories |
|
add_abstractsToCorpus()
|
Add retrieved abstracts to the corpus |
|
align_dataWithEndNoteIdLDA()
|
Align databases based on shared ID |
|
align_dataWithEndNoteIdcorpus()
|
Align databases based on shared ID |
|
align_englishCorpus()
|
Align englishCorpus with matching records in both databases and assign webscraped abstracts |
|
align_humanReadingTopicModel()
|
Identifies the subset of papers with validation data and aligns the databases |
|
article_selection()
|
Randomly select articles for human-reading |
|
assign_articles_to_players()
|
Create a .csv file with the information needed to download the documents |
|
assign_articles_to_readers()
|
Split the selected documents among a given number of human readers and copy them into a given directory |
|
check_duplicate_row()
|
Return the unique rows of a data.frame. |
|
check_duplicate_title()
|
Return a data.frame with unique Titles. |
|
clStab()
|
Evaluates the cluster stability |
|
consolidate_LDA_results()
|
Consolidates LDA results by adding year and country predictions to the topicDocs |
|
count_nas()
|
Count the number of NAs |
|
diversity_LAC()
|
Calculates a diversity over the entire LAC region |
|
diversity_country()
|
Calculates the diversity |
|
filter_by_country()
|
Filter rows of a data.frame by country |
|
filter_columns()
|
Filter columns of a data.frame |
|
filter_dfm()
|
Filter the complete document-feature matrix to retain all features with occurrence higher than the lowest occurrence of country tokens.
This function mainly serves to limit the size of the document-feature matrix |
|
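The thresholding rule behind `filter_dfm()` can be sketched in base R on a plain count matrix. This is a hedged illustration, not the package code: `filter_dfm_sketch` and the toy matrix are made up for the example, and `>=` is used so that the country tokens themselves survive the filter.

```r
# Hedged sketch (not the package implementation): keep every feature whose
# total count is at least the lowest total count among the country tokens.
filter_dfm_sketch <- function(dfm, country_tokens) {
  totals <- colSums(dfm)
  threshold <- min(totals[colnames(dfm) %in% country_tokens])
  dfm[, totals >= threshold, drop = FALSE]
}

# Toy 2-document x 4-feature matrix; "brazil" stands in for a country token
m <- matrix(c(3, 2, 1, 0, 0, 0, 4, 4), nrow = 2,
            dimnames = list(NULL, c("forest", "brazil", "rare", "water")))
filtered <- filter_dfm_sketch(m, "brazil")  # drops the rarer "rare" column
```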
fix_names()
|
Fix the format of the document names from an existing database |
|
format_data4get_ps()
|
Format the data to the format expected by get_ps |
|
generate_label_df()
|
Generate a data.frame of labels for a violin plot |
|
get_DocTermMatrix()
|
Read the document-term matrix created by the text mining code |
|
get_EndNoteIdLDA()
|
Get document IDs from the LDA corpus database |
|
get_EndNoteIdcorpus()
|
Get document IDs from the EndNote query corpus database |
|
get_JSd_corpus()
|
Calculates the Jensen-Shannon distance for countries |
|
get_JSd_country()
|
Calculates the Jensen-Shannon distance for countries |
|
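Both `get_JSd_*()` helpers rely on the Jensen-Shannon distance, i.e. the square root of the Jensen-Shannon divergence between two probability distributions. A minimal base-R version for illustration (`js_distance` is not a package function):

```r
# Jensen-Shannon distance between two probability vectors p and q.
# With log base 2 the distance is bounded in [0, 1].
js_distance <- function(p, q, base = 2) {
  p <- p / sum(p)
  q <- q / sum(q)
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b, base = base), 0))
  sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
}

js_distance(c(1, 0), c(0, 1))          # disjoint distributions: distance 1
js_distance(c(0.5, 0.5), c(0.5, 0.5))  # identical distributions: distance 0
```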
get_MLDR()
|
Internal function to get the MLDR object |
|
get_allAuthKeywords()
|
Extracts all author keywords from the metadata results |
|
get_allMetadata()
|
Extract metadata from Scopus or Web of Science identifiers |
|
get_binaryRelevanceLearners()
|
Convenience legacy function to create binary relevance wrappers from MLR |
|
get_boolean_AuthKeywords()
|
Transform the author keywords into a multilabel dataset |
|
get_chainingOrder()
|
Get chaining order from MLDR |
|
get_citation_dataframe()
|
Read .csv files to create a citation data.frame |
|
get_countries()
|
Extract the list of possible countries |
|
get_country_distance()
|
Get the Jensen-Shannon distance across countries |
|
get_csv_files()
|
List .csv files in a directory |
|
get_dfm()
|
Retrieve or create a document-feature matrix (dfm) from hard-coded options relevant to the current project |
|
get_dtm()
|
Get document-term matrix from a document-feature matrix and a list of tokens |
|
get_endnote_titles()
|
Retrieve the titles of the documents in the English corpus |
|
get_endnote_xml()
|
Parse the .xml database from EndNote |
|
get_ind_hasCountryTag()
|
Identifies whether a document has a country tag |
|
get_language()
|
Interactive prompt to select language |
|
get_language_dfs()
|
Read the citation data frames and store them in a named list |
|
get_mail()
|
Extract email addresses from PDF documents
You probably want to execute this code on a Linux server to avoid issues with special character handling on Windows |
|
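The email-extraction step can be sketched with a regular expression over already-extracted text. This is a hedged sketch: `extract_emails` is hypothetical, and the package function additionally handles reading the text out of the PDFs.

```r
# Pull unique email addresses out of character vectors with a simple regex.
extract_emails <- function(text) {
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  unique(unlist(regmatches(text, gregexpr(pattern, text))))
}

extract_emails("Contact jane.doe@example.org or bob@lab.edu.")
```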
get_meta_df()
|
Binds the separate language data.frames into a meta data.frame |
|
get_n_players()
|
Prompt to get the number of players |
|
get_network()
|
Creates the adjacency matrix of a bipartite network: countries to topics |
|
get_non_duplicate_pdfs()
|
Get the indices of duplicated documents |
|
get_optimk()
|
Display the optimum number of clusters for a given clustering method |
|
get_pdf_files()
|
Get the list of files in a given directory |
|
get_ps()
|
Get the parameter set to tune for a given learner |
|
get_relevantCountries()
|
Extract country names to be searched for in the author keywords |
|
get_rootdir()
|
Convenience function to allow for inter-operability between different systems |
|
get_samples()
|
Select the documents for downloading while correcting for bias in terms of year and sources |
|
get_scopusAbstract()
|
Parse the HTML page from Scopus to retrieve the abstract of a document |
|
get_short_long_term_pred()
|
Legacy function to assess the performance of a learner trained at the fine temporal scale against the aggregated temporal scale |
|
get_titleDocs()
|
Read topic model file titles |
|
get_titleInd()
|
Performs the cross-walk between the human reading and topic model databases using document titles |
|
get_topColors()
|
Get the dominant colors from country flags
The country flags are retrieved from http://hjnilsson.github.io/country-flags/ |
|
get_topicDocs()
|
Read topic model data |
|
get_topic_distance()
|
Get the Jensen-Shannon distance across topics |
|
get_tuningPar()
|
Retrieve the most frequent best-tuned hyper-parameters
Ties are broken by taking the minimum |
|
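The tie-breaking rule described above (most frequent value, minimum on ties) can be sketched for a numeric hyper-parameter as follows; this is illustrative only, and `most_frequent_min` is not a package function:

```r
# Most frequent value of a numeric vector; ties are broken by the minimum.
most_frequent_min <- function(x) {
  tab <- table(x)
  winners <- as.numeric(names(tab)[tab == max(tab)])
  min(winners)
}

most_frequent_min(c(10, 10, 500, 500, 1000))  # 10 and 500 tie; returns 10
```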
get_validationHumanReading()
|
Read human reading database |
|
get_webscrapped_trainingLabels()
|
Read webscraped training labels |
|
get_webscrapped_validationDTM()
|
Read training data (document-term matrix corresponding to the webscraped labels) |
|
get_wosAbstract()
|
Use the Elsevier API to retrieve the abstract of a document |
|
get_wosAuthKeywords()
|
Use the Elsevier API to retrieve the author keywords of a document |
|
get_wosFullResult()
|
Use DOI to extract metadata |
|
get_wrappedLearnersList()
|
Create a list of wrapped learners |
|
join_database_shapefile()
|
Assign the data from the country database to the country shapefile |
|
make_AUCPlot()
|
Create a comparison plot of AUC |
|
make_corrmatrix()
|
Make a correlation matrix |
|
make_countryNetwork()
|
Creates a weighted adjacency matrix of probability of citation between countries |
|
make_country_tokens()
|
Tokenize labels, here a list of relevant countries |
|
make_dendrogram()
|
Make a dendrogram |
|
make_df_docs()
|
Subsets the LDA topics data.frame into a theme data.frame |
|
make_hist()
|
Makes a density plot of a selected attribute |
|
make_humanReadingTrainingLabels()
|
Create training labels from aligned human reading database |
|
make_map()
|
Makes a map for a given attribute |
|
make_predictions()
|
Make randomForest predictions |
|
make_pretty_str()
|
This function takes care of some formatting issues that appeared when aligning the databases coming from the query and from EndNote.
The function removes alphanumeric characters and some special characters, trims whitespace, and concatenates the strings. |
|
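That kind of string normalisation might look like the following in base R. This is a loose sketch under my own assumptions about which characters are stripped and about case-folding; `make_pretty_str_sketch` is hypothetical, not the package function:

```r
# Normalise a character vector: strip non-alphanumeric characters,
# trim whitespace, lower-case, and concatenate into one string.
make_pretty_str_sketch <- function(x) {
  x <- gsub("[^[:alnum:] ]", "", x)  # keep only letters, digits, and spaces
  x <- trimws(tolower(x))            # trim and normalise case
  paste(x, collapse = " ")
}

make_pretty_str_sketch(c("  Costa-Rica:", "biodiversity! "))
```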
make_targetData()
|
Create the target data to predict from |
|
make_task()
|
Make an MLR task |
|
make_topicNetwork()
|
Creates a weighted adjacency matrix of probability of citation between topics |
|
make_trainingData()
|
Create training data for a multilabel classification |
|
make_trainingDataMulticlass()
|
Create training data for a multiclass classification |
|
make_webscrapped_trainingData()
|
Create a data.frame of multilabels using the webscraped author keywords |
|
melt_df_country()
|
Manually melt a data.frame |
|
multiclassBenchmark()
|
Perform a benchmark between non-tuned algorithm adaptation methods, multilabel wrappers, and binary relevance wrappers |
|
multilabelBenchmark()
|
Perform a benchmark between algorithm adaptation methods, multilabel wrappers, and binary relevance wrappers |
|
normalize_adj_matrix()
|
Normalize an adjacency matrix by its rows (i.e., "from") |
|
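Row normalisation of an adjacency matrix is a one-liner with `sweep()`. The sketch below (not the package code; `normalize_rows` is made up for the example) also guards against all-zero rows:

```r
# Divide each row of A by its row sum so that rows become
# outgoing ("from") probability distributions.
normalize_rows <- function(A) {
  rs <- rowSums(A)
  rs[rs == 0] <- 1  # leave all-zero rows untouched instead of dividing by 0
  sweep(A, 1, rs, "/")
}

A <- matrix(c(1, 3, 1, 1), nrow = 2)  # rows: (1, 1) and (3, 1)
normalize_rows(A)                     # rows: (0.5, 0.5) and (0.75, 0.25)
```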
order_data()
|
Order data to LDA order |
|
plot_df_country()
|
Plot each variable in a melted data.frame in a faceted ggplot |
|
plot_estimates()
|
Plot a time-estimate matrix for different numbers of downloaders |
|
print_estimate()
|
Print the time estimate for manual downloading |
|
query_QA_plots()
|
Make QA/QC plots for a given data.frame by comparing the corpus present in the query with the one actually collected |
|
read_citation_dataframe()
|
Read the citation data.frame exported by the query processing |
|
read_countries_database()
|
Read the country database file |
|
read_countries_shapefile()
|
Read the country shapefile |
|
read_survey_df()
|
Read the survey data |
|
reduce_docs_for_JSd()
|
Reduce the documents before calculating the Jensen-Shannon distance |
|
remove_country()
|
Remove country column |
|
remove_irrelevant()
|
Removes "Irrelevant" entries from the country column of a data.frame |
|
remove_year()
|
Remove year column |
|
remove_year_country()
|
Remove year and country columns |
|
rgb2hex()
|
Convert RGB to hex colors |
|
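Converting RGB triplets to hex strings needs no dependency; `sprintf()` does it directly, and `grDevices::rgb(..., maxColorValue = 255)` is the built-in alternative. The sketch below is illustrative, not the package code:

```r
# Format 0-255 RGB components as a "#RRGGBB" hex colour string.
rgb2hex_sketch <- function(r, g, b) sprintf("#%02X%02X%02X", r, g, b)

rgb2hex_sketch(255, 0, 128)  # "#FF0080"
```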
select_data()
|
Select the type of data to display |
|
select_list()
|
Convenience function for the shinyApp |
|
source_plot()
|
Produce a barplot of the different sources present in the query and in the collected corpus |
|
transform_DTM()
|
This function formats the DTM to be used by the ML models.
In particular, it merges terms corresponding to one country, e.g., "costa" and "rica". |
|
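One plausible way to merge a multi-word country term in a document-term matrix is to sum the component columns into a single combined column. This is an assumption for illustration only; the package may instead take, e.g., the minimum of the component counts, and `merge_terms` is hypothetical:

```r
# Replace the columns named in `parts` with one summed column named `merged`.
merge_terms <- function(dtm, parts, merged) {
  combined <- rowSums(dtm[, parts, drop = FALSE])
  out <- cbind(dtm[, !(colnames(dtm) %in% parts), drop = FALSE], combined)
  colnames(out)[ncol(out)] <- merged
  out
}

dtm <- matrix(c(1, 0, 1, 0, 2, 3), nrow = 2,
              dimnames = list(NULL, c("costa", "rica", "forest")))
dtm_merged <- merge_terms(dtm, c("costa", "rica"), "costa_rica")
```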
transform_data()
|
Transform the data with centering, scaling and Box-Cox transformations |
|
update_database()
|
Update the database by marking the manually downloaded articles as in corpus |
|
write_citation_dataframe()
|
Save a data.frame. |