EDA_trainingData()
|
Performs a simple visualization of multilabel training data using mldr package |
PerfVisMultilabel()
|
Make a performance plot for multilabel classification |
QA.AuthKeywords()
|
Quality analysis on the author keywords retrieved |
QA_EndNoteIdCorpusLDA()
|
Performs a quick QA on both sources for document ID |
QA_alignedData()
|
Perform QA/QC on aligned data |
QA_oldXnewPredictions()
|
Performs a quick quality analysis and comparison with historical predictions |
VizSpots()
|
Produces a chord diagram visualization with country cluster colors and topic categories |
add_abstractsToCorpus()
|
Add retrieved abstracts to the corpus |
align_dataWithEndNoteIdLDA()
|
Align databases based on shared ID |
align_dataWithEndNoteIdcorpus()
|
Align databases based on shared ID |
align_englishCorpus()
|
Align englishCorpus with matching records in both databases and assign web-scraped abstracts |
align_humanReadingTopicModel()
|
Identifies the subset of papers with validation data and aligns the databases |
article_selection()
|
Randomly select articles for human-reading |
assign_articles_to_players()
|
Create a .csv file with the information needed to download the documents |
assign_articles_to_readers()
|
Chunk the selected documents between a given number of human readers and copy them into a given directory |
check_duplicate_row()
|
Return the unique rows of a data.frame. |
check_duplicate_title()
|
Return a data.frame with unique Titles. |
clStab()
|
Evaluates the cluster stability |
consolidate_LDA_results()
|
Consolidates LDA results by adding year and country prediction to the topicDocs |
count_nas()
|
Count the number of NAs |
diversity_LAC()
|
Calculates diversity over the entire LAC region |
diversity_country()
|
Calculates the diversity for a given country |
filter_by_country()
|
Filter rows of a data.frame by country |
filter_columns()
|
Filter columns of a data.frame |
filter_dfm()
|
Filter the complete document-feature matrix to retain all features with an occurrence higher than the lowest occurrence of country tokens.
This function mainly serves to limit the size of the document-feature matrix |
fix_names()
|
Fix the format of the document names from an existing database |
format_data4get_ps()
|
Format the data to the format expected by get_ps |
generate_label_df()
|
Generate a data.frame of labels for a violin plot |
get_DocTermMatrix()
|
Read the document-term matrix created by the text-mining code |
get_EndNoteIdLDA()
|
Get the document ID from the LDA corpus database |
get_EndNoteIdcorpus()
|
Get the document ID from the EndNote query corpus database |
get_JSd_corpus()
|
Calculates the Jensen-Shannon distance for the corpus |
get_JSd_country()
|
Calculates the Jensen-Shannon distance for countries |
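The JSd functions above compute the Jensen-Shannon distance between probability distributions (e.g., topic proportions). As a language-agnostic illustration of the standard formula — a generic Python sketch, not the package's R implementation:

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    The distance is the square root of the Jensen-Shannon divergence:
    JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence with base-2 logs; 0 * log(0) taken as 0
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Identical distributions are at distance 0; disjoint ones at distance 1 (base 2).
print(jensen_shannon_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(jensen_shannon_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

With base-2 logarithms the distance is bounded in [0, 1], which makes it convenient for comparing countries or topics on a common scale.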
get_MLDR()
|
Internal function to get the MLDR object |
get_allAuthKeywords()
|
Extracts all author keywords from the metadata results |
get_allMetadata()
|
Extract metadata from Scopus or Web of Science identifiers |
get_binaryRelevanceLearners()
|
Convenience legacy function to create binary relevance wrappers from MLR |
get_boolean_AuthKeywords()
|
Transform the author keywords into a multilabel dataset |
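The transformation above turns per-document keyword lists into a boolean label matrix, the usual input shape for multilabel learning. A generic Python sketch of that reshaping (illustrative only — the package does this in R, and the example data are invented):

```python
def keywords_to_multilabel(docs_keywords):
    """Turn per-document keyword lists into a boolean multilabel table.

    Rows are documents, columns are the union of all keywords, and
    cell (i, k) is True when document i carries keyword k.
    """
    labels = sorted({k for kws in docs_keywords for k in kws})
    rows = [[k in set(kws) for k in labels] for kws in docs_keywords]
    return labels, rows

labels, rows = keywords_to_multilabel([
    ["brazil", "fisheries"],   # document 1 keywords (hypothetical)
    ["chile"],                 # document 2 keywords (hypothetical)
])
print(labels)  # ['brazil', 'chile', 'fisheries']
print(rows)    # [[True, False, True], [False, True, False]]
```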
get_chainingOrder()
|
Get chaining order from MLDR |
get_citation_dataframe()
|
Read .csv files to create a citation data.frame |
get_countries()
|
Extract the list of possible countries |
get_country_distance()
|
Get the Jensen-Shannon distance across countries |
get_csv_files()
|
List .csv files in a directory |
get_dfm()
|
Retrieve or create a document-feature matrix (dfm) from hard-coded options relevant to the current project |
get_dtm()
|
Get document-term matrix from a document-feature matrix and a list of tokens |
get_endnote_titles()
|
Retrieve the titles of the documents in the English corpus |
get_endnote_xml()
|
Parse the .xml database from EndNote |
get_ind_hasCountryTag()
|
Identifies whether a document has a country tag |
get_language()
|
Interactive prompt to select language |
get_language_dfs()
|
Read the citation data.frames and store them in a named list |
get_mail()
|
Extract email addresses from PDF documents
You probably want to execute this code on a Linux server to avoid issues with special-character handling on Windows |
get_meta_df()
|
Binds the separate language data.frames into a meta data.frame |
get_n_players()
|
Prompt to get the number of players |
get_network()
|
Creates the adjacency matrix of a bipartite network: countries to topics |
get_non_duplicate_pdfs()
|
Get the indices of non-duplicated PDF documents |
get_optimk()
|
Display the optimum number of clusters for a given clustering method |
get_pdf_files()
|
Get the list of files in a given directory |
get_ps()
|
Get the parameter set to tune for a given learner |
get_relevantCountries()
|
Extract country names to be searched for in the author keywords |
get_rootdir()
|
Convenience function to allow for interoperability between different systems |
get_samples()
|
Select the documents for downloading while correcting for bias in year and source |
get_scopusAbstract()
|
Parse the HTML page from Scopus to retrieve the abstract of a document |
get_short_long_term_pred()
|
Legacy function to assess the performance of a learner trained at the fine temporal scale against the aggregated temporal scale |
get_titleDocs()
|
Read topic model file titles |
get_titleInd()
|
Performs the cross-walk between human-reading databases and topic model databases using document titles |
get_topColors()
|
Get the dominant colors from country flags
The country flags are extracted from http://hjnilsson.github.io/country-flags/ |
get_topicDocs()
|
Read topic model data |
get_topic_distance()
|
Get the Jensen-Shannon distance across topics |
get_tuningPar()
|
Retrieve the most frequent best-tuned hyper-parameters
Ties are broken by taking the minimum |
get_validationHumanReading()
|
Read human reading database |
get_webscrapped_trainingLabels()
|
Read web-scraped training labels |
get_webscrapped_validationDTM()
|
Read training data (document-term matrix corresponding to the web-scraped labels) |
get_wosAbstract()
|
Use Elsevier API to retrieve the abstract of a document |
get_wosAuthKeywords()
|
Use Elsevier API to retrieve the author keywords of a document |
get_wosFullResult()
|
Use DOI to extract metadata |
get_wrappedLearnersList()
|
Create a list of wrapped learners |
join_database_shapefile()
|
Assign the data from the country database to the country shapefile |
make_AUCPlot()
|
Create a comparison plot of AUC |
make_corrmatrix()
|
Make a correlation matrix |
make_countryNetwork()
|
Creates a weighted adjacency matrix of probability of citation between countries |
make_country_tokens()
|
Tokenize labels, here a list of relevant countries |
make_dendrogram()
|
Make a dendrogram |
make_df_docs()
|
Creates a subset of the LDA topics data.frame as a theme data.frame |
make_hist()
|
Makes a density plot of a selected attribute |
make_humanReadingTrainingLabels()
|
Create training labels from aligned human reading database |
make_map()
|
Makes a map for a given attribute |
make_predictions()
|
Make randomForest predictions |
make_pretty_str()
|
This function takes care of formatting issues that appeared while aligning the databases coming from the query and from EndNote.
The function removes non-alphanumeric characters and some special characters, trims whitespace, and concatenates the result. |
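String normalization of this kind is what lets titles from two differently formatted databases compare equal. A generic Python sketch of the idea — assuming the normalization keeps only alphanumerics, lowercases, and concatenates; this is an illustration, not the package's R code:

```python
import re

def normalize_title(s):
    # Keep only lowercase alphanumerics and spaces, then concatenate the
    # remaining tokens so formatting differences between databases vanish.
    tokens = re.sub(r"[^0-9a-z ]", "", s.lower()).split()
    return "".join(tokens)

# Two variants of the same title normalize to the same matching key.
a = normalize_title("Fisheries  management: a review")
b = normalize_title("fisheries management A REVIEW")
print(a == b)  # True
```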
make_targetData()
|
Create the target data to predict from |
make_task()
|
Make an MLR task |
make_topicNetwork()
|
Creates a weighted adjacency matrix of probability of citation between topics |
make_trainingData()
|
Create training data for a multilabel classification |
make_trainingDataMulticlass()
|
Create training data for a multiclass classification |
make_webscrapped_trainingData()
|
Create a data.frame of multilabels using the web-scraped author keywords |
melt_df_country()
|
Manually melt a data.frame |
multiclassBenchmark()
|
Perform a benchmark between non-tuned algorithm adaptation and multilabel and binary relevance wrappers |
multilabelBenchmark()
|
Perform a benchmark between algorithm adaptation and multilabel and binary relevance wrappers |
normalize_adj_matrix()
|
Normalize an adjacency matrix by its rows (i.e., "from") |
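Row normalization turns raw citation counts into outgoing citation probabilities. A minimal Python sketch of the operation (illustrative only; the package works on R matrices):

```python
def normalize_rows(adj):
    """Divide each row of an adjacency matrix by its row sum, so that
    row i becomes the probability of citation *from* node i.
    Rows that sum to zero are left untouched to avoid division by zero."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total for v in row] if total else list(row))
    return out

print(normalize_rows([[2, 2], [0, 0]]))  # [[0.5, 0.5], [0, 0]]
```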
order_data()
|
Order the data to match the LDA order |
plot_df_country()
|
Plot each variable in a melted data.frame in a faceted ggplot |
plot_estimates()
|
Plot a time-estimate matrix for different numbers of downloaders |
print_estimate()
|
Print the time estimate for manual downloading |
query_QA_plots()
|
Make QA/QC plots for a given data.frame by comparing the corpus present in the query with the corpus actually collected |
read_citation_dataframe()
|
Read the citation data.frame exported by the query processing |
read_countries_database()
|
Read the country database file |
read_countries_shapefile()
|
Read the country shapefile |
read_survey_df()
|
Read the survey data |
reduce_docs_for_JSd()
|
Reduce the documents before calculating the Jensen-Shannon distance |
remove_country()
|
Remove country column |
remove_irrelevant()
|
Removes Irrelevant entries from the country column of a data.frame |
remove_year()
|
Remove year column |
remove_year_country()
|
Remove year and country columns |
rgb2hex()
|
Convert rgb to hex colors |
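RGB-to-hex conversion is a small, language-agnostic transformation; a Python sketch of the standard mapping (the package's version is an R helper):

```python
def rgb2hex(r, g, b):
    # Convert 0-255 RGB components into a "#RRGGBB" hex color string.
    return "#{:02X}{:02X}{:02X}".format(r, g, b)

print(rgb2hex(255, 0, 128))  # #FF0080
```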
select_data()
|
Select the type of data to display |
select_list()
|
Convenience function for the shinyApp |
source_plot()
|
Produce a barplot of the different sources present in the query and in the collected corpus |
transform_DTM()
|
This function formats the DTM to be used by the ML models
In particular, it merges together terms corresponding to one country, e.g. costa and rica |
transform_data()
|
Transform the data with centering, scaling and Box-Cox transformations |
update_database()
|
Update the database by marking the manually downloaded articles as in corpus |
write_citation_dataframe()
|
Save a data.frame. |