library(RiverML)
library(magrittr)

Merging benchmark results

The output files from the HPC are organized into one folder per learner, with sub-folders for each region. The following snippet merges all the benchmark results using getAllBMRS().

PATH <- "F:/hguillon/research/regional_benchmark/"
dirs <- list.dirs(PATH, recursive = FALSE, full.names = FALSE)

# Read the benchmark results stored in each learner folder
bmrs <- lapply(dirs, function(dir) getAllBMRS(file.path(PATH, dir), pattern = paste0(dir, "_")))
bmrs_perf <- lapply(bmrs, function(bmr) bmr$perf)
bmrs_tune <- lapply(bmrs[dirs != "baseline"], function(bmr) bmr$tune %>% dplyr::distinct())

# Stack the performance tables and merge the tuning tables
BMR_res <- list()
BMR_res$perf <- do.call(rbind, bmrs_perf)
BMR_res$tune <- bmrs_tune[[1]]
for (i in seq_along(bmrs_tune)[-1]) BMR_res$tune <- dplyr::full_join(BMR_res$tune, bmrs_tune[[i]])
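
The loop above can also be written more compactly by folding dplyr::full_join() over the list of tuning tables:

BMR_res$tune <- Reduce(dplyr::full_join, bmrs_tune)  # equivalent to the loop above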

The associated results in BMR_res are provided with the package.

data("BMR_res")
BMR_perf <- getBMR_perf_tune(BMR_res, type = "perf")
BMR_perf %>% dplyr::sample_n(10) %>%
    knitr::kable(digits = 3, format = "html", caption = "benchmark results") %>%
    kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
    kableExtra::scroll_box(width = "7in", height = "5in")
benchmark results
task.id  learner.id    iter  multiclass.au1u    acc   mmce  multiclass.aunu  timetrain  FS_NUM
NC       randomForest   100            0.968  0.853  0.147            0.967    130.775      45
SECA     naiveBayes      15            0.950  0.667  0.333            0.871      0.040      11
SAC      naiveBayes      28            0.836  0.255  0.745            0.759      0.010       5
K        kknn            19            0.944  0.848  0.152            0.939      0.000       6
SC       nnTrain         54            0.819  0.462  0.538            0.784   1205.046       6
SJT      kknn            38            0.992  0.750  0.250            0.992      0.010       8
K        naiveBayes      18            0.904  0.656  0.344            0.905      0.010      10
NC       randomForest    21            0.939  0.794  0.206            0.941     64.493      15
SFE      featureless     50            0.500  0.095  0.905            0.500      0.030      30
SC       kknn            88            0.858  0.615  0.385            0.849      0.030      18

Processing benchmark results

Performance plots

makeAverageAUCPlot(BMR_perf) + ggplot2::scale_x_continuous(breaks = c(5, 25, 50))

makeAverageAccPlot(BMR_perf) + ggplot2::scale_x_continuous(breaks = c(5, 25, 50))

makeTotalTimetrainPlot(BMR_perf) + ggplot2::scale_x_continuous(breaks = c(5, 25, 50))

Model selection

Model selection is handled by examining statistical differences between the distributions of performance metrics for successive iterations of model training with an increasing number of predictors. In the following plot, the brackets indicate statistically significant differences between distributions, assessed with a Dunn test, for the SAC region and the randomForest learner.

makeExampleModelSelectionPlot(BMR_perf, lrn = c("randomForest"), window_size = 7)[["SAC"]]
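
The comparison behind the brackets can be sketched outside the package. The snippet below is not RiverML's internal code: it assumes the dunn.test package is available and uses multiclass.au1u as the performance metric, with FS_NUM (the number of selected features) as the grouping variable.

# Sketch only: compare the AUC distributions across feature-set sizes for one
# region and learner with a Dunn test
library(dunn.test)
sac_rf <- subset(BMR_perf, task.id == "SAC" & learner.id == "randomForest")
with(sac_rf, dunn.test(x = multiclass.au1u, g = FS_NUM, method = "bonferroni"))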

The p-value is internally adjusted for multiple comparisons with a Bonferroni correction. The influence of the search window (window_size) is examined with makeWindowInfluencePlot(), which requires us to declare where the names of the selected features are stored for each iteration.

selectedPATH <- system.file("extdata/selectedFeatures/", package = "RiverML")
makeWindowInfluencePlot(BMR_perf, selectedPATH, lrn = "randomForest") +
    ggplot2::xlim(c(0,20))

The window size selection is mainly based on how the number of optimal predictors per class in a given region of study evolves with the window size, which balances a trade-off between two effects. A large window increases the likelihood of finding a statistically different performance distribution and selects a large number of predictors. A small window decreases that likelihood but might miss the inclusion of an important predictor that leads to a significant increase in performance. In the example above, with a window size of 8, there is a major jump in the number of predictors per class, which increases the likelihood of overfitting for the optimal models.
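
To make this trade-off concrete, the selection can be re-run for a few window sizes and the retained predictors counted with get_bestFeatureSets(), introduced in the next section. This is only a sketch: it assumes the returned object behaves like a list of feature-name vectors per region, which should be checked with str() before relying on the counts.

# Sketch: scan a few window sizes and count the retained predictors
window_sizes <- c(5, 7, 9)
n_selected <- sapply(window_sizes, function(w) {
    fs <- get_bestFeatureSets(BMR_perf, selectedPATH, window_size = w)
    mean(lengths(fs))  # average number of retained predictors across regions
})
data.frame(window_size = window_sizes, mean_n_selected = n_selected)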

Feature importance

The optimal feature sets are obtained for each learner by using get_bestFeatureSets().

bestFeatureSets <- get_bestFeatureSets(BMR_perf, selectedPATH, window_size = 7)

A convenience function, makeAllFeatureImportancePlotFS(), creates all the feature importance plots from the standpoint of feature selection, that is, evaluating which features are present in the optimal feature sets. Here is an example for the SFE region.

FeatureImportancePlots <- makeAllFeatureImportancePlotFS(BMR_perf, selectedPATH, bestFeatureSets)
FeatureImportancePlots[["SFE"]]

We can also investigate which features are often selected across all regions with getFreqBestFeatureSets().

freq_bestFeatureSets <- getFreqBestFeatureSets(bestFeatureSets)
freq_bestFeatureSets %>%
    dplyr::rename(var = group, Overall = prop) %>%
    dplyr::arrange(-Overall) %>%
    makeFeatureImportancePlot(first = 20) +
    ggplot2::labs(title = "Top 20 predictors in optimal RF models", y = "variable importance across all areas of study")

Performance of optimal models

We now turn to analyzing the tuning results contained in BMR_tune.

BMR_tune <- getBMR_perf_tune(BMR_res, type = "tune")
BMR_tune %>% dplyr::sample_n(10) %>%
    knitr::kable(digits = 3, format = "html", caption = "benchmark results") %>%
    kableExtra::kable_styling(bootstrap_options = c("hover", "condensed")) %>%
    kableExtra::scroll_box(width = "7in", height = "5in")
benchmark results
task.id  learner.id    iter  max.number.of.layers  hidden  activationfun   output  numepochs  learningrate  batchsize  momentum  hidden_dropout  visible_dropout  multiclass.au1u    acc   mmce  multiclass.aunu  timetrain  FS_NUM  mtry     cost  kernel
SFE      svm             22                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.963  0.841  0.159            0.959      0.085      41    NA     0.25  linear
SC       svm             65                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.928  0.764  0.236            0.910      0.037      18    NA     4.00  linear
NCC      svm              7                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.954  0.773  0.227            0.943      0.116      42    NA  1024.00  linear
SC       randomForest     4                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.866  0.592  0.408            0.819      0.083       3     3       NA  NA
NCC      randomForest    61                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.956  0.732  0.268            0.939      0.300      25     6       NA  NA
SCC      randomForest    35                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.934  0.669  0.331            0.916      0.142       4     2       NA  NA
SECA     randomForest    34                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.954  0.753  0.247            0.936      0.104      18     6       NA  NA
SCC      svm             49                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.918  0.568  0.432            0.873      1.000      11    NA   256.00  linear
SFE      nnTrain         80                     2      30           tanh  softmax         20          0.05         16      0.8               0                0            0.969  0.812  0.188            0.960      0.806      18    NA       NA  NA
SECA     svm              9                    NA      NA             NA       NA         NA            NA         NA        NA              NA               NA            0.902  0.470  0.530            0.830      0.096       3    NA   512.00  linear

These tuning results are subset to the optimal models with getBestBMRTune(). The resulting data.frame can be visualized with makeBestTuneAUCPlot() to compare the results of the optimal models.

bestBMR_tune <- getBestBMRTune(BMR_tune, bestFeatureSets)
makeBestTuneAUCPlot(bestBMR_tune)

Tuning entropy

Because the benchmark used nested resampling, we have access to a distribution of tuned hyper-parameters, which can be visualized with getTunePlot().

getTunePlot(bestBMR_tune, "nnTrain")
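
As a quick tabular complement to getTunePlot(), the tuned values can also be tallied directly. This assumes bestBMR_tune keeps the hyper-parameter columns shown in the BMR_tune table above, such as cost and kernel for the SVM learner.

# Tally the SVM cost and kernel values selected across the outer resampling iterations
bestBMR_tune %>%
    dplyr::filter(learner.id == "svm", task.id == "SFE") %>%
    dplyr::count(cost, kernel)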

In addition, while the previous plot focused on the optimal models, we can investigate how the tuning entropy evolves with the number of selected features using getBMRTuningEntropy(), getBestBMRTuningEntropy(), and makeTuningEntropyPlot(). Here is an example for the SFE region.

TUNELENGTH <- 16
BMR_lrnH <- getBMRTuningEntropy(BMR_tune, TUNELENGTH)
bestBMR_lrnH <- getBestBMRTuningEntropy(BMR_lrnH, bestFeatureSets, bestBMR_tune)
p_Hnorms <- makeTuningEntropyPlot(BMR_lrnH, bestBMR_lrnH)
p_Hnorms[["SFE"]]
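
For reference, here is one way to read the quantity being plotted. This is an interpretation rather than the package's exact code: the Shannon entropy of the tuned values selected across the outer resampling iterations, normalized by log(TUNELENGTH), i.e. the entropy of a uniform draw from the tuning grid.

# 0: the same value is always selected; 1: the selection is maximally spread out
normalized_entropy <- function(values, tunelength = TUNELENGTH) {
    p <- table(values) / length(values)  # empirical frequencies of the tuned values
    -sum(p * log(p)) / log(tunelength)
}
normalized_entropy(c(0.25, 0.25, 4, 256, 0.25))  # e.g. a vector of tuned SVM costs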