Purpose

This vignette documents the training part of the ML pipeline, which benchmarks multiple tasks against one or more learners, optionally with feature selection.

Libraries

library(RiverML)

Initialization

Parameters

Variables in all caps represent the script's parameters.

| variable name | variable description | type |
|---------------|----------------------|------|
| FINAL | selects the type of run; if FALSE, train, else predict | logical |
| PATH | path to the output directory | character |
| REGIONS | vector of region identifiers (e.g. "SFE") | character |
| LRN_IDS | vector of learner identifiers (e.g. "classif.randomForest") | character |
| PREPROC | vector of preprocessing identifiers (e.g. "scale") | character |
| TUNELENGTH | controls the number of hyper-parameter values for discrete tuning | numeric |
| ITERS | controls the number of iterations for random tuning | numeric |
| PROB | selects the type of output; if TRUE, probabilities, if FALSE, response | logical |
| FS | if TRUE, activates the feature selection | logical |
| FS_NUM | number of features to select if FS is TRUE | numeric |
| TUNE_FS | if TRUE, activates the feature selection tuning | logical |
| FS_NUM_LIST | vector of numbers of features to try if TUNE_FS is TRUE | numeric |
| NU | number of folds for the \(\nu\)-fold cross-validation | numeric |
| REPS | number of repetitions for the repeated \(\nu\)-fold cross-validation | numeric |
| SPCV | if TRUE, activates the spatial cross-validation | logical |
| INNER | defines the inner folds for the nested resampling | ResampleDesc |
| MES | list of measures from the mlr package; the first one is optimized against | list |
| INFO | controls the information printed by the training process | logical |

# TYPE OF RUN
FINAL <- FALSE

#PATH
PATH <- "F:/hguillon/research/FS_MI_corr/10CV10-CV10_benchmarks_au1u_FS"

# TASKS
REGIONS <- c("SFE", "ALLSAC", "NC", "NCC", "SCC", "SC", "K", "SJT", "SECA")

# LEARNERS
LRN_IDS <- c(
    "classif.nnTrain",
    "classif.svm",
    "classif.randomForest"
)

# PREPROCESSING
PREPROC <- c("zv", "center", "scale", "medianImpute")

# TUNING
TUNELENGTH <- 16 # multiple of 8 please
ITERS <- TUNELENGTH * 2

# OUTPUT
PROB <- TRUE

# FEATURE SELECTION
FS <- TRUE
FS_NUM <- 20
TUNE_FS <- TRUE
FS_NUM_LIST <- seq(2, 50, by = 1)

# RESAMPLING
NU <- 10
REPS <- 10
SPCV <- FALSE
INNER <- get_inner(FINAL, NU, REPS, SPCV)

# OPTIMIZATION MEASURE
MES <- list(mlr::multiclass.au1u, mlr::acc, mlr::mmce, mlr::multiclass.aunu, mlr::timetrain)
print(paste("Optimizing against:", MES[[1]]$id))
#> [1] "Optimizing against: multiclass.au1u"
INFO <- FALSE
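
For reference, below is a minimal sketch of the kind of inner resampling description that get_inner() is assumed to return, built with mlr::makeResampleDesc(); the actual implementation may differ (for example, it also takes FINAL into account).

# Hedged sketch of an inner resampling description (assumption: get_inner()
# wraps mlr::makeResampleDesc(); FINAL is ignored here for simplicity)
inner_sketch <- if (SPCV) {
    mlr::makeResampleDesc("SpRepCV", folds = NU, reps = REPS)
} else {
    mlr::makeResampleDesc("RepCV", folds = NU, reps = REPS)
}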

Fetching benchmark data if FINAL is TRUE

If the benchmark has already been run (with FINAL <- FALSE), the tuning results and the tuning entropy results can be retrieved with get_bestBMR_tuning_results().

if (FINAL){
    bestBMR_tune <- get_bestBMR_tuning_results("./data/FS/", "bestBMR_tune.Rds", REGIONS, LRN_IDS)
    bestBMR_lrnH <- get_bestBMR_tuning_results("./data/FS/", "bestBMR_lrnH.Rds", REGIONS, LRN_IDS)
}

These data are used to train the final models with the optimal number of features and the most stable values of the tuning parameters.

Starting the benchmark

Once the parameters have been correctly defined, the benchmark is handled by regional_benchmark(), and savePARAMETERS() documents the parameters with which it is run.

if (TUNE_FS){
    .PATH <- PATH
    FS_NUM <- NULL

    if (FINAL){
        FS_NUM_LIST <- bestBMR_lrnH %>% dplyr::pull(FS_NUM) %>% unique() %>% sort()
    }

    for (FS_NUM in FS_NUM_LIST){

        PATH <- paste(.PATH, FS_NUM, sep = "_")

        if (!dir.exists(PATH)) dir.create(PATH)
        savePARAMETERS(PATH, FINAL)

        ### BENCHMARK ###
        # Note: REDUCED is assumed to be defined in the calling environment;
        # it is not among the parameters documented above.
        bmrs <- regional_benchmark(regions = REGIONS,
            LRN_IDS, TUNELENGTH, INNER, ITERS, PROB, NU, REPS,
            PREPROC, FINAL, PATH, REDUCED, MES, INFO, FS, FS_NUM)
        names(bmrs) <- REGIONS
    }
} else {

    if (!dir.exists(PATH)) dir.create(PATH)
    savePARAMETERS(PATH, FINAL)

    ### BENCHMARK ###
    bmrs <- regional_benchmark(regions = REGIONS,
        LRN_IDS, TUNELENGTH, INNER, ITERS, PROB, NU, REPS,
        PREPROC, FINAL, PATH, REDUCED, MES, INFO, FS, FS_NUM)
    names(bmrs) <- REGIONS
}

Behind the scenes of regional_benchmark()

regional_benchmark() is a wrapper function that calls a number of other functions. The following steps outline what happens behind the scenes.

  1. Skip. Because regional_benchmark() is called inside the for (FS_NUM in FS_NUM_LIST) loop (see above), when FINAL is TRUE, regional_benchmark() skips the regions for which it does not need to compute final models.
  2. Data loading. This is handled by get_training_data().
  3. Data formatting. This is handled by fmt_labels(), sanitize_data() and get_coords().
  4. Feature selection. If FINAL is TRUE, the selected features are retrieved with get_bestBMR_tuning_results(). If FINAL is FALSE, the selected features are derived from the transformed training data using get_ppc() and preproc_data(). The resulting transformed data are first filtered for pairwise correlations higher than 0.95 with caret::findCorrelation(). Then, 500 subsampled mlr tasks are created with mlr::makeResampleDesc(), mlr::makeClassifTask(), mlr::makeResampleInstance() and mlr::filterFeatures(). The FS_NUM features most commonly selected across the 500 realizations are retained (see the first sketch after this list).
  5. Pre-processing. The target and training data are transformed using get_ppc() on the target data and preproc_data() on the training data. SMOTE is applied using get_smote_data() and get_smote_coords(), which both call resolve_class_imbalance(). A sketch of the pre-processing step is given after this list.
  6. Tasks. Tasks are obtained using mlr::makeClassifTask().
  7. Learners. Learners are constructed using get_learners() or get_final_learners().
  8. Compute benchmark. The benchmark is run with compute_final_model() or compute_benchmark() (which retrieves the outer folds of the nested resampling with get_outers()). A sketch of how tasks, learners and the benchmark fit together in plain mlr follows this list.
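
As mentioned in step 4, here is a hedged, self-contained sketch of the subsampled feature-selection step. The helper stable_feature_selection() and the "information.gain" filter are illustrative assumptions (the filter name may need a version-specific prefix such as "FSelectorRcpp_information.gain" in recent mlr releases); RiverML's actual filter and helpers may differ.

library(mlr)

# Illustrative helper (not part of RiverML): keeps the fs_num features most
# frequently retained by a filter across repeated subsamples of the data.
stable_feature_selection <- function(data, target, fs_num, n_rep = 500) {
    # Remove features with pairwise correlation above 0.95 (assumes numeric features)
    feats   <- setdiff(names(data), target)
    too_cor <- caret::findCorrelation(cor(data[feats]), cutoff = 0.95)
    if (length(too_cor) > 0) data <- data[, !(names(data) %in% feats[too_cor])]

    task  <- makeClassifTask(data = data, target = target)
    rdesc <- makeResampleDesc("Subsample", iters = n_rep, split = 0.8)
    rin   <- makeResampleInstance(rdesc, task)

    # Count how often each feature survives the filter across the subsamples
    votes <- table(unlist(lapply(seq_len(n_rep), function(i) {
        sub <- subsetTask(task, subset = rin$train.inds[[i]])
        getTaskFeatureNames(filterFeatures(sub, method = "information.gain", abs = fs_num))
    })))

    # Return the fs_num most commonly selected feature names
    names(sort(votes, decreasing = TRUE))[seq_len(fs_num)]
}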
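
The pre-processing identifiers in PREPROC ("zv", "center", "scale", "medianImpute") match caret::preProcess() method names, so a plausible sketch of what get_ppc() and preproc_data() wrap is shown below; training_data is a placeholder for a regional feature data frame, and the actual RiverML helpers may behave differently.

# Hedged sketch of the pre-processing step (assumption: get_ppc() estimates a
# caret preProcess object and preproc_data() applies it)
ppc <- caret::preProcess(training_data, method = PREPROC)
training_data_pp <- predict(ppc, newdata = training_data)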
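
Finally, as referenced in step 8, the sketch below shows how a task, a tuned learner, and the outer resampling come together in a plain mlr benchmark. The learner, its parameter set, and the iris data are illustrative placeholders; the real construction happens inside get_learners(), get_outers(), compute_benchmark() and compute_final_model().

library(mlr)

# Illustrative task (stands in for one regional task)
task <- makeClassifTask(id = "example", data = iris, target = "Species")

# One tuned learner: random search over the inner resampling (INNER),
# optimizing the first measure in MES, as in the vignette parameters
lrn  <- makeLearner("classif.randomForest", predict.type = "prob")
ps   <- makeParamSet(makeIntegerParam("mtry", lower = 1, upper = 4))
ctrl <- makeTuneControlRandom(maxit = ITERS)
tuned_lrn <- makeTuneWrapper(lrn, resampling = INNER, par.set = ps,
    control = ctrl, measures = MES, show.info = INFO)

# Outer folds of the nested resampling (the role played by get_outers())
outer <- makeResampleDesc("RepCV", folds = NU, reps = REPS)

# Run the benchmark
bmr <- benchmark(learners = list(tuned_lrn), tasks = task,
    resamplings = outer, measures = MES, show.info = INFO)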