Purpose

Once final optimal models have been trained, predictions can be made.

Libraries

library(RiverML)
library(magrittr)
library(raster)
#> Loading required package: sp
#> 
#> Attaching package: 'raster'
#> The following object is masked from 'package:magrittr':
#> 
#>     extract

Initialization

The first step in making predictions is retrieving the trained models and parameters from the benchmark. As users are likely running more than one iteration of a benchmark (or, in general, more than one benchmark), this is handled by the PATH variable. This is similar to the PATH specified when training the models.

The PATH directory should contain a PARAMETERS.Rds file with the final model parameters. We can retrieve the parameters we need from it with:

PARAMETERS <- readRDS(file.path(PATH, "PARAMETERS.Rds"))
PROB <- PARAMETERS$PROB
PREPROC <- PARAMETERS$PREPROC
REGIONS <- PARAMETERS$REGIONS
LRN_IDS <- PARAMETERS$LRN_IDS

We also have to specify whether feature selection was activated. If so, set TUNE_FS <- TRUE and edit the PATH variable accordingly. Additionally, we perform some cosmetic changes to the .id variables and filter for the learners that have a final model.

if (TUNE_FS){
    PATH <- "F:/hguillon/research/FS_MI_corr2"
    bestFeatureSets <- readRDS("./data/FS/bestFeatureSets.Rds") %>%
        dplyr::mutate(
            learner.id = paste0("classif.", learner.id),
            task.id = gsub("SAC", "ALLSAC", task.id)
        ) %>%
        dplyr::filter(learner.id %in% LRN_IDS)
}

Computing predictions

Uncalibrated predictions on the training dataset

Obtaining the predictions for the training dataset is handled by get_predictions() with TARGET = FALSE.

training_preds <- get_predictions(REGIONS, LRN_IDS, PATH, TARGET = FALSE, TUNE_FS)

Uncalibrated predictions on the target dataset

Obtaining the predictions for the target dataset is handled by get_predictions() with TARGET = TRUE.

target_preds <- get_predictions(REGIONS, LRN_IDS, PATH, TARGET = TRUE, TUNE_FS)

Calibrating predictions

Most machine learning models output posterior probabilities that often require calibration. Such calibration corrects for the potential distortion of the posterior probabilities when compared to empirical probabilities and improves model performance (DeGroot and Fienberg 1983; Zadrozny 2002; Niculescu-Mizil and Caruana 2005). Given the sigmoid shape of most distortions, Platt and others (1999) proposed a sigmoid calibration to address this effect, the so-called Platt’s scaling. Other useful approaches include Bayesian calibration and isotonic scaling (Zadrozny and Elkan 2002), which both require binarizing the problem. Useful references on binarization include Platt, Cristianini, and Shawe-Taylor (2000); Kijsirikul and Ussivakul (2002); Dong, Frank, and Kramer (2005); Lorena and Carvalho (2010); Quiterio and Lorena (2016); Adnan and Islam (2015); Melnikov and Hüllermeier (2018). In this study, posterior calibration was performed using a multinomial regression, a straightforward extension of the binomial case corresponding to the logistic Platt’s scaling. We use the R package glmnet to fit a generalized linear model with an elastic net penalty and 10-fold cross-validation with the function get_calibrations().

The option plot = TRUE returns a calibration plot which highlights the amount of calibration per class. It is always a good sign when the predictions do not require a large amount of correction, and this serves as a qualitative check on the performance of one or multiple models. Finally, note that the calibrations are performed with the original, and possibly unequally sampled, training dataset.

calibrations <- get_calibrations(REGIONS, LRN_IDS, training_preds, plot = TRUE)
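
For intuition, the sketch below shows what such a multinomial, Platt-style calibration could look like when fit directly with glmnet. It is not the get_calibrations() implementation; probs, labels and new_probs are hypothetical placeholders for the uncalibrated training probabilities, the true classes and the uncalibrated target probabilities, and alpha = 0.5 is an illustrative elastic net mix.

# Minimal sketch of a multinomial calibration with an elastic net penalty
# and 10-fold cross-validation (illustrative only, not get_calibrations()).
calibration_fit <- glmnet::cv.glmnet(
    x = as.matrix(probs), y = labels,
    family = "multinomial", alpha = 0.5, nfolds = 10
)
# Apply the fitted calibration to new, uncalibrated predictions.
calibrated <- predict(calibration_fit, newx = as.matrix(new_probs),
    s = "lambda.min", type = "response")[, , 1]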

Making calibrated predictions

Once the calibrations have been derived, calibrated_predictions() calibrates the target predictions:

cal_preds <- calibrated_predictions(REGIONS, LRN_IDS, target_preds, calibrations)

Spatially distributing the predictions

Because this step of the workflow writes a number of files, it is easier to run it within a loop statement. Before we dive into the different stages of said loop, some variables have to be initialized. entropy_rate describes the instability of the predictions from a transition-probability point of view. regional_entropy describes the instability of the predictions over the entire prediction region.

entropy_rate <- rep(list(rep(NA, length(LRN_IDS))), length(REGIONS))
names(entropy_rate) <- REGIONS
for (region in REGIONS) names(entropy_rate[[region]]) <- LRN_IDS

regional_entropy <- rep(list(rep(NA, length(LRN_IDS) + 1)), length(REGIONS))
names(regional_entropy) <- REGIONS
for (region in REGIONS) names(regional_entropy[[region]]) <- c("training", LRN_IDS)

We can then enter the loop over the areas of study:

for (region in REGIONS){

First we retrieve the target streamlines with:

.target_streamlines <- get_target_streamlines(region)

Then we enter the loop over the ML models:

    for (lrn in LRN_IDS){
        target_streamlines <- .target_streamlines
        predictions_df <- as.data.frame(cal_preds[[region]][[lrn]])

If there are post-hoc heuristics, this is where one would implement them. For example, one heuristic could be to assign a class based on Strahler order, here the class K03 in region K.

        if (region == "K"){
            predictions_df[which(target_streamlines$StreamOrde >= 5), grepl("K03", colnames(predictions_df))] <- 1
            rSum <- rowSums(predictions_df)
            predictions_df <- sweep(predictions_df, 1, rSum,'/')
        }

A good check to implement is to ensure that all predictions sum to one (bar some rounding errors).

        check <- apply(predictions_df, 1, sum)
        stopifnot(all(abs(check - 1) <= 1E-9))

We can retrieve the response as being the class with the highest probability. Another sanity check at this step is to investigate the distribution of the predicted classes over the entire region. Such a check can uncover early on (that is before any mapping) that some classes are under-represented in the predictions.

        lvls <- gsub("prob.", "", colnames(predictions_df))
        target_predictions <- apply(predictions_df, MARGIN = 1, FUN = function(row){
            lvls[which.max(as.numeric(row))]
        })
        target_predictions <- as.factor(target_predictions)
        print(table(target_predictions))

From the probabilistic predictions, we can derive a number of metrics of their distribution at the regional level with shannon_weiner(), richness() and simpson_evenness(). Again, this is useful at this stage to capture the potential difference between the distribution of the training labels and that of the predictions over the entire region. If the sampling of the training dataset has been correctly stratified beforehand, one would expect that at the regional level the distribution of the predictions is close to the one of the training labels. If that is not the case, it might indicate some issues with the generalization of the learned patterns.
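
For reference, these diversity metrics typically follow the standard definitions sketched below. This is an assumption about what the package functions compute, not their actual implementation; x is a hypothetical vector of predicted classes.

# Standard diversity definitions (illustrative sketches only).
shannon_sketch <- function(x){
    p <- prop.table(table(as.character(x)))  # observed class proportions
    -sum(p * log2(p))                        # Shannon-Wiener diversity in bits
}
richness_sketch <- function(x) length(unique(as.character(x)))  # number of observed classes
evenness_sketch <- function(x){
    p <- prop.table(table(as.character(x)))
    (1 / sum(p^2)) / length(p)               # inverse Simpson index divided by richness
}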

        shannon_network_training <- shannon_weiner(training_preds[[region]]$labels, units = 2)
        richness_network_training <- richness(training_preds[[region]]$labels)
        simpson_evenness_network_training <- simpson_evenness(training_preds[[region]]$labels)

        shannon_network_target <- shannon_weiner(target_predictions, units = 2)
        richness_network_target <- richness(target_predictions)
        simpson_evenness_network_target <- simpson_evenness(target_predictions)

        metrics_df <- data.frame(training = c(richness_network_training, simpson_evenness_network_training, shannon_network_training),
            target = c(richness_network_target, simpson_evenness_network_target, shannon_network_target))
        rownames(metrics_df) <- c("Richness", "Simpson's evenness", "Shannon-Weiner's value")
        print(metrics_df)
        if (richness_network_target != richness_network_training) warning("One of the classes is not predicted at all over the whole network!")

These metrics can be extended at the stream interval level, that is at the instance level.

        shannon_segments <- apply(predictions_df, MARGIN = 1, FUN = shannon_weiner, .prob_bool = TRUE, units = 2)
        richness_segments <- apply(predictions_df, MARGIN = 1, FUN = richness)
        evenness_segments <- apply(predictions_df, MARGIN = 1, FUN = simpson_evenness, .prob_bool = TRUE)

We can now assign these values to the target streamlines and save the results into a .shp shapefile.

        target_streamlines$group <- target_predictions
        target_streamlines$shannon <- shannon_segments
        target_streamlines$richness <- richness_segments
        target_streamlines$evenness <- evenness_segments
        nc <- ncol(target_streamlines)
        target_streamlines@data <- data.frame(target_streamlines@data, predictions_df)
        names(target_streamlines) <- c(head(names(target_streamlines), nc), lvls)

        fname <- file.path(PATH, region, paste0(region, '_', gsub("classif.", "", lrn), '_calibrated_predictions'))
        if (file.exists(paste0(fname, ".shp"))){
            # remove any previous export so the attribute names are written cleanly
            file.remove(paste0(fname, c(".shp", ".prj", ".shx", ".dbf")))
        }
        shapefile(target_streamlines, paste0(fname, ".shp"), overwrite = TRUE)

We now turn to the calculation of the entropy rate, which is derived from the transition probability matrix that we first have to compute. We first turn the target_streamlines into a network that we can work with using the igraph package. This allows us to derive the incident edges at each vertex, and thus which class is connected to which other. Averaging this connectivity matrix over the entire network gives the probability of transition between the different classes. We can readily use this transition probability matrix to derive the entropy rate, that is, the limit value of entropy (or surprise) as one walks along the stream network.
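
For reference, the entropy rate of a Markov chain with transition probability matrix P and stationary distribution pi is H = -sum_i pi_i sum_j P_ij log2(P_ij). A minimal sketch of this computation, assuming ccber::CalcMarkovEntropyRate (used below) follows this standard definition:

# Entropy rate of a Markov chain, with the convention 0 * log2(0) = 0
# (illustrative sketch only).
markov_entropy_rate_sketch <- function(P, pi){
    logP <- ifelse(P > 0, log2(P), 0)
    -sum(pi * rowSums(P * logP))
}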

        numeric_group <- as.factor(as.integer(as.numeric(target_streamlines$group)))
        fname <- file.path(PATH, region, paste0(lrn, ".Rds"))
        streamlines_network <- sfnetworks::as_sfnetwork(sf::st_as_sf(target_streamlines), directed = FALSE)
        igraph_object <- igraph::as.igraph(streamlines_network)
        ie <- igraph::incident_edges(igraph_object, igraph::V(igraph_object))
        ie_list <- lapply(ie, function(es) as.vector(table(numeric_group[igraph::as_ids(es)])))
        ie_df <- do.call(rbind, ie_list)
        ie_df <- ie_df[which(rowSums(ie_df) >= 2), ]

        ie_count <- lapply(seq(ncol(ie_df)), function(i){
            .df <- ie_df[which(ie_df[, i] > 0), ]
            sapply(seq(ncol(ie_df)), function(j){
                ifelse(i==j, sum(.df[.df[, j] > 1, j]), sum(.df[, j]))
            })
        })
        count_matrix <- do.call(rbind, ie_count)
        prob_matrix <- sweep(count_matrix, MARGIN = 1, rowSums(count_matrix), "/")
        saveRDS(prob_matrix, fname)

        H <- ccber::CalcMarkovEntropyRate(prob_matrix, ccber::CalcEigenStationary(prob_matrix))
        entropy_rate[[region]][[lrn]] <- H
        regional_entropy[[region]]$training <- shannon_network_training
        regional_entropy[[region]][[lrn]] <- shannon_network_target

Let’s not forget to exit the loop over learners and the one over regions:

    }
}

Finally, the get_entropy_df() function provides some formatting for the entropy_rate list.
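
For illustration, the sketch below shows the kind of reshaping this implies (not the package implementation): flattening the nested entropy_rate list into a long data frame with one row per region and learner.

# Flatten the region x learner entropy_rate list into a long data frame
# (illustrative sketch only).
entropy_df <- do.call(rbind, lapply(names(entropy_rate), function(region){
    data.frame(
        region = region,
        learner = names(entropy_rate[[region]]),
        entropy_rate = unlist(entropy_rate[[region]], use.names = FALSE),
        row.names = NULL
    )
}))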

References

Adnan, Md Nasim, and Md Zahidul Islam. 2015. “One-Vs-All Binarization Technique in the Context of Random Forest.” In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 385–90.

DeGroot, Morris H, and Stephen E Fienberg. 1983. “The Comparison and Evaluation of Forecasters.” Journal of the Royal Statistical Society: Series D (the Statistician) 32 (1-2): 12–22.

Dong, Lin, Eibe Frank, and Stefan Kramer. 2005. “Ensembles of Balanced Nested Dichotomies for Multi-Class Problems.” In European Conference on Principles of Data Mining and Knowledge Discovery, 84–95. Springer.

Kijsirikul, B., and N. Ussivakul. 2002. “Multiclass Support Vector Machines Using Adaptive Directed Acyclic Graph.” In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290). IEEE. https://doi.org/10.1109/ijcnn.2002.1005608.

Lorena, Ana Carolina, and André C. P. L. F. de Carvalho. 2010. “Building Binary-Tree-Based Multiclass Classifiers Using Separability Measures.” Neurocomputing 73 (16-18): 2837–45. https://doi.org/10.1016/j.neucom.2010.03.027.

Melnikov, Vitalik, and Eyke Hüllermeier. 2018. “On the Effectiveness of Heuristics for Learning Nested Dichotomies: An Empirical Analysis.” Machine Learning, June. https://doi.org/10.1007/s10994-018-5733-1.

Niculescu-Mizil, Alexandru, and Rich Caruana. 2005. “Predicting Good Probabilities with Supervised Learning.” In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05. ACM Press. https://doi.org/10.1145/1102351.1102430.

Platt, John C, Nello Cristianini, and John Shawe-Taylor. 2000. “Large Margin Dags for Multiclass Classification.” In Advances in Neural Information Processing Systems, 547–53.

Platt, John, and others. 1999. “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods.” Advances in Large Margin Classifiers 10 (3): 61–74.

Quiterio, Thaise M, and Ana C Lorena. 2016. “Determining the Structure of Decision Directed Acyclic Graphs for Multiclass Classification Problems.” In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 115–20. IEEE.

Zadrozny, Bianca. 2002. “Reducing Multiclass to Binary by Coupling Probability Estimates.” In Advances in Neural Information Processing Systems, 1041–8.

Zadrozny, Bianca, and Charles Elkan. 2002. “Transforming Classifier Scores into Accurate Multiclass Probability Estimates.” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–99. ACM.