making_predictions.Rmd
The first step in making predictions is retrieving the trained models and parameters from the benchmark. Since users are likely to run more than one iteration of a benchmark (or, in general, more than one benchmark), this is handled by the `PATH` variable, similar to the `PATH` specified when training the models.

The `PATH` directory should contain a `PARAMETERS.Rds` file corresponding to the final model parameters. We can retrieve the parameters we need from it with:
```r
PARAMETERS <- readRDS(file.path(PATH, "PARAMETERS.Rds"))
PROB    <- PARAMETERS$PROB
PREPROC <- PARAMETERS$PREPROC
REGIONS <- PARAMETERS$REGIONS
LRN_IDS <- PARAMETERS$LRN_IDS
```
We also have to specify whether feature selection was activated. If so, set `TUNE_FS <- TRUE` and edit the `PATH` variable accordingly. Additionally, we perform some cosmetic changes to the `.id` variables and filter for the learners that have a final model.
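A purely illustrative sketch of these adjustments follows; the subdirectory name, the `.id` clean-up, and the file naming are hypothetical and depend on how the benchmark was set up:

```r
TUNE_FS <- TRUE

# Hypothetical layout: feature-selection runs live in their own subdirectory
if (TUNE_FS) PATH <- file.path(PATH, "feature_selection")

# Cosmetic change to the learner .id variables (illustrative)
LRN_IDS <- gsub(".tuned", "", LRN_IDS, fixed = TRUE)

# Keep only the learners for which a final model file exists (assumed naming scheme)
has_model <- vapply(LRN_IDS,
                    function(id) file.exists(file.path(PATH, paste0(id, ".Rds"))),
                    logical(1))
LRN_IDS <- LRN_IDS[has_model]
```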
Obtaining the predictions for the training dataset is handled by `get_predictions()` with `TARGET = FALSE`:
```r
training_preds <- get_predictions(REGIONS, LRN_IDS, PATH, TARGET = FALSE, TUNE_FS)
```
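A quick way to get a feel for the returned object is to inspect its top-level structure (the exact structure depends on the package internals):

```r
# Inspect the predictions returned for the first region, one level deep
str(training_preds[[REGIONS[1]]], max.level = 1)
```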
Obtaining the predictions for the target dataset is handled by `get_predictions()` with `TARGET = TRUE`:
```r
target_preds <- get_predictions(REGIONS, LRN_IDS, PATH, TARGET = TRUE, TUNE_FS)
```
Most machine learning models output posterior probabilities that often require calibration. Such calibration corrects for the potential distortion of the posterior probabilities when compared to empirical probabilities and improves model performance (DeGroot and Fienberg 1983; Zadrozny 2002; Niculescu-Mizil and Caruana 2005). Given the sigmoid shape of most distortions, Platt and others (1999) proposed a sigmoid calibration to address this effect, the so-called Platt's scaling. Other useful approaches include Bayesian calibration and isotonic regression (Zadrozny and Elkan 2002), which both require binarizing the problem. Useful references on binarization include Platt, Cristianini, and Shawe-Taylor (2000); Kijsirikul and Ussivakul (2002); Dong, Frank, and Kramer (2005); Lorena and Carvalho (2010); Quiterio and Lorena (2016); Adnan and Islam (2015); Melnikov and Hüllermeier (2018). In this study, posterior calibration is performed with a multinomial regression, a straightforward extension of the binomial case corresponding to the logistic Platt's scaling. We use the R package `glmnet` to fit a generalized linear model with an elastic net penalty and 10-fold cross-validation, via the function `get_calibrations()`.
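For illustration, a minimal sketch of such a multinomial, Platt-style calibration for a single region and learner could look as follows; this is not the exact internals of `get_calibrations()`, and the structure assumed for `training_preds` is only indicative:

```r
library(glmnet)

# Assumed structure: per-region list with per-learner probability matrices and observed labels
probs  <- as.matrix(training_preds[[REGIONS[1]]][[LRN_IDS[1]]])
labels <- training_preds[[REGIONS[1]]]$labels

# Multinomial regression on the raw posterior probabilities,
# with an elastic net penalty and 10-fold cross-validation
cal_fit <- glmnet::cv.glmnet(x = probs, y = labels,
                             family = "multinomial",
                             alpha  = 0.5,   # illustrative elastic net mixing parameter
                             nfolds = 10)
```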
The option `plot = TRUE` returns a calibration plot which highlights the amount of calibration per class. It is always a good sign when the predictions do not require a large amount of correction, and this serves as a qualitative check on the performance of one or several models. Finally, note that the calibrations are performed with the original, and possibly unequally sampled, training dataset.
```r
calibrations <- get_calibrations(REGIONS, LRN_IDS, training_preds, plot = TRUE)
```
Once the `calibrations` have been derived, `calibrated_predictions()` calibrates the target predictions:

```r
cal_preds <- calibrated_predictions(REGIONS, LRN_IDS, target_preds, calibrations)
```
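For reference, applying such a calibration fit to new probabilities boils down to a `predict()` call; the sketch below is hypothetical, and `calibrated_predictions()` wraps this logic over all regions and learners:

```r
# Assumed structure of target_preds, mirroring training_preds
new_probs <- as.matrix(target_preds[[REGIONS[1]]][[LRN_IDS[1]]])

# For multinomial fits, predict() returns an n x K x 1 array of calibrated probabilities
cal_probs <- predict(cal_fit, newx = new_probs, s = "lambda.min", type = "response")
cal_probs <- drop(cal_probs)
```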
Because this step of the workflow writes a number of files, it is easier to run it within a loop. Before we dive into the different stages of said loop, some variables have to be initialized: `entropy_rate` describes the instability of the predictions from a transition-probability point of view, while `regional_entropy` describes the instability of the predictions over the entire prediction region.
```r
entropy_rate <- rep(list(rep(NA, length(LRN_IDS))), length(REGIONS))
names(entropy_rate) <- REGIONS
for (region in REGIONS) names(entropy_rate[[region]]) <- LRN_IDS

regional_entropy <- rep(list(rep(NA, length(LRN_IDS) + 1)), length(REGIONS))
names(regional_entropy) <- REGIONS
for (region in REGIONS) names(regional_entropy[[region]]) <- c("training", LRN_IDS)
```
We can then enter the loop over the areas of study:
```r
for (region in REGIONS){
```
First we retrieve the target streamlines with:
```r
.target_streamlines <- get_target_streamlines(region)
```
Then we enter the loop over the ML models:
```r
for (lrn in LRN_IDS){
  target_streamlines <- .target_streamlines
  predictions_df <- as.data.frame(cal_preds[[region]][[lrn]])
```
If there are post-hoc heuristics, this is where one would implement them. For example, one heuristic could be to assign a class based on the Strahler stream order, here the class `K03` in region `K`.
if (region == "K"){ predictions_df[which(target_streamlines$StreamOrde >= 5), grepl("K03", colnames(predictions_df))] <- 1 rSum <- rowSums(predictions_df) predictions_df <- sweep(predictions_df, 1, rSum,'/') }
A good check to implement at this point is to ensure that all predictions sum to one (barring rounding errors).
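A minimal sketch of such a check on `predictions_df` (the tolerance is arbitrary):

```r
# Every row of the calibrated probabilities should sum to one, up to rounding errors
stopifnot(all(abs(rowSums(predictions_df) - 1) < 1e-6))
```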
We can retrieve the response as the class with the highest probability. Another sanity check at this step is to investigate the distribution of the predicted classes over the entire region. Such a check can uncover early on (that is, before any mapping) that some classes are under-represented in the predictions.
```r
lvls <- gsub("prob.", "", colnames(predictions_df))
target_predictions <- apply(predictions_df, MARGIN = 1, FUN = function(row){
  lvls[which.max(as.numeric(row))]
})
target_predictions <- as.factor(target_predictions)
print(table(target_predictions))
```
From the probabilistic predictions, we can derive a number of metrics of their distribution at the regional level with `shannon_weiner()`, `richness()` and `simpson_evenness()`. Again, this is useful at this stage to capture potential differences between the distribution of the training labels and that of the predictions over the entire region. If the sampling of the training dataset has been correctly stratified beforehand, one would expect the distribution of the predictions at the region level to be close to that of the training labels. If that is not the case, it might indicate issues with the generalization of the learned patterns.
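For reference, assuming the usual definitions (which we take these helpers to implement), with class proportions $p_i$ over the $S$ classes actually present, richness is simply $S$, the Shannon-Wiener index is $H' = -\sum_{i=1}^{S} p_i \log_2 p_i$ (hence `units = 2`), and Simpson's evenness is $E = \frac{1}{S \sum_{i=1}^{S} p_i^2}$.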
```r
shannon_network_training <- shannon_weiner(training_preds[[region]]$labels, units = 2)
richness_network_training <- richness(training_preds[[region]]$labels)
simpson_evenness_network_training <- simpson_evenness(training_preds[[region]]$labels)

shannon_network_target <- shannon_weiner(target_predictions, units = 2)
richness_network_target <- richness(target_predictions)
simpson_evenness_network_target <- simpson_evenness(target_predictions)

metrics_df <- data.frame(
  training = c(richness_network_training, simpson_evenness_network_training, shannon_network_training),
  target   = c(richness_network_target, simpson_evenness_network_target, shannon_network_target)
)
rownames(metrics_df) <- c("Richness", "Simpson's evenness", "Shannon-Weiner's value")
print(metrics_df)

if (richness_network_target != richness_network_training)
  warning("One of the classes is not predicted at all over the whole network!")
```
These metrics can be extended to the stream-interval level, that is, the instance level.
```r
shannon_segments  <- apply(predictions_df, MARGIN = 1, FUN = shannon_weiner, .prob_bool = TRUE, units = 2)
richness_segments <- apply(predictions_df, MARGIN = 1, FUN = richness)
evenness_segments <- apply(predictions_df, MARGIN = 1, FUN = simpson_evenness, .prob_bool = TRUE)
```
We can now assign these values to the target streamlines and save the results as a `.shp` shapefile.
```r
target_streamlines$group    <- target_predictions
target_streamlines$shannon  <- shannon_segments
target_streamlines$richness <- richness_segments
target_streamlines$evenness <- evenness_segments

nc <- ncol(target_streamlines)
target_streamlines@data <- data.frame(target_streamlines@data, predictions_df)
names(target_streamlines) <- c(head(names(target_streamlines), nc), lvls)

fname <- file.path(paste0(PATH, "/", region),
                   paste0(region, '_', gsub("classif.", "", lrn), '_calibrated_predictions'))

# Remove any previous export so that the attribute names are rewritten cleanly
if (file.exists(paste0(fname, ".shp"))){
  file.remove(paste0(fname, ".shp"))
  file.remove(paste0(fname, ".prj"))
  file.remove(paste0(fname, ".shx"))
  file.remove(paste0(fname, ".dbf"))
}

shapefile(target_streamlines[ind, ], paste0(fname, ".shp"), overwrite = TRUE)
```
We now turn to the calculation of the entropy rate, which is derived from the transition probability matrix that we first have to compute. We begin by turning `target_streamlines` into a network that we can work on with the `igraph` package. This allows us to derive the incident edges at each vertex, and thus which class is connected to which other. When we average this connectivity over the entire network, we obtain the probability of transition between the different classes. We can readily use this transition probability matrix to derive the entropy rate, that is, the limit value of the entropy (or surprise) as one walks along the stream network.
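Formally, and up to the choice of logarithm base, for a transition probability matrix $P$ with stationary distribution $\pi$ (the eigenvector computed by `ccber::CalcEigenStationary()`), the entropy rate is

$$H = -\sum_i \pi_i \sum_j P_{ij} \log P_{ij},$$

which is what we take `ccber::CalcMarkovEntropyRate()` to evaluate below.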
```r
numeric_group <- as.factor(as.integer(as.numeric(target_streamlines$group)))
fname <- file.path(paste0(PATH, "/", region), paste0(lrn, ".Rds"))

# Build an undirected network from the streamlines
streamlines_network <- sfnetworks::as_sfnetwork(sf::st_as_sf(target_streamlines), directed = FALSE)
igraph_object <- igraph::as.igraph(streamlines_network)

# Count, at each vertex, how many incident edges belong to each class
ie <- igraph::incident_edges(igraph_object, igraph::V(igraph_object))
ie_list <- lapply(ie, function(es) as.vector(table(numeric_group[igraph::as_ids(es)])))
ie_df <- do.call(rbind, ie_list)
ie_df <- ie_df[which(rowSums(ie_df) >= 2), ]

# Aggregate the vertex-level counts into a class-to-class transition count matrix
ie_count <- lapply(seq(ncol(ie_df)), function(i){
  .df <- ie_df[which(ie_df[, i] > 0), ]
  sapply(seq(ncol(ie_df)), function(j){
    ifelse(i == j, sum(.df[.df[, j] > 1, j]), sum(.df[, j]))
  })
})
count_matrix <- do.call(rbind, ie_count)
prob_matrix <- sweep(count_matrix, MARGIN = 1, rowSums(count_matrix), "/")
saveRDS(prob_matrix, fname)

# Entropy rate from the transition probabilities and their stationary distribution
H <- ccber::CalcMarkovEntropyRate(prob_matrix, ccber::CalcEigenStationary(prob_matrix))
entropy_rate[[region]][[lrn]] <- H
regional_entropy[[region]]$training <- shannon_network_training
regional_entropy[[region]][[lrn]] <- shannon_network_target
```
Let’s not forget to exit the loop over learners and the one over regions:
```r
} # end of the loop over learners
} # end of the loop over regions
```
Finally, the `get_entropy_df()` function provides some formatting for the `entropy_rate` list.
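If one prefers to do this formatting by hand, a minimal base R sketch (not the implementation of `get_entropy_df()`) could be:

```r
# Flatten the nested entropy_rate list into a long data frame
entropy_df <- do.call(rbind, lapply(names(entropy_rate), function(region){
  data.frame(region  = region,
             learner = names(entropy_rate[[region]]),
             entropy_rate = unlist(entropy_rate[[region]]),
             row.names = NULL)
}))
```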
Adnan, Md Nasim, and Md Zahidul Islam. 2015. “One-Vs-All Binarization Technique in the Context of Random Forest.” In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 385–90.
DeGroot, Morris H, and Stephen E Fienberg. 1983. “The Comparison and Evaluation of Forecasters.” Journal of the Royal Statistical Society: Series D (the Statistician) 32 (1-2): 12–22.
Dong, Lin, Eibe Frank, and Stefan Kramer. 2005. “Ensembles of Balanced Nested Dichotomies for Multi-Class Problems.” In European Conference on Principles of Data Mining and Knowledge Discovery, 84–95. Springer.
Kijsirikul, B., and N. Ussivakul. 2002. "Multiclass Support Vector Machines Using Adaptive Directed Acyclic Graph." In Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), Cat. No. 02CH37290. IEEE. https://doi.org/10.1109/ijcnn.2002.1005608.
Lorena, Ana Carolina, and André C. P. L. F. de Carvalho. 2010. “Building Binary-Tree-Based Multiclass Classifiers Using Separability Measures.” Neurocomputing 73 (16-18): 2837–45. https://doi.org/10.1016/j.neucom.2010.03.027.
Melnikov, Vitalik, and Eyke Hüllermeier. 2018. “On the Effectiveness of Heuristics for Learning Nested Dichotomies: An Empirical Analysis.” Machine Learning, June. https://doi.org/10.1007/s10994-018-5733-1.
Niculescu-Mizil, Alexandru, and Rich Caruana. 2005. "Predicting Good Probabilities with Supervised Learning." In Proceedings of the 22nd International Conference on Machine Learning (ICML '05). ACM Press. https://doi.org/10.1145/1102351.1102430.
Platt, John C, Nello Cristianini, and John Shawe-Taylor. 2000. "Large Margin DAGs for Multiclass Classification." In Advances in Neural Information Processing Systems, 547–53.
Platt, John, and others. 1999. “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods.” Advances in Large Margin Classifiers 10 (3): 61–74.
Quiterio, Thaise M, and Ana C Lorena. 2016. "Determining the Structure of Decision Directed Acyclic Graphs for Multiclass Classification Problems." In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 115–20. IEEE.
Zadrozny, Bianca. 2002. "Reducing Multiclass to Binary by Coupling Probability Estimates." In Advances in Neural Information Processing Systems, 1041–48.
Zadrozny, Bianca, and Charles Elkan. 2002. "Transforming Classifier Scores into Accurate Multiclass Probability Estimates." In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–99. ACM.