In this chapter, we use the data prepared in Chapter 2 to train a Random Forest model and evaluate the variable importance of the different predictors used in the model. To recap the basics of the Random Forest algorithm and its implementation and use in R, head over to Chapter 11 of the AGDS book.
3.1 Load data
In the previous chapter, we created a data frame that holds information on the soil sampling locations and the covariates that we extracted for these positions. Let's load this data frame into our environment.
3.2 Preparations
Before we can fit the model, we have to specify a few settings. First, we have to specify our target and predictor variables. Then, we have to split our dataset into a training and a testing set. Random Forest models cannot deal with NA values, so we have to remove these from our training set.
In the dataset we work with here, the split into training and testing sets is defined by the column dataset. Usually, this is not the case and we have to split the data ourselves. Have a look at the section on data splitting in the AGDS Book.
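For reference, if the column dataset were not provided, a random split could be created manually. Below is a minimal sketch of such a 75/25 split (not used in this chapter - we stick to the split defined by the dataset column; the object names df_cal and df_val are purely illustrative).
Code
# Hypothetical manual 75/25 split (illustration only)
set.seed(42)                        # for reproducibility of the random split
idx_cal <- sample(seq_len(nrow(df_full)), size = round(0.75 * nrow(df_full)))
df_cal  <- df_full[idx_cal, ]       # calibration (training) set
df_val  <- df_full[-idx_cal, ]      # validation (testing) set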
As predictors we use all the variables for which we have data with full spatial coverage - the basis for spatial upscaling. We extracted these data in Chapter 2 from geospatial files and combined them with the observational data using cbind(); columns 14-104 of the resulting data frame contain the covariate data extracted from the geospatial files. See Chapter 6 for a description of the variables. We use all of them here as predictors for the model.
Code
# Specify target: The pH in the top 10cm
target <- "ph.0.10"

# Specify predictors_all: Remove soil sampling and observational data
predictors_all <- names(df_full)[14:ncol(df_full)]

cat("The target is:", target,
    "\nThe predictors_all are:", paste0(predictors_all[1:8], sep = ", "), "...")
The target is: ph.0.10
The predictors_all are: NegO, PosO, Se_MRRTF2m, Se_MRVBF2m, Se_NO2m_r500, Se_PO2m_r500, Se_SAR2m, Se_SCA2m, ...
Code
# Split dataset into training and testing sets
df_train <- df_full |> dplyr::filter(dataset == "calibration")
df_test  <- df_full |> dplyr::filter(dataset == "validation")

# Filter out any NA to avoid error when running a Random Forest
df_train <- df_train |> tidyr::drop_na()
df_test  <- df_test  |> tidyr::drop_na()

# A little bit of verbose output:
n_tot <- nrow(df_train) + nrow(df_test)

perc_cal <- (nrow(df_train) / n_tot) |> round(2) * 100
perc_val <- (nrow(df_test)  / n_tot) |> round(2) * 100

cat("For model training, we have a calibration / validation split of: ",
    perc_cal, "/", perc_val, "%")
For model training, we have a calibration / validation split of: 75 / 25 %
Alright, this all looks good. We have our target and predictor variables saved for easy access later on, and the 75/25% split of calibration and validation data is as expected. We can now move on to model fitting.
3.3 Model training
The modelling task is to predict the soil pH in the top 10 cm. Let's start with the default hyperparameters of ranger::ranger().
Tip
Have a look at the default values by entering ?ranger::ranger in your console and studying the function documentation.
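You can also inspect the defaults programmatically: formals() returns the default arguments of a function. Below is a quick sketch that picks out a few of them (the help page remains the authoritative reference).
Code
# Show the default values of selected ranger() arguments
formals(ranger::ranger)[c("num.trees", "mtry", "min.node.size", "splitrule")]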
Code
# ranger() crashes when using tibbles, so we are using the
# base R notation to enter the data
rf_basic <- ranger::ranger(
  y = df_train[, target],          # target variable
  x = df_train[, predictors_all],  # Predictor variables
  seed = 42,                       # Specify the seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1  # Use all but one CPU core for quick model training
)

# Print a summary of fitted model
print(rf_basic)
Ranger result
Call:
ranger::ranger(y = df_train[, target], x = df_train[, predictors_all], seed = 42, num.threads = parallel::detectCores() - 1)
Type: Regression
Number of trees: 500
Sample size: 605
Number of independent variables: 91
Mtry: 9
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.321395
R squared (OOB): 0.4560032
Predicting categories with Random Forests
If our target variable were categorical and not continuous, we would have to set the argument probability = TRUE. The output would then be a probability map, with values between 0 and 100% per class.
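As a hedged illustration (not part of our workflow, since our target is continuous), a probability forest could be fitted as sketched below, assuming a hypothetical factor column soil_class existed in df_train.
Code
# Hypothetical example: probability forest for a categorical target
# ('soil_class' is an assumed factor column, not present in our dataset)
rf_prob <- ranger::ranger(
  y = as.factor(df_train$soil_class),  # categorical target
  x = df_train[, predictors_all],      # predictor variables
  probability = TRUE,                  # return class probabilities instead of classes
  seed = 42
)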
Although we only used the default hyperparameters, we already get a fairly good out-of-bag (OOB) \(R^2\) of 0.46 and an MSE of 0.32 (in squared pH units). See here for more background on OOB error estimation with Random Forests.
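These OOB statistics are stored in the fitted model object and can be accessed directly, for example:
Code
# OOB error statistics stored in the ranger model object
rf_basic$prediction.error  # OOB mean squared error
rf_basic$r.squared         # OOB R-squared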
This is the step at which you may want to reduce the number of predictors to avoid collinearity and the risk of overfitting. You may also want to optimize the hyperparameters to improve model performance and generalisability. Different hyperparameter specifications of the Random Forest model, which control model complexity, may be compared. A simple way to do that is to use the {caret} R package, which provides machine learning wrapper functions for hyperparameter tuning (among many other functionalities). Its use in combination with Random Forest is demonstrated in Chapter 11 of the AGDS book. Reducing the number of predictors and retaining only the most important ones is essential for robust model generalisability; this is what we address below.
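As a minimal sketch of such a tuning workflow with {caret} (the grid values below are arbitrary choices for illustration, not tuned recommendations):
Code
# Hyperparameter tuning sketch with {caret}: 5-fold cross-validation over a small grid
mod_cv <- caret::train(
  x = df_train[, predictors_all],
  y = df_train[[target]],
  method = "ranger",
  trControl = caret::trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(
    mtry          = c(5, 9, 20),   # number of variables considered at each split
    splitrule     = "variance",    # split rule for regression
    min.node.size = c(5, 10)       # minimum node size
  ),
  num.trees = 500
)
mod_cv$bestTune  # best hyperparameter combination found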
3.4 Variable importance
Our model uses 91 predictor variables, but we don't know anything about their role in influencing the model predictions and how important they are for achieving good predictions. ranger::ranger() provides a built-in functionality for quantifying variable importance based on the OOB error. This functionality is controlled with the argument importance. When set to 'permutation', the algorithm randomly permutes the values of each variable, one at a time, measures the importance as the resulting decrease in the OOB prediction skill of each decision tree within the Random Forest, and returns the average across all decision trees. Note that this is a model-specific variable importance quantification method. In AGDS Book Chapter 12, you have learned about a model-agnostic method.
The model object returned by the ranger() function stores the variable importance information. The code below accesses this information and sorts the predictor variables by decreasing importance. If the code runs slowly, you can also use the faster impurity method (see more information here).
Code
# Let's run the basic model again but with recording the variable importance
rf_basic <- ranger::ranger(
  y = df_train[, target],          # target variable
  x = df_train[, predictors_all],  # Predictor variables
  importance = "permutation",      # Pick permutation to calculate variable importance
  seed = 42,                       # Specify seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1  # Use all but one CPU core for quick model training
)

# Extract the variable importance and create a long tibble
vi_rf_basic <- rf_basic$variable.importance |>
  dplyr::bind_rows() |>
  tidyr::pivot_longer(cols = dplyr::everything(), names_to = "variable")

# Plot variable importance, ordered by decreasing value
gg <- vi_rf_basic |>
  ggplot2::ggplot(ggplot2::aes(x = reorder(variable, value), y = value)) +
  ggplot2::geom_bar(stat = "identity", fill = "grey50", width = 0.75) +
  ggplot2::labs(y = "Change in OOB MSE after permutation",
                x = "",
                title = "Variable importance based on OOB") +
  ggplot2::theme_classic() +
  ggplot2::coord_flip()

# Display plot
gg
What do we see here? The higher the value, the stronger the effect of permuting that variable on the model performance, and hence the more important the variable. The five most important variables are the following:
Importance rank   Variable name    Description
1                 mt_rr_y          Mean annual precipitation
2                 mt_tt_y          Mean annual temperature
3                 mt_td_y          Mean annual dew point temperature
4                 mt_gh_y          Mean annual incoming radiation
5                 be_gwn25_vdist   Horizontal distance to water body at 25m resolution
We find that the mean annual precipitation is by far the most important variable in determining soil pH in our model. From a soil-forming perspective, this seems plausible (Dawson, 1977). We further find that the four most important variables all describe climate - reflecting its important role as a soil-forming factor. Most of the remaining variables are metrics of the topography. It should also be noted that many of them may be correlated, since some of them measure the same aspect of topography but are derived from digital elevation models at different spatial resolutions (see Chapter 6). Due to this potential correlation, dropping one of the affected variables from the model may not lead to a strong deterioration of the model skill, as its (correlated) information is still contained in the remaining variables.
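To check this, one could inspect the pairwise correlations among, for example, the five most important predictors listed above (a quick sketch, assuming these covariates are numeric):
Code
# Pairwise correlations among the five most important predictors
top_vars <- c("mt_rr_y", "mt_tt_y", "mt_td_y", "mt_gh_y", "be_gwn25_vdist")
round(cor(df_train[, top_vars]), 2)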
3.5 Variable selection
The large number of variables in our model, the fact that many of them exhibit a low importance compared to the dominating few, and their potential correlation call for a variable selection. Reducing the number of predictors reduces the risk that the remaining predictors are correlated. Having correlated predictors is a problem - as shown in the context of spatial upscaling by Ludwig et al. (2023). Intuitively, this makes sense: if \(x_i\) and \(x_j\) are correlated, then, for example, \(x_i\) is used for modelling its true association with the target variable, while \(x_j\) can be "spent" to model randomly occurring covariations with the target - potentially modelling noise in the data. If this happens, overfitting will follow (see here).
Different strategies exist for reducing the number of predictors while retaining model performance and improving model generalisability. Greedy search, or stepwise regression, is often used. The approach is to sequentially add (stepwise forward) or remove (stepwise backward) predictors and to determine the best (complemented or reduced) set of predictors in terms of model performance at each step. The algorithm stops once the model starts to deteriorate or stops improving. However, it should be noted that these algorithms don't assess all possible combinations of predictors and may thus not find the "globally" optimal model. A stepwise regression was implemented in AGDS I as a Report Exercise (see here); a rough sketch of the idea is shown below.
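As a rough sketch of the stepwise-forward idea, here it is applied to a linear model using base R's step() (an illustration only, not the approach we take for the Random Forest; it may take a while to run with 91 candidate predictors):
Code
# Sketch: forward stepwise selection with a linear model, based on AIC
df_sel  <- df_train[, c(target, predictors_all)]
lm_null <- lm(reformulate("1", response = as.name(target)), data = df_sel)
lm_full <- lm(reformulate(predictors_all, response = as.name(target)), data = df_sel)
lm_step <- step(
  lm_null,
  scope = list(lower = lm_null, upper = lm_full),
  direction = "forward",
  trace = FALSE
)
length(attr(terms(lm_step), "term.labels"))  # number of predictors retained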
Tip
To dig deeper into understanding how the model works, we could further investigate its partial dependence plots (see here).
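A manual sketch of a partial dependence curve for the most important variable is given below; in practice, dedicated packages such as {pdp} provide this functionality (the 20-point grid is an arbitrary choice).
Code
# Manual sketch of a partial dependence curve for mean annual precipitation
x_grid <- seq(min(df_train$mt_rr_y), max(df_train$mt_rr_y), length.out = 20)
pd <- sapply(x_grid, function(v) {
  df_tmp <- as.data.frame(df_train[, predictors_all])
  df_tmp$mt_rr_y <- v                                  # fix the variable of interest
  mean(predict(rf_basic, data = df_tmp)$predictions)   # average prediction over all observations
})
plot(x_grid, pd, type = "l",
     xlab = "mt_rr_y (mean annual precipitation)",
     ylab = "Predicted pH (partial dependence)")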
An alternative approach to model selection is to consider the variable importance. Predictors may be selected based on whether removing their association with the target variable (by permuting the values of the predictor) deteriorates the model. Additionally, a decision criterion can be introduced for determining whether or not to retain the respective variable. This is implemented by the "Boruta algorithm" - an effective and popular approach to variable selection (Kursa & Rudnicki, 2010). Boruta is available as the R package {Boruta}, is based on Random Forests, and performs a permutation of variables for determining their importance - as described in Chapter 12 of the AGDS Book for model-agnostic variable importance estimation. The algorithm finally categorizes variables into "Rejected", "Tentative", and "Confirmed". Let's apply Boruta on our data.
Code
set.seed(42)

# run the algorithm
bor <- Boruta::Boruta(
  y = df_train[, target],
  x = df_train[, predictors_all],
  maxRuns = 50,  # Number of iterations. Set to 30 or lower if it takes too long
  num.threads = parallel::detectCores() - 1
)

# obtain results: a data frame with all variables, ordered by their importance
df_bor <- Boruta::attStats(bor) |>
  tibble::rownames_to_column() |>
  dplyr::arrange(dplyr::desc(meanImp))

# plot the importance result
ggplot2::ggplot(
  ggplot2::aes(x = reorder(rowname, meanImp), y = meanImp, fill = decision),
  data = df_bor
) +
  ggplot2::geom_bar(stat = "identity", width = 0.75) +
  ggplot2::scale_fill_manual(values = c("grey30", "tomato", "grey70")) +
  ggplot2::labs(y = "Variable importance", x = "",
                title = "Variable importance based on Boruta") +
  ggplot2::theme_classic() +
  ggplot2::coord_flip()
Tip
Determine the number \(N\) of predictors deemed important ("Confirmed") by Boruta and compare these "important" variables with the \(N\) most important variables of the OOB-based variable importance estimation demonstrated above.
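One possible way to approach this comparison, using the objects created above (a sketch):
Code
# Compare Boruta-confirmed variables with the top-N variables from the OOB-based importance
vars_boruta <- df_bor |>
  dplyr::filter(decision == "Confirmed") |>
  dplyr::pull(rowname)
n <- length(vars_boruta)

vars_oob <- vi_rf_basic |>
  dplyr::arrange(dplyr::desc(value)) |>
  dplyr::slice_head(n = n) |>
  dplyr::pull(variable)

sum(vars_oob %in% vars_boruta)  # number of variables appearing in both sets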
For the spatial upscaling in the context of digital soil mapping, let's retain only the variables deemed important ("Confirmed") by the Boruta algorithm and retrain a final Random Forest model. The number of retained variables is 32.
Code
# get retained important variables
predictors_selected <- df_bor |>
  dplyr::filter(decision == "Confirmed") |>
  dplyr::pull(rowname)

length(predictors_selected)
[1] 32
Code
# re-train Random Forest model
rf_bor <- ranger::ranger(
  y = df_train[, target],               # target variable
  x = df_train[, predictors_selected],  # Predictor variables
  seed = 42,                            # Specify the seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1  # Use all but one CPU core for quick model training
)

# quick report and performance of trained model object
rf_bor
Ranger result
Call:
ranger::ranger(y = df_train[, target], x = df_train[, predictors_selected], seed = 42, num.threads = parallel::detectCores() - 1)
Type: Regression
Number of trees: 500
Sample size: 605
Number of independent variables: 32
Mtry: 5
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.3121996
R squared (OOB): 0.4715675
Tip
Compare the skill of the model with all predictors to that of the model with the Boruta-informed reduced set of predictors. What is happening?
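A quick way to make this comparison is to read the OOB statistics from the two model objects (a sketch):
Code
# Compare OOB skill of the model with all predictors and the Boruta-reduced model
tibble::tibble(
  model   = c("All predictors", "Boruta-selected predictors"),
  oob_mse = c(rf_basic$prediction.error, rf_bor$prediction.error),
  oob_r2  = c(rf_basic$r.squared, rf_bor$r.squared)
)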
Save the model object fitted with the reduced set of predictors, along with the calibration and validation data, for the subsequent chapter.
Code
# Save relevant data for model testing in the next chapter.
saveRDS(rf_bor, here::here("data/rf_for_ph0-10.rds"))
saveRDS(df_train[, c(target, predictors_selected)], here::here("data/cal_for_ph0-10.rds"))
saveRDS(df_test[, c(target, predictors_selected)], here::here("data/val_for_ph0-10.rds"))
Dawson, G. A. (1977). Atmospheric ammonia from undisturbed land. Journal of Geophysical Research (1896-1977), 82(21), 3125–3133. https://doi.org/10.1029/JC082i021p03125
Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1–13. https://doi.org/10.18637/jss.v036.i11
Ludwig, M., Moreno-Martinez, A., Hölzel, N., Pebesma, E., & Meyer, H. (2023). Assessing and improving the transferability of current global spatial prediction models. Global Ecology and Biogeography, 32(3), 356–368. https://doi.org/10.1111/geb.13635