5  Report Exercise

This tutorial demonstrated digital soil mapping for a continuous variable, topsoil pH. The observational data in the file data-raw/soildata/berne_soil_sampling_locations.csv also contain categorical variables. The variables waterlog.* indicate whether the soil was waterlogged at different depths (30, 50, and 100 cm). Each is either TRUE (encoded in the model as 1) or FALSE.

5.1 Simple model

Re-implement the digital soil mapping workflow, using Random Forest, as demonstrated in this tutorial, but for the binary categorical variable waterlog.100. Here are a few hints as a guide:

  • Make sure that the categorical target variable is encoded as a factor using the function factor().
  • Start with a model that includes all predictors, trained on the pre-defined training subset.
  • Evaluate the model on the testing subset of the data. Consider appropriate metrics as described in AGDS Book Chapter 8.3. Is the data balanced in terms of observed TRUE and FALSE values? What does this imply for the interpretation of the different metrics?
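
The hints above can be sketched as follows. This is a minimal sketch, not a reference solution. It uses synthetic data instead of the Bern soil data, and the covariate names x1 to x3, the random 70/30 split, and the chosen metrics are assumptions made for illustration. In the exercise, use the pre-defined training and testing subsets instead.

```r
library(ranger)

set.seed(42)

# Synthetic stand-in for the soil data (NOT the Bern dataset):
# two informative covariates, one noise covariate, imbalanced binary target.
n <- 600
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
p_true <- plogis(-1 + 1.5 * df$x1 - 1.0 * df$x2)

# Encode the binary target as a factor so that ranger fits a classification forest
df$waterlog.100 <- factor(rbinom(n, 1, p_true) == 1, levels = c(FALSE, TRUE))

# Hypothetical random split; the exercise provides a pre-defined split instead
idx_train <- sample(n, size = 420)
df_train <- df[idx_train, ]
df_test <- df[-idx_train, ]

# Model with all predictors and default hyperparameters
rf <- ranger(waterlog.100 ~ ., data = df_train, num.trees = 500, seed = 42)

# Predicted classes on the testing subset
pred <- predict(rf, data = df_test)$predictions

# Confusion matrix: rows are predictions, columns are observations
conf <- table(predicted = pred, observed = df_test$waterlog.100)
tp <- conf["TRUE", "TRUE"]
tn <- conf["FALSE", "FALSE"]
fp <- conf["TRUE", "FALSE"]
fn <- conf["FALSE", "TRUE"]

accuracy <- (tp + tn) / sum(conf)
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
balanced_accuracy <- (sensitivity + specificity) / 2

# Class balance of the observations in the testing subset
class_share <- prop.table(table(df_test$waterlog.100))

print(conf)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 3))
print(round(class_share, 3))
```

Printing the class shares next to the metrics makes it easier to judge whether a high accuracy reflects model skill or merely the dominance of one class.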

5.2 Variable selection

  • Reduce the predictor set as demonstrated in this tutorial.
  • Repeat the model evaluation and compare the model performance on the test set with what was obtained with the model using all available covariates. Which model generalises better to unseen data?
  • Would the same model choice be made if we considered the OOB prediction error reported as part of the trained model object?
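
One way to put the OOB error and the test error side by side is sketched below. The data are synthetic, and the reduction rule (keeping the three covariates with the highest permutation importance) is a hypothetical placeholder for the variable selection demonstrated in this tutorial. For a classification forest, ranger stores the OOB misclassification rate in the element prediction.error of the fitted model object.

```r
library(ranger)

set.seed(1)

# Synthetic data (NOT the Bern dataset): 8 covariates X1..X8, only X1 and X2 matter
n <- 600
df <- data.frame(matrix(rnorm(n * 8), ncol = 8))
df$y <- factor(rbinom(n, 1, plogis(2 * df$X1 - 2 * df$X2)) == 1,
               levels = c(FALSE, TRUE))

idx <- sample(n, 420)
train <- df[idx, ]
test <- df[-idx, ]

# Full model, with permutation importance for the (hypothetical) reduction rule
rf_full <- ranger(y ~ ., data = train, importance = "permutation", seed = 1)

# Assumption for illustration: keep the 3 most important covariates
keep <- names(sort(rf_full$variable.importance, decreasing = TRUE))[1:3]
rf_red <- ranger(y ~ ., data = train[, c(keep, "y")], seed = 1)

# Misclassification rate on the test set
test_error <- function(mod) {
  mean(predict(mod, data = test)$predictions != test$y)
}

comparison <- data.frame(
  model = c("full", "reduced"),
  oob_error = c(rf_full$prediction.error, rf_red$prediction.error),
  test_error = c(test_error(rf_full), test_error(rf_red))
)
print(comparison)
```

If both error columns rank the two models in the same order, the OOB error would have led to the same model choice as the test set.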

5.3 Model optimization

In AGDS Book Chapter 11, you learned how to optimise hyperparameters using cross-validation. Using the training data subset, implement a 5-fold cross-validation to optimise the hyperparameters mtry and min.node.size of the same Random Forest model as implemented above. You may use the {caret} package as demonstrated in the AGDS Book. Evaluate the optimised model on the test set using the same metrics as considered above. Does the model generalise better to unseen data than the initial model (which used default hyperparameters, see ?ranger::ranger)?
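
A possible setup for the cross-validation with {caret} is sketched below, again on synthetic data. The grid values and num.trees = 200 are arbitrary assumptions, not recommendations. Note that caret's "ranger" method expects the tuning grid to contain a splitrule column in addition to mtry and min.node.size. It is held fixed at "gini" here.

```r
library(caret)
library(ranger)

set.seed(123)

# Synthetic data (NOT the Bern dataset): 6 covariates X1..X6
n <- 400
df <- data.frame(matrix(rnorm(n * 6), ncol = 6))
df$y <- factor(rbinom(n, 1, plogis(2 * df$X1 - df$X2)) == 1,
               levels = c(FALSE, TRUE))

idx <- sample(n, 280)
train_df <- df[idx, ]
test_df <- df[-idx, ]

# 5-fold cross-validation on the training subset only
ctrl <- trainControl(method = "cv", number = 5)

# Hypothetical search grid; splitrule is required by caret but kept fixed
grid <- expand.grid(
  mtry = c(2, 4, 6),
  splitrule = "gini",
  min.node.size = c(1, 5, 10)
)

mod_cv <- train(
  y ~ .,
  data = train_df,
  method = "ranger",
  trControl = ctrl,
  tuneGrid = grid,
  metric = "Accuracy",
  num.trees = 200   # passed on to ranger::ranger()
)

# Hyperparameters with the best cross-validated accuracy
print(mod_cv$bestTune)

# Baseline with ranger defaults, for comparison on the test set
rf_default <- ranger(y ~ ., data = train_df, num.trees = 200, seed = 123)

acc_tuned <- mean(predict(mod_cv, newdata = test_df) == test_df$y)
acc_default <- mean(predict(rf_default, data = test_df)$predictions == test_df$y)
print(c(tuned = acc_tuned, default = acc_default))
```

Accuracy is used here only to keep the sketch short. The exercise asks for the same metrics as in the earlier evaluation.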

5.4 Probabilistic predictions

Using the optimised hyperparameters (or, if you did not manage to obtain them, the initial defaults), train the Random Forest model with ranger::ranger(..., probability = TRUE). This yields a model that predicts not a binary class but the probability that the target is TRUE. The user can then choose where to put the threshold for translating a probability into a binary class. For example, if the predicted probability is \(>0.5\), consider this a prediction of TRUE. Establish the receiver operating characteristic (ROC) curve, as described in AGDS Book Chapter 8.3.
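
The following sketch shows how a probability forest and an ROC curve might be obtained. It uses synthetic data and computes the curve by hand in base R. The threshold grid and the trapezoidal AUC are illustrative choices and not necessarily the approach of the AGDS Book.

```r
library(ranger)

set.seed(7)

# Synthetic data (NOT the Bern dataset)
n <- 600
df <- data.frame(matrix(rnorm(n * 4), ncol = 4))
df$y <- factor(rbinom(n, 1, plogis(-0.5 + 1.5 * df$X1 - df$X2)) == 1,
               levels = c(FALSE, TRUE))

idx <- sample(n, 420)
train <- df[idx, ]
test <- df[-idx, ]

# Probability forest: predictions are class probabilities, not classes
rf_prob <- ranger(y ~ ., data = train, probability = TRUE, seed = 7)

# Matrix with one column per class; select the column for class TRUE by name
prob_true <- predict(rf_prob, data = test)$predictions[, "TRUE"]
obs <- test$y == "TRUE"

# True and false positive rates for a sequence of thresholds.
# Threshold 0 classifies everything as TRUE, Inf classifies nothing as TRUE.
thresholds <- c(seq(0, 1, by = 0.01), Inf)
roc <- as.data.frame(t(sapply(thresholds, function(th) {
  pred <- prob_true >= th
  c(threshold = th,
    tpr = sum(pred & obs) / sum(obs),
    fpr = sum(pred & !obs) / sum(!obs))
})))

# Area under the curve by the trapezoidal rule
ord <- order(roc$fpr, roc$tpr)
fpr <- roc$fpr[ord]
tpr <- roc$tpr[ord]
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = sprintf("ROC curve (AUC = %.2f)", auc))
abline(0, 1, lty = 2)   # reference line of a random classifier
```

Each point on the curve corresponds to one threshold, so the data frame roc can be used to look up the true and false positive rates implied by any candidate threshold.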

Suppose you are advising an infrastructure construction project in which waterlogged soils severely jeopardise the stability of the building. Then suppose you are advising a project in which waterlogged soils are unwanted but not critical. In both cases, your binary prediction map is used as a basis, and the binary classification is derived from the probabilistic prediction. How would you choose the threshold in each case? Would you choose the same threshold in both cases? If not, explain why. Can you think of an analogous problem in another domain?
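
To explore these questions, it may help to tabulate how the two types of error change with the threshold. The sketch below uses ten made-up probabilities and observations, not model output, and three arbitrary thresholds.

```r
# Made-up predicted probabilities of TRUE and made-up observations
prob <- c(0.05, 0.10, 0.25, 0.30, 0.45, 0.55, 0.60, 0.75, 0.85, 0.95)
obs <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Count missed waterlogged sites (fn) and false alarms (fp) for one threshold
count_errors <- function(th) {
  pred <- prob > th
  data.frame(
    threshold = th,
    fn = sum(!pred & obs),   # waterlogged, but predicted as not waterlogged
    fp = sum(pred & !obs)    # not waterlogged, but predicted as waterlogged
  )
}

# Arbitrary thresholds chosen for illustration
res <- do.call(rbind, lapply(c(0.2, 0.5, 0.8), count_errors))
print(res)
```

Lowering the threshold trades missed waterlogged sites for false alarms, and raising it does the opposite. Which of the two errors weighs more depends on the application.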