1  Introduction

Disclaimer

This tutorial builds on the course Applied Geodata Science I and assumes basic knowledge of the tidyverse and of Random Forests for data analysis.

1.1 A primer on spatial data science

Spatial data science combines geography, statistics, computer science, and data science to analyze and interpret spatially referenced data. It focuses on uncovering patterns and relationships in geospatial data to gain insights into spatial phenomena. By integrating locational information, spatial data science provides a deeper understanding of complex spatial patterns and processes. There are three key aspects of spatial data science that allow such a deeper understanding:

  • Spatial Data Visualization: Visualizing spatial data through maps and interactive visualizations helps communicate complex spatial information effectively. A brief introduction to working with geospatial data in R is given here.

  • Spatial Data Analysis: Techniques such as spatial clustering, spatial autocorrelation analysis, and spatial regression reveal spatial patterns, trends, and dependencies.

  • Geospatial Machine Learning: Applying machine learning algorithms to spatial data enables the creation of predictive models for spatially explicit predictions.

Together, these three aspects support a wide range of real-world applications: similar methods can be applied to problems in urban planning, ecology, transportation, public health, and the social sciences. The knowledge gained from maps, statistical analyses, and model predictions helps reveal the processes behind spatial relationships and, eventually, improves decision-making.

1.2 A primer on soil science

Generally, any soil is the result of five key pedogenetic factors (Jenny, 1994). They can be memorized with the mnemonic “CLORPT”: soil = f(CLimate, Organisms, topogRaphy, Parent material, Time, … ). The “…” stands for additional factors that have been recognized as important but do not fall under the original CLORPT scheme. Here, soil can stand for various soil properties such as texture, density, pH, water drainage, cation exchange capacity, or organic matter content.

Due to the intensification of land use through agriculture and urbanization, soils across the world are increasingly under threat. Yet, soils provide crucial ecosystem services, such as supporting our food system, draining water during heavy rainfall, and storing massive amounts of carbon (see Figure 1.1). To quantify such services and assess the risks associated with losing soils, good maps are needed that show where soil services are delivered. Creating such maps through exhaustive sampling campaigns is very costly and time-intensive, and the resulting maps often have coarse temporal and spatial resolution. Moreover, such sampling campaigns provide only a snapshot of the historical and current state; they cannot tell us anything about the soil’s future. So, there is a strong demand for models that provide information on a soil’s future across large spatial scales.

Figure 1.1: Soil functions as defined by the FAO. Figure taken from Baveye et al. (2020).

Luckily, information on the CLORPT variables is often available continuously across large areas, for example as spatial data products for climate, digital elevation models, and geological maps. This information is highly useful for creating models that predict key soil properties and services across the same area for which we have data. So, if we can produce a robust model, we can massively simplify sampling efforts, remove the need for hand-drawn maps, and inform decision-making processes in a cost- and labor-efficient manner.

Growing data availability and computational resources, together with advances in statistics, have made such statistical models feasible. Random Forests in particular have gained a lot of traction because they are relatively simple yet highly flexible (Hengl et al., 2018). They are therefore well suited for creating digital maps of all sorts of soil properties in a quick and simple manner.

Note, however, that the variety of pedogenetic factors and soil properties comes with an equal variety of data types. Variables can be numerical (capped, like % clay content, or un-capped, like organic matter content), binary (e.g., presence of water at 0-10 cm soil depth), categorical (more than two classes without an order), ordinal (more than two classes with an order), or interval (numerical values cut into intervals). Moreover, these data can come in different data formats. Because of this abundance of data types and formats and their peculiarities, it is of great importance to properly understand your data. Only when you know your data well can you pick a suitable statistical model to address your research question.
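As a minimal sketch of how these variable types can be represented in base R, consider the made-up data frame below. All column names and values are hypothetical and are not taken from the dataset used in this tutorial.

```r
# Hypothetical soil samples; all names and values are invented for illustration.
soil <- data.frame(
  clay_pct   = c(12.5, 40.0, 27.3, 8.1),    # numerical, capped (0-100 %)
  org_matter = c(1.2, 5.8, 3.4, 0.6),       # numerical, treated as un-capped
  waterlog   = c(TRUE, FALSE, FALSE, TRUE), # binary: water present at 0-10 cm?
  geology    = factor(c("molasse", "moraine", "limestone", "moraine")), # categorical
  drainage   = factor(c("poor", "good", "moderate", "poor"),
                      levels = c("poor", "moderate", "good"),
                      ordered = TRUE)        # ordinal: levels have an order
)

# Interval: cut a numerical variable into classes (0,10], (10,30], (30,100]
soil$clay_class <- cut(soil$clay_pct,
                       breaks = c(0, 10, 30, 100),
                       labels = c("low", "medium", "high"))

# Inspect how R stores each column
str(soil)
```

Checking the column classes with `str()` before modelling is a cheap way to catch, for example, an ordinal variable that was accidentally read in as plain text.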

1.3 Case-Study: Digital Soil Mapping with Random Forests

In this tutorial, we look at a specific case of spatial upscaling: digital soil mapping using Random Forests. This means that we want to predict soil properties that are difficult and laborious to measure, using a model that predicts them continuously across space by exploiting available spatial data such as climate data. Here’s a short checklist of what a good geospatial model should do. It should…

  • … capture non-linear relations because pedogenesis is a non-linear process.

  • … be able to use and predict continuous and categorical variables.

  • … handle multiple correlated variables without the risk of over-fitting.

  • … build models with good predictive power.

  • … result in a sparse model, keeping only relevant predictors.

  • … quantify prediction accuracy and uncertainty.

In the next chapters, we use a dataset on basic soil properties from sampling locations across the canton of Bern and pair it with climatic variables (temperature, precipitation, radiation), terrain attributes (derivatives of digital elevation models such as slope, northness, eastness, topographic wetness index, etc.), geological maps, and soil maps (Nussbaum et al., 2018). The following chapters cover the preparation of this data (Chapter 2), fitting a Random Forest model (Chapter 3), and evaluating this model (Chapter 4). Chapter 5, the final chapter, describes the exercise for this tutorial.
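To give a feel for what such terrain attributes look like, here is a small sketch of how northness and eastness are commonly derived from terrain aspect. It assumes aspect is given in degrees clockwise from north; the values are made up, and the dataset used in this tutorial may have been prepared with a different convention.

```r
# Hypothetical aspect values in degrees, measured clockwise from north.
aspect_deg <- c(0, 90, 180, 270)

# Trigonometric functions in R expect radians.
aspect_rad <- aspect_deg * pi / 180

northness <- cos(aspect_rad)  #  1 = north-facing, -1 = south-facing
eastness  <- sin(aspect_rad)  #  1 = east-facing,  -1 = west-facing

round(data.frame(aspect_deg, northness, eastness), 2)
```

Splitting aspect into these two components avoids the problem that aspect is circular: 1° and 359° are almost the same direction, but far apart as raw numbers.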

If you want to learn more about the underlying theory and similar techniques, we highly recommend the presentations by Madlene Nussbaum given at the summer school of the OpenGeoHub Foundation (see part 1 and part 2) and her papers: Nussbaum et al. (2017) and Nussbaum et al. (2018).