Introduction

The sheer volume of data becoming available today holds huge potential for answering long-standing questions in all fields of environmental and geo-sciences. This gives rise to a new set of tools that can be used, and to new challenges when applying them.

What is Applied Geodata Science?

Data science is interdisciplinary by nature. It sits at the intersection of domain expertise, knowledge of Statistics and Mathematics, and coding skills. Data science generates new insights for applications in different fields by combining these three realms (Fig. 0.2). Combining only two of the three realms falls short of what data science is (Conway, 2013).


Figure 0.2: The Venn diagram of data science. Adapted from [Conway, 2013](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram).

Dealing with data requires coding (but not a degree in computer science). Coding skills are essential for file and data manipulation and for thinking algorithmically.

Basic knowledge of Statistics and Mathematics is needed for extracting insights from data and for applying appropriate statistical methods. An overview of methods, a general familiarity, and an intuitive understanding of the basics are more important for most data science projects than a PhD in Statistics.

Statistics plus data yields machine learning, but not “data science”. In data science, questions and hypotheses are motivated by the scientific endeavor in different domains or by applications in the public or private sectors. To emphasize this course’s distinctly applied and domain-oriented approach to data science, we call it Applied Geodata Science.

Of course, empirical research has always relied on data. The essential ingredient of a course in (Applied Geo-) data science is that it emphasizes the methodological aspects that are unique and critical for data-intensive research in Geography and Environmental Sciences, and for putting Open Science into practice.

This course is also meant to teach you how to stay out of the “danger zone”, where data is handled and models are fitted with a blind eye to fundamental assumptions and relations. The aim of data science projects is to yield credible (“trustworthy”) and robust results.

The data science workflow

The common thread running through this course is the data science workflow (Fig. 0.3). Applied (geo-) data science projects typically start with research questions and hypotheses, and some data at hand, and (ideally) end with an answer to the research questions and the communication of results in textual, visual, and reproducible forms. What lies in between is not a linear process, but a cycle. One has to “understand” the data in order to identify appropriate analyses for answering the research questions. Before we’ve visualized the data, we don’t know how to transform it. And before we’ve modeled it, we don’t know the most appropriate visualization. In practice, we approach answers to our research questions gradually, through repeated cycles of exploratory data analysis: transforming the data, visualizing it, and modelling relationships. More often than not, the exploratory data analysis generates insights about missing pieces in the data puzzle that we’re trying to solve. In such cases, the data collection and modelling tasks may have to be redefined (dashed line in Fig. 0.3), and the exploratory data analysis cycle re-initiated.


Figure 0.3: The data science workflow. Figure adapted from Wickham and Grolemund, [*R for Data Science*](https://r4ds.had.co.nz/index.html).

As we work our way through repeated cycles of exploratory data analysis, we take decisions based on our data analysis, modelling, and visualizations. And we write code. The final conclusions we draw, the answers to research questions we find, and the results we communicate rest on the combination of all steps of our data processing, analysis, and visualization. Simply put, they rest on the reproducibility (and legibility) of our code (encapsulated by ‘Program’ in Fig. 0.3).
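
To make this cycle concrete, the following sketch runs through one pass of transforming, visualizing, and modelling in R, assuming the dplyr and ggplot2 packages and a purely synthetic data frame; the variable names (`temperature`, `gpp`) and values are illustrative and not taken from the course data.

```r
# A minimal sketch of one pass through the cycle in Fig. 0.3,
# using synthetic data (all names and values are hypothetical).
library(dplyr)
library(ggplot2)

set.seed(1)
df <- tibble(
  temperature = runif(100, 0, 30),              # synthetic predictor
  gpp = 0.5 * temperature + rnorm(100, sd = 2)  # synthetic response
)

# Transform: remove missing values and derive a new variable
df_clean <- df |>
  filter(!is.na(gpp)) |>
  mutate(warm = temperature > 20)

# Visualize: inspect the relationship before choosing a model
ggplot(df_clean, aes(temperature, gpp)) +
  geom_point(aes(color = warm)) +
  geom_smooth(method = "lm")

# Model: a simple linear regression as a first working hypothesis
fit <- lm(gpp ~ temperature, data = df_clean)
summary(fit)
```

In a real project, the output of `summary(fit)` and the plot would typically prompt another round of transformation, visualization, and modelling.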

Why now?

Three general developments set the stage for this course. First, Geography and Environmental Sciences (as many other realms of today’s world) have entered a data-rich era (Chapter 5). Second, machine learning algorithms have revolutionized the way we can extract information from large volumes of data (this Chapter and Chapters 10 - 11). Third, Open Science principles (Chapter 6), essential for inclusive research, boundless progress, and for diffusing science to society, are becoming a prerequisite for getting research funded and published. The skill set required to make use of the potential of a data-rich world is diverse and is often not taught as part of the curriculum in the natural sciences (as of year 2023). This course fills this gap.

A new modelling paradigm

What is ‘modelling’? Models are an essential part of the scientific endeavor. They are used for describing the world, explaining observed phenomena, and for making predictions that can be tested with data. Models are thus a device for translating hypotheses of how the world operates into a form that can be confronted with how the world is observed.

Models can be more or less explicit and more or less quantitative. They can come in the form of vague mental notions that underpin our view of the world and our interpretation of observations. Towards the more specific end of this spectrum, models can be visualizations, for example of how elements in a system are connected. At the arguably most explicit and quantitative end of the spectrum are models that rely on mathematical descriptions of how elements of a system are connected and how processes operate. Examples of such models include General Circulation Models of the climate system or models used for Numerical Weather Prediction. Such models are often referred to as mechanistic models.

A further distinction within mechanistic models can be made between dynamic models, which describe the temporal evolution of a system (e.g., the dynamics of the atmosphere and the ocean in a General Circulation Model), and “static” models (e.g., a model for estimating the power generation of a solar photovoltaics station). In a dynamic model, we need to specify an initial state, and the model (in many cases given additional inputs) predicts the evolution of the system from there. In a static model, the prediction can be described as a function of a set of inputs, without temporal dependencies between the inputs and the model prediction.
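
As a toy illustration of this distinction, the following R sketch contrasts a static model (a pure function of instantaneous inputs) with a dynamic model (a state updated over time from an initial condition). All functions, parameter values, and units are hypothetical and chosen only to show the structure.

```r
# Static model: the prediction is a function of current inputs only,
# e.g. photovoltaic power from incoming radiation (hypothetical parameters).
pv_power <- function(radiation, efficiency = 0.2, area = 10) {
  radiation * efficiency * area   # W, given radiation in W m-2 and area in m2
}
pv_power(radiation = 800)

# Dynamic model: the state evolves over time from an initial condition,
# e.g. a water storage pool driven by precipitation and drainage.
simulate_storage <- function(precip, s0 = 50, drainage_rate = 0.1) {
  storage <- numeric(length(precip))
  state <- s0                          # the initial state must be specified
  for (t in seq_along(precip)) {
    state <- state + precip[t] - drainage_rate * state
    storage[t] <- state
  }
  storage
}
simulate_storage(precip = c(5, 0, 10, 2, 0))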

Often, a distinction is made between mechanistic and empirical models (or, used here as a synonym, statistical models). Empirical models can be viewed as lying closer to the less explicit end of the spectrum described above. In mechanistic models, the mathematical descriptions of relationships are informed by theory or by independently determined relationships (e.g., laboratory measurements of metabolic rates of an enzyme). In contrast, empirical models rely on little or no a priori knowledge built into the model formulation. However, it should be noted that mechanistic models often also rely on empirical or statistical descriptions for individual components (e.g., the parametrisation of convection in a climate model), and statistical models may, in some cases, also be viewed as a representation of mechanisms that reflects our theoretical understanding. For example, depending on whether a relationship between two variables is linear or saturating by nature, we would choose a different structure for an empirical model. A specific example is the light use efficiency model (Monteith, 1972), which linearly relates vegetation productivity to the amount of absorbed solar radiation. It simply has the form of a bivariate linear regression model. Vice versa, traditional statistical models also rely on assumptions regarding the data generating process(es) and the resulting distribution of the data.
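
To illustrate how such an empirical model looks in practice, here is a minimal sketch of a light-use-efficiency-style regression, relating productivity linearly to absorbed radiation and fitted as a bivariate linear model to synthetic data; the coefficient, noise level, and units are made up for illustration only.

```r
# Light use efficiency idea as a bivariate linear regression:
# productivity = LUE * absorbed radiation (synthetic, hypothetical data).
set.seed(42)
apar <- runif(200, 0, 12)                  # absorbed radiation (arbitrary units)
gpp  <- 1.8 * apar + rnorm(200, sd = 1)    # productivity with random noise

fit_lue <- lm(gpp ~ apar)                  # the slope estimates the light use efficiency
coef(fit_lue)
```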

Supervised machine learning models can be regarded as empirical models that are even more “assumption-free” than traditional statistical models. In contrast to mechanistic models, where rules and hypotheses are explicitly and mathematically encoded, and in contrast to statistical models, where assumptions about the data distribution are made for specifying the model, machine learning approaches modelling from the flip side: from the data to the insight (Breiman, 2001). Rules are not encoded by a human, but discovered by the machine. Machine learning models learn patterns in the data for making new predictions, rather than relying on theory and a priori knowledge of the system. In that sense, machine learning follows a new modelling paradigm. The learning aspect in machine learning refers to the automatic search process and the guidance of the model fitting by a feedback signal (the loss function), as employed in machine learning algorithms (see also Chapter 10).
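
The following minimal sketch illustrates the learning idea in its simplest form: a single model parameter is adjusted iteratively to reduce a loss function (here, the mean squared error) on synthetic data. It is a bare-bones gradient descent written from scratch, not the implementation of any particular machine learning library.

```r
# Learning as loss-guided parameter search (synthetic data, one parameter).
set.seed(1)
x <- runif(100)
y <- 3 * x + rnorm(100, sd = 0.3)          # "true" slope of 3 plus noise

slope <- 0                                  # initial parameter guess
learning_rate <- 0.1
for (i in 1:200) {
  pred <- slope * x                         # model prediction
  gradient <- -2 * mean((y - pred) * x)     # derivative of the MSE loss w.r.t. the slope
  slope <- slope - learning_rate * gradient # step towards lower loss
}
slope                                       # converges near the loss-minimising value
```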

The aspect of “patterns in the data” is key here. Often, these patterns are fuzzy. Rule-based algorithms have a limited capacity for dealing with such problems. Symbolic artificial intelligence is based on rules and underlies, for example, a computer playing chess (Chollet & Allaire, 2018). However, where rules cannot be encoded from the outset, symbolic artificial intelligence reaches its limits. A breakthrough in learning from fuzzy patterns in the data has been enabled by deep learning. Through multiple layers of abstraction of the data, deep learning models identify underlying, abstract relationships and use them for prediction. Deep learning has been extremely successful in solving problems, e.g., in image classification, speech recognition, or language translation.

However, the abstraction comes at the cost of interpretability. Deep learning models, and machine learning models in general, are used with an emphasis on prediction and have seen particularly wide adoption in fields where a false prediction has acceptable consequences (an inappropriate book recommendation based on your previous purchases is not grave) (Knüsel et al., 2019). The model itself remains a black box and its utility for hypothesis testing is limited. This challenge has spurred the field of interpretable machine learning, where solutions are sought for uncovering the black box and probing the model for its trustworthiness.

Chapters 8-11 lead into the world of machine learning and introduce the essential steps of the modelling workflow without delving into deep learning. Together with the preceding chapters, this completes the toolbox required for taking the first data-scientific steps for applications in Geography and Environmental Sciences. This may be only just the beginning…