1  The data

We will use a dataset of leaf nitrogen (N) content measured in the field. Leaf N content is central to understanding photosynthesis rates and the biogeochemical cycling of N and C in terrestrial ecosystems. A rich body of literature has investigated global patterns of leaf N across the Earth’s biomes and the relationships between leaf N and environmental factors. In recent years, leaf N data collected in the field by a large number of individual campaigns have been collated into homogenised, analysis-ready data compilations. “Small data” has been made “big”. Because these data are geolocalised, covariate data from files with global coverage can be extracted at the measurement locations, used to complement the observational leaf N data, and used to model leaf N on the basis of environmental covariates.

Research in our group (GECO, Institute of Geography, University of Bern) has generated such analysis-ready leaf N data, complemented with environmental covariates, and made them openly accessible on GitHub.

Load the data directly from its online source on GitHub.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- readr::read_csv("https://raw.githubusercontent.com/stineb/leafnp_data/main/data/leafnp_tian_et_al.csv")
Rows: 36414 Columns: 66
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): FunGroups, Dc_Db_Ec_Eb_Hf_Hg, tree_shrub_Herb, Family_New, Family,...
dbl (56): lon, lat, leafN, leafP, LeafNP, Lat_Di_check_final, Lon_Di_check_f...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
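
The message above suggests declaring the column types explicitly. A minimal sketch of what that could look like, assuming a hypothetical two-row excerpt with made-up values in place of the real file (only the column names `lon`, `lat`, `leafN`, and `Species` are taken from the data):

```r
library(readr)

# Hypothetical excerpt standing in for the real CSV; the values are made up.
# Wrapping the string in I() tells readr to treat it as literal data, not a path.
csv_text <- I("lon,lat,leafN,Species\n7.5,46.9,15.2,Fagus sylvatica\n8.0,47.4,12.8,Picea abies\n")

# Declaring col_types fixes the parsed types and suppresses the
# column specification message.
toy <- read_csv(
  csv_text,
  col_types = cols(
    lon     = col_double(),
    lat     = col_double(),
    leafN   = col_double(),
    Species = col_character()
  )
)
```

For the full file with 66 columns, passing `show_col_types = FALSE` is probably the more practical way to silence the message while keeping the guessed types.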

We will work with a limited subset of the variables available in the file, and with observations of the 50 most common species only. Aggregation by sites (identified by their respective longitudes and latitudes) is left commented out in the code:

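# identify the 50 species with the most observations
common_species <- df |> 
  group_by(Species) |> 
  summarise(count = n()) |> 
  arrange(desc(count)) |> 
  slice(1:50) |> 
  pull(Species)

# keep the target, the coordinates, a subset of covariates, and the species,
# restricted to the common species
dfs <- df |> 
  dplyr::select(leafN, lon, lat, elv, mat, map, ndep, mai, Species) |> 
  filter(Species %in% common_species)
  # site-level aggregation (disabled):
  # group_by(lon, lat) |> 
  # summarise(across(where(is.numeric), mean))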
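
The two commented-out lines at the end of the chunk above would average all numeric variables within each site. A minimal sketch of the effect, assuming a hypothetical toy table with made-up values in place of `dfs`:

```r
library(dplyr)

# Toy stand-in for `dfs` (made-up values): two sites, the first one
# with two observations.
toy <- tibble::tibble(
  leafN   = c(10, 20, 30),
  lon     = c(7.5, 7.5, 8.0),
  lat     = c(46.9, 46.9, 47.4),
  Species = c("Fagus sylvatica", "Picea abies", "Fagus sylvatica")
)

# One row per site: average every numeric column within each lon/lat pair.
# Non-numeric columns such as Species are dropped by where(is.numeric).
toy_sites <- toy |>
  group_by(lon, lat) |>
  summarise(across(where(is.numeric), mean), .groups = "drop")
```

Note that this drops `Species`, so site-level aggregation and species-level filtering serve different modelling purposes; with the aggregation disabled, `dfs` keeps one row per observation.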

# quick overview of data
skimr::skim(dfs)
Data summary
Name dfs
Number of rows 22472
Number of columns 9
_______________________
Column type frequency:
character 1
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Species 0 1 10 23 0 50 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
leafN 0 1 15.58 5.11 1.02 12.22 14.57 17.50 54.04 ▂▇▂▁▁
lon 0 1 18.08 27.33 -157.79 5.24 13.62 19.95 140.59 ▁▁▇▂▁
lat 0 1 48.35 9.55 -37.49 43.00 48.99 52.44 69.75 ▁▁▁▆▇
elv 0 1 494.28 469.48 -5.00 135.00 357.00 716.00 4847.90 ▇▂▁▁▁
mat 0 1 8.80 3.76 -4.88 6.97 8.64 10.30 29.96 ▁▇▅▁▁
map 0 1 818.59 314.40 105.19 607.29 721.15 955.24 3641.73 ▇▅▁▁▁
ndep 0 1 1.22 0.51 0.07 0.82 1.22 1.55 2.68 ▂▅▇▃▁
mai 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▅▇▃▁▁
# show missing data
visdat::vis_miss(dfs)
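
A numeric cross-check of the same information is the count of missing values per column. A minimal sketch, assuming a hypothetical toy table with made-up values in place of `dfs`:

```r
library(dplyr)

# Toy stand-in for `dfs` (made-up values) with one gap each in leafN and mat.
toy <- tibble::tibble(
  leafN = c(12.1, NA, 15.3),
  mat   = c(8.2, 9.1, NA),
  map   = c(700, 820, 910)
)

# Count the NA entries in every column; returns a one-row tibble.
n_missing <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))
```

Applied to `dfs`, this should agree with the `n_missing` column reported by `skimr::skim()`.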