The package ingestr
provides functions to extract
(ingest) environmental point data (given longitude, latitude, and
required dates) from large global files or remote data servers and
create time series at user-specified temporal resolution (varies for
different data sets).
This is to make your life simpler when downloading and reading site-scale data: it offers a common interface with a single function each for single-site and multi-site ingest, and a common, tidy format of ingested data across a variety of data sources and original file formats. 'Sources' refers both to data sets hosted remotely and accessed through an API, and to local data sets. ingestr is particularly suited for preparing model forcing and offers a set of functionalities to transform original data into common, standardized formats and units. This includes interpolation methods for converting monthly climate data (currently CRU TS) to daily time steps.
The key functions are ingest_bysite()
and
ingest()
for a single-site data ingest and a multi-site
data ingest, respectively. For the multi-site data ingest, site meta
information is provided through the argument siteinfo
which
takes a data frame with columns lon
for longitude,
lat
for latitude, and (for time series downloads)
year_start
and year_end
, specifying required
dates (including all days of respective years). Sites are organised
along rows. An example site meta info data frame is provided as part of
this package for sites included in the FLUXNET2015 Tier 1 data set
(additional columns are not required by ingest_bysite()
and
ingest()
):
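head(ingestr::siteinfo_fluxnet2015)  # data frame exported with the package; includes sitename, lon, lat, year_start, year_end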
The following sources can be handled currently:
Data source | Data type | Coverage | Source ID | Reading from | Remark |
---|---|---|---|---|---|
FLUXNET | time series by site | site | fluxnet | local files | |
WATCH-WFDEI | time series raster map | global | watch_wfdei | local files | |
CRU | time series raster map | global | cru | local files | |
MODIS LP DAAC | time series raster map | global | modis | remote server | using MODISTools |
Google Earth Engine | time series raster map | global | gee | remote server | using Koen Hufkens' gee_subset library |
ETOPO1 | raster map | global | etopo1 | local files | |
Mauna Loa CO2 | time series | site | co2_mlo | remote server | using the climate R package |
HWSD | raster map | global | hwsd | local files | using the rhwsd R package |
WWF Ecoregions | shapefile map | global | wwf | local files | Olson et al. (2001) |
N deposition | time series raster map | global | ndep | local files | Lamarque et al. (2011) |
SoilGrids | raster map | global | soilgrids | remote server | Hengl et al. (2017) |
ISRIC WISE30sec | raster map | global | wise | local files | Batjes (2016) |
GSDE Soil | raster map | global | gsde | local files | Shangguan et al. (2014) |
WorldClim | raster map | global | worldclim | local files | Fick & Hijmans (2017) |
Examples to read data for a single site for each data type are given in Section ‘Examples for a single site’. Handling ingestion for multiple sites is described in Section ‘Example for a set of sites’. Note that this package does not provide the original data. Please follow links to data sources above where data is read from local files, and always cite original references.
All ingested data follows standardized variable naming and (optionally) standardized units.
Variable | Variable name | Units |
---|---|---|
Gross primary production | gpp | g CO₂ m⁻² |
Air temperature | temp | °C |
Daily minimum air temperature | tmin | °C |
Daily maximum air temperature | tmax | °C |
Precipitation | prec | mm s⁻¹ |
Vapour pressure deficit | vpd | Pa |
Atmospheric pressure | patm | Pa |
Net radiation | netrad | J m⁻² s⁻¹ (= W m⁻²) |
Photosynthetic photon flux density | ppfd | mol m⁻² s⁻¹ |
Elevation (altitude) | elv | m a.s.l. |
Use these variable names to specify which variables in the original data source they correspond to (see argument getvars to functions ingest() and ingest_bysite()). gpp is cumulative, corresponding to the time scale of the data. For example, if daily data is read, gpp is the total gross primary production per day (g CO₂ m⁻² d⁻¹).
The function ingest_bysite()
can be used to ingest data
for a single site. The argument source
specifies which data
type (source) is to be read from and triggers the use of
specific wrapper functions that are designed to read from original files
with formats that differ between sources. Source-specific settings for
data processing can be provided by argument settings
(described for each data source below). More info about other,
source-independent arguments is available through the man page (see
?ingest_bysite
).
Reading from FLUXNET files offers multiple settings that can be
specified by the user. Here, we're specifying that no soil water content
data is read (getswc = FALSE
in
settings_fluxnet
, passed to ingest_bysite()
through argument settings
).
settings_fluxnet <- list(getswc = FALSE)
df_fluxnet <- ingest_bysite(
sitename = "FR-Pue",
source = "fluxnet",
getvars = list(temp = "TA_F",
prec = "P_F",
vpd = "VPD_F",
ppfd = "SW_IN_F",
netrad = "NETRAD",
patm = "PA_F"),
dir = paste0(path.package("ingestr"), "/extdata/"), # example file delivered through package and located here
settings = settings_fluxnet,
timescale = "d",
year_start = 2007,
year_end = 2007,
verbose = FALSE
)
df_fluxnet
getvars
defines the variable names in the original files
corresponding to the respective variables with ingestr-standard naming
(see table above). The example above triggers the ingestion of the six
variables
"TA_F", "P_F", "VPD_F", "SW_IN_F", "NETRAD", "PA_F"
for
"temp", "prec", "vpd", "ppfd", "netrad", "patm"
,
respectively.
The same function can also be used to read in other FLUXNET variables
(e.g., CO2 flux data) and conduct data filtering steps. Here, we’re
reading daily GPP and uncertainty (standard error), based on the
nighttime flux decomposition method ("GPP_NT_VUT_REF"
and
"GPP_NT_VUT_SE"
in argument getvars
). The
settings
argument can be used again to specify settings
that are specific to the "fluxnet"
data source. Here, we
keep only data where at least 80% is based on non-gapfilled half-hourly
data (threshold_GPP = 0.8
), and where the daytime and
nighttime-based estimates are consistent, that is, where their
difference is below the 97.5% and above the 2.5% quantile
(filter_ntdt = TRUE
). Negative GPP values are not removed
(remove_neg = FALSE
). We read data for just one year here
(2007).
settings_fluxnet <- list(
getswc = FALSE,
filter_ntdt = TRUE,
threshold_GPP= 0.8,
remove_neg = FALSE
)
ddf_fluxnet <- ingest_bysite(
sitename = "FR-Pue",
source = "fluxnet",
getvars = list( gpp = "GPP_NT_VUT_REF",
gpp_unc = "GPP_NT_VUT_SE"),
dir = paste0(path.package("ingestr"), "/extdata/"),
settings = settings_fluxnet,
timescale = "d",
year_start = 2007,
year_end = 2007
)
The argument settings
in functions
ingest_bysite()
and ingest()
is used to pass
settings that are specific to the data source (argument
source
) with which the functions are used. Default settings
are specified for each data source. For source = "fluxnet"
,
defaults are returned by a function call of
get_settings_fluxnet()
and are described in the function’s
man page (see ?get_settings_fluxnet
). Defaults are used for
settings elements that are not specified by the user.
Let’s extract data for the location corresponding to FLUXNET site
'FR-Pue' (lon = 3.5958, lat = 43.7414). This extracts from original
WATCH-WFDEI files, provided as NetCDF (global, 0.5 degree resolution),
organised as monthly files containing all days in each month. The data
directory specified here (dir = "~/data/watch_wfdei/"
)
contains sub-directories with names containing the variable names. The
argument getvars
works differently compared to
"fluxnet"
. Here, getvars
is a vector of
ingestr-standard variable names to be read. ingestr automatically reads
from the respective files with WATCH-WFDEI variable names. Available
variables are: "temp", "ppfd", "vpd", "patm", "prec"
. The
latter is the sum of snow and rain. Below, we read data for just one
year (2018).
A bias correction may be applied by specifying the settings as in the
example below. By specifying correct_bias = "worldclim"
(the only option currently available), this uses a high-resolution
(30’’) monthly climatology based on years 1970-2000 and corrects the
WATCH-WFDEI data by month, based on the difference (ratio for variables
other than temperature) of its monthly means, averaged across
1979-2000.
WATCH-WFDEI data is available for years from 1979. If
year_start
is before that, the mean seasonal cycle,
averaged across 1979-1988 is returned for all years before 1979.
df_watch <- ingest_bysite(
sitename = "FR-Pue",
source = "watch_wfdei",
getvars = c("temp"),
dir = "~/data/watch_wfdei/",
timescale = "d",
year_start = 2018,
year_end = 2018,
lon = 3.5958,
lat = 43.7414,
verbose = TRUE
#settings = list(correct_bias = "worldclim", dir_bias = "~/data/worldclim")
)
df_watch
As above, let’s extract CRU data for the location corresponding to
FLUXNET site 'FR-Pue' (lon = 3.5958, lat = 43.7414). The argument
getvars
works the same way as for WATCH-WFDEI: it is a vector
of ingestr-standard variable names to be read. ingestr automatically
reads from the respective files with CRU variable names. Available
variables are:
"tmin", "tmax", "temp", "vpd", "ccov", "wetd"
.
Note that we're reading tmin and tmax (the daily minimum and
maximum temperature). This extracts monthly data from the CRU TS data.
Interpolation to daily values is done using a weather generator for
daily precipitation (given monthly total precipitation and number of wet
days in each month), and a polynomial that conserves monthly means for
all other variables.
df_cru <- ingest_bysite(
sitename = "FR-Pue",
source = "cru",
getvars = c("tmin", "tmax"),
dir = "~/data/cru/ts_4.05/",
timescale = "d",
year_start = 2007,
year_end = 2007,
lon = 3.5958,
lat = 43.7414,
verbose = FALSE
)
df_cru
We can compare the temperature recorded at the site and the temperature data extracted from WATCH-WFDEI and CRU.
df <- df_fluxnet %>%
rename(temp_fluxnet = temp) %>%
left_join(rename(df_watch, temp_watch = temp), by = c("sitename", "date")) %>%
left_join(rename(df_cru, temp_min_cru = tmin, temp_max_cru = tmax), by = c("sitename", "date")) %>%
pivot_longer(cols = c(temp_fluxnet, temp_watch, temp_min_cru, temp_max_cru), names_to = "source", values_to = "temp", names_prefix = "temp_")
library(ggplot2)
df %>%
ggplot(aes(x = date, y = temp, color = source)) +
geom_line()
Looks sweet.
Let’s have a look at the hourly climate data from the WFDE5 dataset.
Again, let's extract data for the location corresponding to FLUXNET
site 'FR-Pue' (lon = 3.5958, lat = 43.7414). All input arguments work the
same way as described above for WATCH-WFDEI and CRU TS. Note that we are
setting timescale = "h"
here to obtain an hourly dataframe.
Available variables are:
"temp", "ppfd", "vpd", "patm", "prec", "wind", "swin", "lwin"
.
df_wfde5 <- ingest_bysite(
sitename = "FR-Pue",
source = "wfde5",
getvars = c("temp"),
dir = "~/data/wfde5/",
timescale = "h",
year_start = 2007,
year_end = 2007,
lon = 3.5958,
lat = 43.7414,
verbose = TRUE,
settings = list(correct_bias = "worldclim", dir_bias = "~/data/worldclim")
)
df_wfde5
This uses the MODISTools R package
making its interface consistent with ingestr. Settings can be specified
and passed on using the settings
argument. To facilitate
the selection of data products and bands to be downloaded, you may use
the function get_settings_modis(), which defines defaults for
different data bundles
(c("modis_fpar", "modis_lai", "modis_evi", "modis_ndvi", "modis_refl", "modis_lst")
are
available).
"modis_fpar"
: MODIS collection 6, MCD15A3H, band
Fpar_500m
"modis_lai"
: MODIS collection 6, MCD15A3H, band
Lai_500m
"modis_evi"
: MODIS collection 6, MOD13Q1, band
250m_16_days_EVI
"modis_ndvi"
: MODIS collection 6, MOD13Q1, band
250m_16_days_NDVI
"modis_refl"
: MODIS/Terra+Aqua Nadir BRDF-Adjusted
Reflectance (NBAR) Daily L3 Global 500 m SIN Grid, all bands"modis_lst"
: MODIS/Terra Land Surface
Temperature/Emissivity 8-Day L3 Global 1 km SIN Grid (MOD11A2 v006)The filtering criteria are hard-coded specifically for each product,
using its respective quality control information (see function
gapfill_interpol()
in
R/ingest_modis_bysite.R
). For more information on the
settings do ?get_settings_modis
.
The following example is for downloading MODIS FPAR MCD15A3H data.
Note the specification of the argument network = "FLUXNET"
.
This triggers the download of prepared subsets aligning with site
locations for different networks (see here) which is much faster than
the download of data for arbitrary locations. This also makes the
specification of longitude and latitude values in the call to
ingest_bysite()
obsolete and downloads a scene of 17 x 17
pixels. Using n_focal
in get_settings_modis()
subsets the scene to central pixels where the value provided for
n_focal
is the distance in number of pixels away from the
center pixel to be taken for averaging. This is done the same way for
the network and non-network ingest options.
settings_modis <- get_settings_modis(
bundle = "modis_fpar",
data_path = "~/data/modis_subsets/",
method_interpol = "loess",
keep = TRUE,
overwrite_raw = FALSE,
overwrite_interpol= TRUE,
n_focal = 0,
network = "FLUXNET"
)
This can now be used to download the data to the directory specified
by argument data_path
of function
get_settings_modis(). The data downloaded through MODISTools
is then stored in <data_path>/raw/
. When calling the
functions ingest()
or ingest_bysite()
with the
setting overwrite_raw = FALSE
, the raw data file is read
and not re-downloaded if available locally. Raw data contains
information only for dates where MODIS data is provided.
ingest()
and ingest_bysite()
interpolate to
daily values following the setting method_interpol
.
Note also that downloaded raw data are cutouts including pixels within 1
km of the focal point indicated by the site longitude and latitude
(using arguments km_lr = 1.0
and km_ab = 1.0
in the MODISTools::mt_subset()
call). This is hard-coded in
ingestr. To select a smaller radius of pixels around the focal point
included for taking the mean, set the setting n_focal
to an
integer (0:N), with 0 selecting only the single centre pixel in which
the site is located, N=1 for including one pixel around the centre (nine
in total), N=2 for 25 in total etc.
df_modis_fpar <- ingest_bysite(
sitename = "CH-Lae",
source = "modis",
year_start= 2018,
year_end = 2019,
# lon = 8.36439, # not needed when network = "FLUXNET"
# lat = 47.47833, # not needed when network = "FLUXNET"
settings = settings_modis,
verbose = FALSE
)
Plot this data.
plot_fapar_ingestr_bysite(
df_modis_fpar,
settings_modis)
The library gee_subset
by Koen Hufkens can be downloaded
from this link and
used to extract data directly from Google Earth Engine. Note that this
requires git to be available (it can be installed, e.g., with
brew install git). Then, carry out the following steps:
To get access to the Google Earth Engine API (required to use
the gee_subset library), carry out the following steps in
your terminal. This follows steps described here.
I had an error and first had to follow the steps described in this link:
To facilitate the selection of data products and bands to be
downloaded, you may use the function get_settings_gee()
which defines defaults for different data bundles
(c("modis_fpar", "modis_evi", "modis_lai", "modis_gpp")
are
available).
"modis_fpar"
: MODIS/006/MCD15A3H, band Fpar"modis_evi"
: MODIS/006/MOD13Q1, band EVI"modis_lai"
: MOD15A2, band Lai_1km
"modis_gpp"
: MODIS/006/MOD17A2H, band GppThe following example is for downloading MODIS FPAR data.
settings_gee <- get_settings_gee(
bundle = "modis_fpar",
python_path = system("which python", intern = TRUE),
gee_path = "~/google_earth_engine_subsets/gee_subset/",
data_path = "~/data/gee_subsets/",
method_interpol = "linear",
keep = TRUE,
overwrite_raw = FALSE,
overwrite_interpol= TRUE
)
This can now be used to download the data to the directory specified
by argument data_path
of function
get_settings_gee()
.
df_gee_modis_fpar <- ingest_bysite(
sitename = "CH-Lae",
source = "gee",
year_start= 2009,
year_end = 2010,
lon = 8.36439,
lat = 47.47833,
settings = settings_gee,
verbose = FALSE
)
Ingesting CO2 data is particularly simple. We can safely assume it’s
well mixed in the atmosphere (independent of site location), and we can
use an annual mean value for all days in respective years, and use the
same value for all sites. Using the R package climate, we can load
CO2 data from Mauna Loa directly into R. This is downloading data from
ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt.
Here, ingest()
is a wrapper for the function
climate::meteo_noaa_co2()
.
df_co2 <- ingest_bysite(
sitename = "CH-Lae",
source = "co2_mlo",
year_start= 2007,
year_end = 2014,
verbose = FALSE
)
Argument dir
can be provided here, too. In that case,
CO2 data is written (after download if it’s not yet available) and read
to/from a file located at <dir>/df_co2_mlo.csv
.
More info about the climate package and the data can be obtained here and by:
?climate::meteo_noaa_co2
THE FOLLOWING IS UNDER CONSTRUCTION. MAKE READABLE FOR FILE AVAILABLE HERE: http://www.pik-potsdam.de/~mmalte/rcps/
Mauna Loa CO2 is not available for years before 1958. Alternative CO2
data is from CMIP standard forcing with merged time series from
atmospheric measurements and ice core reconstructions. This can be
selected with source = "co2_cmip".
df_co2 <- ingest_bysite(
sitename = "CH-Lae",
source = "co2_cmip",
year_start= 2007,
year_end = 2014,
verbose = FALSE,
dir = "~/data/co2"
)
Four steps are required before you can use
ingest_bysite()
to get HWSD
data:
if(!require(devtools)){install.packages("devtools")}
devtools::install_github("stineb/rhwsd")
list.of.packages <- c("DBI", "RSQLite")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
Download the HWSD raster data and set the settings
argument accordingly (in this example:
"~/data/hwsd/HWSD_RASTER/hwsd.bil"). Then, use it as above,
providing the path to the downloaded file through the settings
argument:
df_hwsd <- ingest_bysite(
sitename = "CH-Lae",
source = "hwsd",
lon = 8.36439,
lat = 47.47833,
settings = list(fil = "~/data/hwsd/HWSD_RASTER/hwsd.bil"),
verbose = FALSE
)
This reads nitrogen deposition from global annual maps by Lamarque et al. (2011). This provides annual data separately for NHx and NOy in gN m⁻² yr⁻¹ from a global map provided at half-degree resolution and covering years 1860-2009.
df_ndep <- ingest_bysite(
sitename = "CH-Lae",
source = "ndep",
lon = 8.36439,
lat = 47.47833,
year_start= 2000,
year_end = 2009,
timescale = "y",
dir = "~/data/ndep_lamarque/",
verbose = FALSE
)
Point extractions from SoilGrids layers are implemented following this and are provided through ISRIC.
Available layers and variable naming conventions are described here.
Which variable is to be extracted and for which soil depth layer can be
specified in the settings, a list returned by the function call
get_settings_soilgrids()
.
Available variables are described in the table below. Conversion factors are applied by ingestr. Hence, the returned data is in the units described in the table below as "Conventional units".
Name | Description | Mapped units | Conversion factor | Conventional units |
---|---|---|---|---|
bdod | Bulk density of the fine earth fraction | cg/cm³ | 100 | kg/dm³ |
cec | Cation Exchange Capacity of the soil | mmol(c)/kg | 10 | cmol(c)/kg |
cfvo | Volumetric fraction of coarse fragments (> 2 mm) | cm3/dm3 (vol‰) | 10 | cm3/100cm3 (vol%) |
clay | Proportion of clay particles (< 0.002 mm) in the fine earth fraction | g/kg | 10 | g/100g (%) |
nitrogen | Total nitrogen (N) | cg/kg | 100 | g/kg |
phh2o | Soil pH | pHx10 | 10 | pH |
sand | Proportion of sand particles (> 0.05 mm) in the fine earth fraction | g/kg | 10 | g/100g (%) |
silt | Proportion of silt particles (≥ 0.002 mm and ≤ 0.05 mm) in the fine earth fraction | g/kg | 10 | g/100g (%) |
soc | Soil organic carbon content in the fine earth fraction | dg/kg | 10 | g/kg |
ocd | Organic carbon density | hg/dm³ | 10 | kg/dm³ |
ocs | Organic carbon stocks | t/ha | 10 | kg/m² |
Data is available for the following six layers.
Layer | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Top depth (cm) | 0 | 5 | 15 | 30 | 60 | 100 |
Bottom depth (cm) | 5 | 15 | 30 | 60 | 100 | 200 |
To specify which data is to be ingested, define the settings using
the function get_settings_soilgrids()
, and provide standard
variable names as a vector of character strings for argument
varnam
, and layers as a vector of integers for argument
layer
. For example:
settings_soilgrids <- get_settings_soilgrids(varnam = c("nitrogen", "cec"), layer = 1:3)
The ingested data is then averaged across specified layers, weighted with respective layer depths.
Now, the data can be ingested.
df_soilgrids <- ingest_bysite(
sitename = "CH-Lae",
source = "soilgrids",
lon = 8.36439,
lat = 47.47833,
settings = settings_soilgrids
)
This returns a data frame with a nested column data,
which actually contains just a 1 x 1 tibble. This is to be consistent
with other ingest options. You may prefer to have a normal flat data
frame. Just do:
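# a minimal sketch: flatten the nested 'data' column (requires tidyr)
df_soilgrids %>% tidyr::unnest(data)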
This reads from local files. Download them from ISRIC here.
Point extraction from the global gridded WISE30sec data product (Batjes et al., 2016) can be done for a set of variables and soil layers by specifying the ingest demand in the settings. ingestr returns data as a mean across available map units for selected location (pixel), weighted by the fractional coverage of map units for this pixel. The table below describes available variables (info based on ISRIC Report 2015/01).
Name | Description |
---|---|
CFRAG | Coarse fragments (vol. % > 2mm), mean |
SDTO | Sand (mass %), mean |
STPC | Silt (mass %) |
CLPC | Clay (mass %) |
PSCL | Texture class (SOTER conventions) |
BULK | Bulk density (kg dm-3, g cm-3) |
TAWC | Available water capacity (cm m-1, -33 to -1500 kPa, conform USDA standards) Standard deviation for above |
CECS | Cation exchange capacity (cmol kg-1) of fine earth fraction |
BSAT | Base saturation as percentage of CECsoil |
ESP | Exchangeable sodium percentage |
CECc | CECclay, corrected for contribution of organic matter (cmol kg-1) |
PHAQ | pH measured in water |
TCEQ | Total carbonate equivalent (g C kg-1) |
GYPS | Gypsum content (g kg-1) |
ELCO | Electrical conductivity (dS m-1) |
ORGC | Organic carbon content (g kg-1) |
TOTN | Total nitrogen (g kg-1) |
CNrt | C/N ratio |
ECEC | Effective CEC (cmol kg-1) |
ALSA | Aluminum saturation (as % of ECEC) |
By default, data is extracted for the top layer only. Data is provided for the following seven layers (depths in cm).
Layer | Top depth | Bottom depth |
---|---|---|
1 | 0 | 20 |
2 | 20 | 40 |
3 | 40 | 60 |
4 | 60 | 80 |
5 | 80 | 100 |
6 | 100 | 150 |
7 | 150 | 200 |
The following settings specify data extraction for the C:N ratio of
the top three layers. The returned value is the mean across selected
soil layers, weighted by the respective layer’s depth. dir
specifies the path to the downloaded data bundle. Don’t change the
structure of it. ingestr reads from two files:
<dir>/GISfiles/wise30sec_fin
and
<dir>/Interchangeable_format/HW30s_FULL.txt
.
settings_wise <- get_settings_wise(varnam = c("CNrt"), layer = 1:7)
Now, the data can be ingested.
df_wise <- ingest_bysite(
sitename = "CH-Lae",
source = "wise",
lon = 8.36439,
lat = 47.47833,
settings = settings_wise,
dir = "~/data/soil/wise"
)
Global Soil Dataset for use in Earth System Models (GSDE) by Shangguan et al. 2014, obtained from here. Available variables are given in the table below.
No. | Attribute | units | variable name |
---|---|---|---|
1 | total carbon | % of weight | TC |
2 | organic carbon | % of weight | OC |
3 | total N | % of weight | TN |
7 | pH(H2O) | | PHH2O |
8 | pH(KCl) | | PHK |
9 | pH(CaCl2) | | PHCA |
15 | Exchangeable aluminum | cmol/kg | EXA |
27 | The amount of P using the Bray1 method | ppm of weight | PBR |
28 | The amount of P by Olsen method | ppm of weight | POL |
29 | P retention by New Zealand method | % of weight | PNZ |
30 | The amount of water soluble P | ppm of weight | PHO |
31 | The amount of P by Mehlich method | ppm of weight | PMEH |
33 | Total P | % of weight | TP |
34 | Total potassium | % of weight | TK |
The 8 layers are:
df_layers <- tibble(layer = 1:8, bottom = c(4.5, 9.1, 16.6, 28.9, 49.3, 82.9, 138.3, 229.6)) %>%
mutate(top = lag(bottom)) %>%
mutate(top = ifelse(is.na(top), 0, top))
df_layers
Specify the settings directly as a list with elements
varnam
(a vector of character strings specifying the
variables as defined in the table above), and layer
(a
vector of integers specifying the layers over which a depth-weighted
average is taken).
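For example, a minimal (illustrative) choice, reading total nitrogen averaged over the top four layers:
settings_gsde <- list(varnam = c("TN"), layer = 1:4)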
Now, the data can be ingested.
df_gsde <- ingest_bysite(
sitename = "CH-Lae",
source = "gsde",
lon = 8.36439,
lat = 47.47833,
settings = settings_gsde,
dir = "~/data/soil/shangguan"
)
And data is returned with variables along columns inside a nested
column data
, and sites along rows (as for all ingestr).
Make it flat by:
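# a minimal sketch: flatten the nested 'data' column (requires tidyr)
df_gsde %>% tidyr::unnest(data)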
This ingests Worldclim monthly climatology (averaged over 1970-2000) at 30 seconds spatial resolution by Fick & Hijmans, 2017, obtained here. Available variables are:
Variable name | Description | Units |
---|---|---|
bio | Bioclimatic variables (description here) | |
tmin | Minimum temperature | °C |
tmax | Maximum temperature | °C |
tavg | Average temperature | °C |
prec | Precipitation | mm |
srad | Solar radiation | kJ m-2 day-1 |
wind | Wind speed | m s-1 |
vapr | Water vapour pressure | kPa |
Specify the settings directly as a list with elements
varnam
(a vector of character strings specifying the
variables as defined in the table above), and layer
(a
vector of integers specifying the layers over which a depth-weighted
average is taken).
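For example, a minimal (illustrative) choice, reading average temperature and precipitation:
settings_worldclim <- list(varnam = c("tavg", "prec"))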
Now, the data can be ingested.
df_worldclim <- ingest_bysite(
sitename = "CH-Lae",
source = "worldclim",
lon = 8.36439,
lat = 47.47833,
settings = settings_worldclim,
dir = "/data/archive/worldclim_fick_2017/data"
)
And for Flat-Earthers:
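# a minimal sketch: flatten the nested 'data' column (requires tidyr)
df_worldclim %>% tidyr::unnest(data)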
To collect data from an ensemble of sites, we have to define a meta
data frame, here called siteinfo
, with rows for each site
and columns lon
for longitude, lat
for
latitude, date_start
and date_end
for required
dates (Dates are objects returned by a lubridate::ymd()
function call - this stands for year-month-day). The function
ingest()
can then be used to collect all site-level data as
a nested data frame corresponding to the metadata siteinfo
with an added column named data
where the time series of
ingested data is nested inside.
Note that extracting for an ensemble of sites at once is more
efficient for data types that are global files (WATCH-WFDEI and CRU).
In this case, the raster
package can be used to efficiently
ingest data.
First, define a list of sites and get site meta information. The
required meta information is provided by the exported data frame
siteinfo
(it comes as part of the ingestr package). This
file is created as described in (and using code from) metainfo_fluxnet2015.
mysites <- c("BE-Vie", "DE-Tha", "DK-Sor", "FI-Hyy", "IT-Col", "NL-Loo", "US-MMS", "US-WCr", "US-UMB", "US-Syv", "DE-Hai")
siteinfo <- ingestr::siteinfo_fluxnet2015 %>%
dplyr::filter(sitename %in% mysites) %>%
dplyr::mutate(date_start = lubridate::ymd(paste0(year_start, "-01-01"))) %>%
dplyr::mutate(date_end = lubridate::ymd(paste0(year_end, "-12-31")))
This data frame looks like this:
print(siteinfo)
Next, the data can be ingested for all sites at once. Let’s do it for different data types again.
This ingests meteorological data from the FLUXNET files for variables
temperature, precipitation, VPD, shortwave incoming radiation, net
radiation, and atmospheric pressure. Arguments that are specific for
this data source are provided in the settings
list.
ddf_fluxnet <- ingest(
siteinfo = siteinfo %>% slice(1:3),
source = "fluxnet",
getvars = list(temp = "TA_F",
prec = "P_F",
vpd = "VPD_F",
ppfd = "SW_IN_F",
netrad = "NETRAD",
patm = "PA_F"
),
dir = "/data/scratch/bstocker/FLUXNET-2015_Tier1/20191024/DD/", # adjust this with your local path
settings = list(
dir_hh = "/data/scratch/bstocker/FLUXNET-2015_Tier1/20191024/HH/", # adjust this with your local path
getswc = FALSE),
timescale = "d",
verbose = TRUE
)
Additional variables defined at a daily time scale can be derived from half-hourly data. For example daily minimum temperature can be obtained as follows:
ddf_tmin <- ingest(
siteinfo = siteinfo %>% slice(1:3),
source = "fluxnet",
getvars = list(tmin = "TMIN_F"),
dir = "/data/scratch/bstocker/FLUXNET-2015_Tier1/20191024/DD/", # adjust this with your local path
settings = list(
dir_hh = "/data/scratch/bstocker/FLUXNET-2015_Tier1/20191024/HH/", # adjust this with your local path
getswc = FALSE),
timescale = "d",
verbose = TRUE
)
As described above for a single site, the same function can also be
used to read in other FLUXNET variables (e.g., CO2 flux data) and
conduct data filtering steps. Here, we’re reading daily GPP and
uncertainty (standard error), based on the nighttime flux decomposition
method (""GPP_NT_VUT_REF""
), keep only data where at least
80% is based on non-gapfilled half-hourly data
(threshold_GPP = 0.8
), and where the daytime and
nighttime-based estimates are consistent, that is, where their
difference is below the 97.5% and above the 2.5% quantile
(filter_ntdt = TRUE
, see also
?get_obs_bysite_fluxnet2015
).
settings_fluxnet <- list(
getswc = FALSE,
filter_ntdt = TRUE,
threshold_GPP= 0.8,
remove_neg = FALSE
)
ddf_fluxnet_gpp <- ingest(
siteinfo = siteinfo %>% slice(1:3),
source = "fluxnet",
getvars = list(gpp = "GPP_NT_VUT_REF",
pp_unc = "GPP_NT_VUT_SE"),
dir = "/data/scratch/bstocker/FLUXNET-2015_Tier1/20191024/DD/", # adjust this with your local path
settings = settings_fluxnet,
timescale= "d"
)
This extracts from original WATCH-WFDEI files, provided as NetCDF
(global, 0.5 degree resolution), organised as monthly files containing
all days in each month. The data directory specified here
(dir = "~/data/watch_wfdei/"
) contains subdirectories with
names containing the variable names (corresponding to the ones specified
by the argument getvars = list(temp = "Tair")
).
A bias correction may be applied by specifying the settings as in the
example below. By specifying correct_bias = "worldclim"
(the only option currently available), this uses a high-resolution
(30’’) monthly climatology based on years 1970-2000 and corrects the
WATCH-WFDEI data by month, based on the difference (ratio for variables
other than temperature) of its monthly means, averaged across
1979-2000.
WATCH-WFDEI data is available for years from 1979. If
year_start
is before that, the mean seasonal cycle,
averaged across 1979-1988 is returned for all years before 1979.
This extracts monthly data from the CRU TS data. Interpolation to daily values is done using a weather generator for daily precipitation (given monthly total precipitation and number of wet days in each month), and a polynomial that conserves monthly means for all other variables.
ddf_cru <- ingest(
siteinfo = siteinfo %>% slice(1:2),
source = "cru",
getvars = c("tmax"),
dir = "~/data/cru/ts_4.01/" # adjust this with your local path
)
Check it out for the first site (BE-Vie).
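# a minimal sketch: pull out and flatten the time series for BE-Vie
ddf_cru %>% dplyr::filter(sitename == "BE-Vie") %>% tidyr::unnest(data)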
This uses the MODISTools R package
making its interface consistent with ingestr. Settings can be specified
and passed on using the settings
argument. To facilitate
the selection of data products and bands to be downloaded, you may use
the function get_settings_modis(), which defines defaults for
different data bundles
(c("modis_fpar", "modis_lai", "modis_evi", "modis_ndvi")
are
available).
"modis_fpar"
: MODIS collection 6, MCD15A3H, band
Fpar_500m
"modis_lai"
: MODIS collection 6, MCD15A3H, band
Lai_500m
"modis_evi"
: MODIS collection 6, MOD13Q1, band
250m_16_days_EVI
"modis_ndvi"
: MODIS collection 6, MOD13Q1, band
250m_16_days_NDVI
The filtering criteria are hard-coded specifically for each product,
using its respective quality control information (see function
gapfill_interpol()
in
R/ingest_modis_bysite.R
).
Downloading with parallel jobs is available for the
"modis"
data ingest, using the package multidplyr. This is
not (yet) available on CRAN, but can be installed with
devtools::install_github("tidyverse/multidplyr")
. To do
parallel downloading, set the following arguments in the function
ingest()
:
parallel = TRUE, ncores = <number_of_parallel_jobs>
.
The following example is for downloading MODIS NDVI data.
settings_modis <- get_settings_modis(
bundle = "modis_ndvi",
data_path = "~/data/modis_subsets/",
method_interpol = "loess",
keep = TRUE,
overwrite_raw = FALSE,
overwrite_interpol= TRUE,
network = "FLUXNET"
)
This can now be used to download the data to the directory specified
by argument data_path
of function
get_settings_modis().
df_modis_fpar <- ingest(
siteinfo_fluxnet2015 %>% slice(1:3),
source = "modis",
settings = settings_modis,
parallel = FALSE
)
This can now be used to download the data to the directory specified
by argument data_path
of function
get_settings_modis(). The data downloaded through MODISTools
is then stored in <data_path>/raw/
. When calling the
functions ingest()
or ingest_bysite()
with the
setting overwrite_raw = FALSE
, the raw data file is read
and not re-downloaded if available locally. Raw data contains
information only for dates where MODIS data is provided.
ingest()
and ingest_bysite()
interpolate to
daily values following the setting method_interpol
.
Note also that downloaded raw data are cutouts including pixels within 1
km of the focal point indicated by the site longitude and latitude
(using arguments km_lr = 1.0
and km_ab = 1.0
in the MODISTools::mt_subset()
call). This is hard-coded in
ingestr. To select a smaller radius of pixels around the focal point
included for taking the mean, set the setting n_focal
to an
integer (0:N), with 0 selecting only the single centre pixel in which
the site is located, N=1 for including one pixel around the centre (nine
in total), N=2 for 25 in total etc.
Plot the ingested data.
plot_fapar_ingestr_bysite(
df_modis_fpar$data[[1]] %>%
dplyr::filter(year(date) %in% 2010:2015),
settings_modis)
plot_fapar_ingestr_bysite(
df_modis_fpar$data[[2]] %>%
dplyr::filter(year(date) %in% 2010:2015),
settings_modis)
plot_fapar_ingestr_bysite(
df_modis_fpar$data[[3]] %>%
dplyr::filter(year(date) %in% 2010:2015),
settings_modis)
Using the same settings as specified above, we can download MODIS FPAR data for multiple sites at once from GEE:
settings_gee <- get_settings_gee(
bundle = "modis_fpar",
python_path = system("which python", intern = TRUE),
gee_path = "~/google_earth_engine_subsets/gee_subset/", # adjust this with your local path
data_path = "~/data/gee_subsets/", # adjust this with your local path
method_interpol = "linear",
keep = TRUE,
overwrite_raw = FALSE,
overwrite_interpol= TRUE
)
df_gee_modis_fpar <- ingest(
siteinfo= siteinfo,
source = "gee",
settings= settings_gee,
verbose = FALSE
)
Collect all plots.
list_gg <- plot_fapar_ingestr(df_gee_modis_fpar, settings_gee)
#purrr::map(list_gg, ~print(.))
Ingesting CO2 data is particularly simple. We can safely assume it’s
well mixed in the atmosphere (independent of site location), and we can
use an annual mean value for all days in respective years, and use the
same value for all sites. Using the R package climate, we can load
CO2 data from Mauna Loa directly into R. This is downloading data from
ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt.
Here, ingest()
is a wrapper for the function
climate::meteo_noaa_co2()
.
df_co2 <- ingest(
siteinfo,
source = "co2_mlo",
verbose = FALSE
)
Argument dir
can be provided here, too. In that case,
CO2 data is written (after download if it’s not yet available) and read
to/from a file located at <dir>/df_co2_mlo.csv
.
More info about the climate package and the data can be obtained here and by:
?climate::meteo_noaa_co2
THE FOLLOWING IS UNDER CONSTRUCTION. MAKE READABLE FOR FILE AVAILABLE HERE: http://www.pik-potsdam.de/~mmalte/rcps/
Mauna Loa CO2 is not available for years before 1958. Alternative CO2
data is from CMIP standard forcing with merged time series from
atmospheric measurements and ice core reconstructions. This can be
selected with source = "co2_cmip".
df_co2 <- ingest(
siteinfo,
source = "co2_cmip",
dir = "~/data/co2"
)
This reads from the 1 arc minute resolution ETOPO1 global elevation
data (reading from a Geo-TIFF file). The nested data column contains a
tibble with one value for the variable elv
. Download the data from
here and specify the
local path with the argument dir
.
df_etopo <- ingest(
siteinfo,
source = "etopo1",
dir = "~/data/etopo/" # adjust this with your local path
)
This reads from the 0.05 degrees resolution map of root-zone
water-storage capacity from Stocker et
al. (2023). The nested data column contains a tibble with one value
for the variable whc
. Download the data from here and specify the
local path with the argument dir
.
df_stocker23 <- ingest(
siteinfo,
source = "stocker23",
dir ="~/data/mct_data/" # adjust this with your local path
)
Four steps are required before you can use ingest()
to
get HWSD data:
if(!require(devtools)){install.packages("devtools")}
devtools::install_github("stineb/rhwsd")
# XXX to change to https://github.com/bluegreen-labs/hwsdr
list.of.packages <- c("DBI", "RSQLite")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
Download the HWSD raster data and set the settings
argument accordingly (in this example:
"~/data/hwsd/HWSD_RASTER/hwsd.bil"). Then, use it as above,
providing the path to the downloaded file through the settings
argument:
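A sketch of the multi-site call, mirroring the single-site example above (the file path is the example used throughout):
df_hwsd <- ingest(
  siteinfo,
  source = "hwsd",
  settings = list(fil = "~/data/hwsd/HWSD_RASTER/hwsd.bil")
)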
WWF Ecoregions data are provided as a shapefile, available for
download here,
or here.
A description of the data is available here.
Download the zipped directory and adjust the argument dir
to the path of the directory where file wwf_terr_ecos.shp
is located. Set the settings list with
layer = "wwf_terr_ecos"
. Then, ingest data by:
df_wwf <- ingest(
siteinfo,
source = "wwf",
dir = "~/data/biomes/wwf_ecoregions/official/",
settings = list(layer = "wwf_terr_ecos")
) |>
unnest(data) |>
select(sitename, lon, lat, BIOME_NAME)
The following provides the biome codes. This information is
additionally added by the ingestr package in column
BIOME_NAME
:
Code | Biome |
---|---|
1 | Tropical & Subtropical Moist Broadleaf Forests |
2 | Tropical & Subtropical Dry Broadleaf Forests |
3 | Tropical & Subtropical Coniferous Forests |
4 | Temperate Broadleaf & Mixed Forests |
5 | Temperate Conifer Forests |
6 | Boreal Forests/Taiga |
7 | Tropical & Subtropical Grasslands, Savannas & Shrublands |
8 | Temperate Grasslands, Savannas & Shrublands |
9 | Flooded Grasslands & Savannas |
10 | Montane Grasslands & Shrublands |
11 | Tundra |
12 | Mediterranean Forests, Woodlands & Scrub |
13 | Deserts & Xeric Shrublands |
14 | Mangroves |
Please cite as: Olson, D. M., Dinerstein, E., Wikramanayake, E. D., Burgess, N. D., Powell, G. V. N., Underwood, E. C., D'Amico, J. A., Itoua, I., Strand, H. E., Morrison, J. C., Loucks, C. J., Ricketts, T. H., Kura, Y., Lamoreux, J. F., Wettengel, W. W., Hedao, P., and Kassem, K. R. 2001. Terrestrial ecoregions of the world: a new map of life on Earth. BioScience, 51(11):933–938.
This reads nitrogen deposition from global annual maps by Lamarque et al. (2011). This provides annual data separately for NHx and NOy in gN m⁻² yr⁻¹ from a global map provided at half-degree resolution and covering years 1860-2009.
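A sketch of the multi-site call, mirroring the single-site example above (required years are taken from siteinfo):
df_ndep <- ingest(
  siteinfo,
  source = "ndep",
  timescale = "y",
  dir = "~/data/ndep_lamarque/",
  verbose = FALSE
)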
Point extractions from SoilGrids layers are implemented following this and are provided through ISRIC.
Available layers, variable naming conventions, and units are
described above (section Examples for a single site -
SoilGrids) and here.
Which variable is to be extracted and for which soil depth layer can be
specified in the settings, a list returned by the function call
get_settings_soilgrids()
.
settings_soilgrids <- get_settings_soilgrids(varnam = c("nitrogen", "cec"), layer = 1:3)
Now, the data can be ingested.
df_soilgrids <- ingest(
siteinfo_fluxnet2015 %>% slice(1:3),
source = "soilgrids",
settings = settings_soilgrids
)
This returns a data frame with a nested column data,
which actually contains just a 1 x 1 tibble. This is to be consistent
with other ingest options. You may prefer to have a normal flat data
frame. Just do:
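df_soilgrids %>% tidyr::unnest(data)  # a minimal sketch, flattening the nested column (requires tidyr)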
This reads from local files. Download them from ISRIC here.
See above (section Examples for a single site) for a description of variables and soil layers.
The following settings specify data extraction for the C:N ratio of
the top three layers. The returned value is the mean across selected
soil layers, weighted by the respective layer’s depth. dir
specifies the path to the downloaded data bundle. Don’t change the
structure of it. ingestr reads from two files:
<dir>/GISfiles/wise30sec_fin
and
<dir>/Interchangeable_format/HW30s_FULL.txt
.
settings_wise <- get_settings_wise(varnam = c("CNrt", "CECS"), layer = 1:3)
Now, the data can be ingested.
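A sketch, mirroring the single-site example above:
df_wise <- ingest(
  siteinfo_fluxnet2015 %>% slice(1:3),
  source = "wise",
  settings = settings_wise,
  dir = "~/data/soil/wise"
)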
Global Soil Dataset for use in Earth System Models (GSDE) by Shangguan et al. 2014, obtained from here. Available variables and layers are given in the table in the section above (Examples for a single site - GSDE Soil).
Specify the settings directly as a list with elements
varnam
(a vector of character strings specifying the
variables as defined in the table above), and layer
(a
vector of integers specifying the layers over which a depth-weighted
average is taken).
Now, the data can be ingested.
df_gsde <- ingest(
siteinfo_fluxnet2015 %>% slice(1:2),
source = "gsde",
settings = settings_gsde,
dir = "/data/archive/soil_shangguan_2014/data/"
)
And data is returned with variables along columns inside a nested
column data
, and sites along rows (as for all ingestr).
Make it flat by:
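df_gsde %>% tidyr::unnest(data)  # a minimal sketch, flattening the nested column (requires tidyr)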
This ingests Worldclim monthly climatology (averaged over 1970-2000) at 30 seconds spatial resolution by Fick & Hijmans, 2017, obtained here. Available variables are:
Variable name | Description | Units |
---|---|---|
bio | Bioclimatic variables (description here) | |
tmin | Minimum temperature | °C |
tmax | Maximum temperature | °C |
tavg | Average temperature | °C |
prec | Precipitation | mm |
srad | Solar radiation | kJ m-2 day-1 |
wind | Wind speed | m s-1 |
vapr | Water vapour pressure | kPa |
Specify the settings directly as a list with elements
varnam
(a vector of character strings specifying the
variables as defined in the table above), and layer
(a
vector of integers specifying the layers over which a depth-weighted
average is taken).
Now, the data can be ingested.
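A sketch, mirroring the single-site example above (the settings list here is an illustrative choice):
settings_worldclim <- list(varnam = c("tavg", "prec"))
df_worldclim <- ingest(
  siteinfo_fluxnet2015 %>% slice(1:3),
  source = "worldclim",
  settings = settings_worldclim,
  dir = "/data/archive/worldclim_fick_2017/data"
)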
ingestr is, among other things, designed to collect forcing data for the rsofun
modelling framework. Within that, you may choose to run the P-model for
predicting leaf-level quantities (acclimated Vcmax and Jmax). These are
independent of fAPAR. Still, forcing for fAPAR (variable standard name
fapar) is required and can be set to 1.0 for all sites and
required dates. To get an object in the required standard format, use
ingest() with source = "fapar_unity":
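A minimal sketch:
df_fapar_unity <- ingest(
  siteinfo,
  source = "fapar_unity"
)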