map2tidy_example.Rmd
Geospatial data often has a time dimension. Such temporal geospatial data typically comes either as multiple files that each contain a single time step (one geospatial map per file), or as files that each contain a subset of the time steps. This is a problem for analyses and modelling that need access to the whole time series. The naive way to obtain the whole time series is to read all files into memory and then run the analyses. In practice, this is often not possible due to memory limitations. What can we do?
We have to open each file, read a subset of the data, close the file, open the next one, and “stitch” the data together along the time axis. Then we repeat this for the remaining subsets of the data until we have the time series of all geospatial units (usually pixels). This creates a burdensome computational overhead and involves a lot of programming, especially if the analyses are to be carried out in parallel across the spatial units.
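The manual procedure described above can be sketched as follows. This is only an illustration of the open, read, close, and stitch steps, not how the package does it internally. It assumes the ncdf4 package is installed; the helper names `make_demo_file` and `read_band`, the file names, and the toy values are all invented for this example.

```r
library(ncdf4)

# Create a tiny NetCDF file (3 lon x 2 lat x length(time_vals) time steps).
# Each cell simply stores its own time value, so the result is easy to check.
make_demo_file <- function(path, time_vals) {
  lon  <- ncdim_def("lon",  "degrees_east",  c(0, 1, 2))
  lat  <- ncdim_def("lat",  "degrees_north", c(10, 11))
  time <- ncdim_def("time", "days since 2017-01-01", time_vals)
  et   <- ncvar_def("et", "mm d-1", list(lon, lat, time), missval = -9999)
  nc   <- nc_create(path, et)
  vals <- array(rep(time_vals, each = 3 * 2), dim = c(3, 2, length(time_vals)))
  ncvar_put(nc, et, vals)
  nc_close(nc)
  path
}

# Two files, each holding a different subset of the time steps
demo_files <- c(
  make_demo_file(file.path(tempdir(), "demo_part1.nc"), 0:3),
  make_demo_file(file.path(tempdir(), "demo_part2.nc"), 4:7)
)

# Open one file, read a single longitude band (all latitudes, all time
# steps), and close the file again.
read_band <- function(path, ilon) {
  nc <- nc_open(path)
  on.exit(nc_close(nc))
  # start/count select a hyperslab; -1 means "everything along this dimension"
  ncvar_get(nc, "et", start = c(ilon, 1, 1), count = c(1, -1, -1))
}

# Stitch the band from every file together along the time axis (columns)
band <- do.call(cbind, lapply(demo_files, read_band, ilon = 2))
dim(band)  # 2 latitudes x 8 time steps
```

Covering the whole map means repeating this for every longitude index, which is the overhead the paragraph above refers to.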
This package solves this problem.
The function `map2tidy` takes a vector of file names (full paths) of NetCDF files containing geospatial data and returns a tidy data frame where each row represents one pixel of the geospatial (raster) data and the complete time series of each pixel is contained as a nested data frame in the column `data`.
library(map2tidy)
library(dplyr)
library(ggplot2)
# path to the directory holding the demo files
path <- file.path(system.file(package = "map2tidy"),"extdata")
# list demo files
files <- list.files(path, pattern = "demo_data_2017_month", full.names = TRUE)
# load and convert
df <- map2tidy(
  nclist = files,
  varnam = "et",
  lonnam = "lon",
  latnam = "lat",
  timenam = "time",
  timedimnam = "time"
)
head(df)
## # A tibble: 6 × 3
## # Groups: lon, lat [6]
## lon lat data
## <dbl> <dbl> <list>
## 1 -0.0250 30.0 <tibble [365 × 2]>
## 2 -0.0250 30.0 <tibble [365 × 2]>
## 3 -0.0250 30.1 <tibble [365 × 2]>
## 4 -0.0250 30.1 <tibble [365 × 2]>
## 5 -0.0250 30.2 <tibble [365 × 2]>
## 6 -0.0250 30.2 <tibble [365 × 2]>
The complete time series are now nested data frames in column `data`. We can plot them.
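A minimal sketch of such a plot for a single pixel is shown here. To keep it self-contained, it builds a small stand-in data frame `df_demo` with the same shape as the output above. The column names inside the nested data frames (`time` and `et`), the coordinates, and the values are assumptions made for illustration; the tidyr package is assumed to be installed for `unnest()`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in for the map2tidy() output: one row per pixel, time series nested
# in `data`. Inner column names and all values are invented.
dates <- as.Date("2017-01-01") + 0:364
df_demo <- tibble(
  lon = c(-0.025, -0.025),
  lat = c(30.025, 30.075),
  data = list(
    tibble(time = dates, et = sin(0:364 / 58)),
    tibble(time = dates, et = cos(0:364 / 58))
  )
) %>%
  group_by(lon, lat)

# Drop the grouping, pick the first pixel, and unnest its time series
ts_one <- df_demo %>%
  ungroup() %>%
  slice(1) %>%
  unnest(data)

# Line plot of the single-pixel time series
gg <- ggplot(ts_one, aes(x = time, y = et)) +
  geom_line() +
  labs(title = "Time series of a single pixel", x = "Date", y = "et")
gg
```

The `ungroup()` call matters because the returned data frame is grouped by `lon` and `lat`; without it, `slice(1)` would return the first row of every group rather than a single pixel.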
When handling large files, avoid having the complete data frame returned, since it may not fit into memory. Instead, write chunks of the data to separate files, placed in the directory specified by the argument `outdir`, with file names specified by the argument `fileprefix`. Each chunk is a longitudinal band: a single index in longitude and all indices in latitude.
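To illustrate what chunking by longitudinal band means, here is a small base R sketch that writes one file per longitude index and then processes the files one at a time. It is not the package's implementation: the toy data cube, the `.rds` format, and the file naming scheme (`<fileprefix>_ilon_<index>.rds`) are assumptions chosen for this example only.

```r
# Toy data cube standing in for the raster time series: 3 lon x 2 lat x 5 time
lon  <- c(0, 1, 2)
lat  <- c(10, 11)
time <- as.Date("2017-01-01") + 0:4
cube <- array(seq_len(3 * 2 * 5), dim = c(3, 2, 5))

outdir     <- file.path(tempdir(), "chunks_demo")
fileprefix <- "demo_data_2017"
dir.create(outdir, showWarnings = FALSE)

# Write one file per longitude index (hypothetical naming scheme).
# Each file holds all latitudes of that band, with nested time series.
for (ilon in seq_along(lon)) {
  band <- data.frame(lon = lon[ilon], lat = lat)
  band$data <- lapply(seq_along(lat), function(ilat) {
    data.frame(time = time, et = cube[ilon, ilat, ])
  })
  saveRDS(band, file.path(outdir, sprintf("%s_ilon_%d.rds", fileprefix, ilon)))
}

# Later: process one band at a time, so only a single chunk is in memory
chunk_files <- list.files(outdir, pattern = paste0("^", fileprefix),
                          full.names = TRUE)
pixel_means <- do.call(rbind, lapply(chunk_files, function(f) {
  band <- readRDS(f)
  data.frame(
    lon     = band$lon,
    lat     = band$lat,
    et_mean = vapply(band$data, function(d) mean(d$et), numeric(1))
  )
}))
pixel_means
```

Because each file contains complete time series for its pixels, per-pixel analyses can run on one chunk independently of the others, which also makes them straightforward to distribute across parallel workers.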