jfsp_extract.Rmd
This vignette provides an overview of the data extraction and data set curation pipeline for ALFRESCO simulation outputs, using the JFSP project as a case study. The processing chain involves two main stages: data extraction and data curation. In the first stage, data are extracted from the geotiffs output by ALFRESCO. In the second stage, these data are curated and prepared for use in downstream SNAP projects. An initial round of file preparation precedes both stages, but it is a minor step relative to the overall process of extracting and curating the data.
Not all files output by ALFRESCO are necessary. Because there are so many, it is helpful to make a smaller, isolated copy of only the required subset. This also creates an opportunity to reorganize the files into a structure that is more useful and can be made consistent across different ALFRESCO projects.
Create a bash script from R on Atlas. Note the working directory, since the script is written there and run from it below.
library(alfresco)
copy_alf_outputs("JFSP", "/big_scratch/shiny/Runs_Statewide/paul.duffy\\@neptuneinc.org",
domain = "ak1km")
Then exit R. From the command line, run the bash script for each model run, e.g.:
bash copy_alf_outputs.sh historical CRU32 historical fmo00s00i
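Since the copy script is run once per model run, the calls can also be scripted. A minimal sketch of such a loop is shown below, assuming the first three arguments stay fixed for the historical CRU runs; only the fmo00s00i treatment name appears in this vignette, so the remaining treatment directory names must be added to the list.
for fmo in fmo00s00i; do  # add the remaining fire management treatment names here
  bash copy_alf_outputs.sh historical CRU32 historical "$fmo"
done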
See the documentation for copy_alf_outputs for details on command line arguments.
To perform efficient data extraction on ALFRESCO output maps for the JFSP project, the first step is to create a slurm script to be executed by the SLURM job manager on SNAP’s Atlas cluster. This script will run an accompanying R script that performs the extraction on the geotiffs output by ALFRESCO. To generate these two files, do the following:
alf_extract_slurm(domain = "ak1km", project = "JFSP", years = 1950:2013, reps = 1:32,
cru = TRUE)
#!/bin/bash
#SBATCH --exclusive
#SBATCH --mail-type=END
#SBATCH --mail-user=mfleonawicz@alaska.edu
#SBATCH --account=snap
#SBATCH -p main
#SBATCH --ntasks=64 # one for each year
#SBATCH --job-name=alf_extract
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
mpirun --oversubscribe -np 1 Rscript /atlas_scratch/mfleonawicz/alfresco/slurm_jobs/alf_extract.R domain=\'ak1km\' rmpi=TRUE project=\'JFSP\' years=1950:2013 reps=1:32 cru=TRUE $1 $2 $3
This creates a slurm script with some hardcoded arguments that are passed on to the R script. This is convenient because the slurm script will be run several times, once for each combination of climate model and emissions scenario (GCM and RCP), and it is nice not to have to repeatedly specify the other command line arguments that do not change from one set of ALFRESCO outputs to the next. Note how domain and project were coded into the script above. When passing string arguments at the command line to this and other slurm scripts, to be subsequently passed on to R, remember to escape them.
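For example, the data curation script introduced later in this vignette takes a period argument as a string. The escaped form reaches R intact, while the unescaped form does not:
sbatch alf_prep.slurm period=\'historical\'  # R receives period='historical'
sbatch alf_prep.slurm period='historical'    # the shell strips the quotes and R fails looking for an object named historical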
The data extraction procedure makes use of the Rmpi package, with a master R process running on the Atlas head node and 64 slave processes (one for each year) running across two compute nodes with 32 CPUs each.
This function also creates a template R script for standard extractions, shown below. It reveals that Rmpi is not a requirement. Alternatively, you can run the same R script on a single compute node, parallelizing over multiple CPUs with the base R parallel package. However, using Rmpi on the Atlas cluster offers much greater access to total CPUs and RAM by coordinating the extractions across multiple compute nodes. Note that you do not need to write a script like this; the alfresco package creates it for you.
library(rgdal)
library(raster)
library(alfresco)
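# Parse command line arguments of the form name=value (e.g. modelIndex=1) into R objects.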
cargs <- (commandArgs(TRUE))
if (!length(cargs)) q("no") else for (i in 1:length(cargs)) eval(parse(text = cargs[[i]]))
extract_alf_stops()
if (!exists("rmpi")) rmpi <- TRUE
if (exists("repSample") && is.numeric(repSample)) {
set.seed(47)
reps <- sort(sample(reps, min(repSample, length(reps))))
cat("Sampled replicates:\n", reps, "\n")
}
if (!exists("cru")) cru <- FALSE
if (rmpi) {
library(Rmpi)
mpi.spawn.Rslaves(needlog = TRUE) # remainder of setup performed by run_alf_extraction
mpi.bcast.cmd(library(rgdal))
mpi.bcast.cmd(library(raster))
mpi.bcast.cmd(library(alfresco))
} else {
library(parallel)
n.cores <- 32
tmp_dir <- paste0(alfdef()$raster_tmp_dir, "procX")
rasterOptions(chunksize = 1e+11, maxmemory = 1e+12, tmpdir = tmp_dir)
}
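# Geographic subdomain cells, vegetation class labels and ALFRESCO output directories for the project.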
cells <- get_domain_cells(domain)
veg_labels <- get_veg_labels(domain)
dirs <- get_out_dirs(domain, project, cru)
main_dirs <- rep(paste0(dirs, "/Maps")[modelIndex], each = length(years))
main_dir <- main_dirs[1] # all equal when single modelIndex
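# Run the extractions over all replicates and years for each output type ("fsv" and "av").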
run_alf_extraction(domain, type = "fsv", main_dir = main_dir, project = project,
reps = reps, years = years, cells = cells, veg_labels = veg_labels, cru = cru,
rmpi = rmpi)
run_alf_extraction(domain, type = "av", main_dir = main_dir, project = project,
reps = reps, years = years, cells = cells, veg_labels = veg_labels, cru = cru,
rmpi = rmpi)
if (rmpi) {
mpi.close.Rslaves(dellog = TRUE)
mpi.exit()
}
Note that the alfresco package assumes sensible defaults for file paths and organization based on the Atlas file system. Unless specifying a custom location with out_dir, the generated scripts will be placed in an Atlas directory based on file paths stored in alfresco::alfdef(). See the documentation for alf_extract_slurm for usage details.
The next step is to begin the data extraction from the raster layers output by ALFRESCO as geotiff files. To do this, execute the slurm script at the Atlas command line, making sure to supply any required arguments that were not written directly into the slurm script when it was generated. Assuming you created the scripts as shown above and are in the directory where the slurm and R scripts live,
alfdef()$alf_slurm_dir
#> [1] "/atlas_scratch/mfleonawicz/alfresco/slurm_jobs"
simply call sbatch as follows:
sbatch alf_extract.slurm modelIndex=1
The modelIndex argument refers to the position in the list of climate model (or climate model/RCP group) directories pertaining to the given ALFRESCO project. For example, if an ALFRESCO project uses five GCMs and three RCPs, there are 15 directories of ALFRESCO outputs under the project directory. Each of these contains GCM/RCP-specific run outputs in a Maps subdirectory. The sbatch call is made for each of these 15 sets of GCM/RCP outputs.
This example deals only with historical run output over 16 combinations of fire management options as part of an experimental design. You have conveniently built into the slurm script all of the arguments that are constant across these 16 treatments, such as the project name, the spatial domain of the ALFRESCO runs, the fact that the runs are based on CRU historical climate data rather than projected GCM outputs, and the sets of years and simulation replicates. This keeps the sbatch call short, having only to iterate over modelIndex for all 16 ALFRESCO runs without specifying the rest. For a different ALFRESCO project, you would generate a different slurm script. The template R script run by the slurm script is the same in every case; what varies is how general or specific the slurm script is, which depends on what is explicitly specified in alf_extract_slurm.
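For example, all 16 historical runs could be submitted with a short shell loop. This loop is not part of the generated scripts, but follows directly from the sbatch call shown above.
for i in $(seq 1 16); do
  sbatch alf_extract.slurm modelIndex=$i
done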
The extracted data are saved to .rds files that contain a data frame of fire size by vegetation class for all simulation replicates for the selected model and scenario. There is a unique file for each geographic subdomain in cells (see the R script above), which typically is based on standard region data sets compiled in the snappoly and snapgrid R packages. See their associated documentation for details.
These files are used in a subsequent data prep process where finalized data frames are compiled that contain probability distributions of several random variables extracted from ALFRESCO model run outputs.
The final step when extracting data from ALFRESCO output geotiffs is to compute distributions of random variables and store them in prepared tables for later use across a range of SNAP projects.
Random variables whose probability distributions are estimated from ALFRESCO simulation output include burned area, fire counts, vegetated area by cover class, and vegetation (stand) age.
Each of these random variables has a unique estimated probability distribution associated with each geographic region the data extractions and density estimations were applied to. For areas and counts, these refer to spatially aggregated random variables. Therefore, the distributions of these random variables are across ALFRESCO simulation replicates only. By comparison, when similar distributions are estimated for downscaled climate data, those variables vary in space only and do not have an analog to the set of repeated ALFRESCO simulations. Vegetation age is unique in that the distribution of stand ages is across both simulation replicates and space.
As with data extraction, generate a slurm and R script pair:
alf_prep_slurm(project = "JFSP", reps = 1:32)
#!/bin/bash
#SBATCH --exclusive
#SBATCH --mail-type=END
#SBATCH --mail-user=mfleonawicz@alaska.edu
#SBATCH --account=snap
#SBATCH -p main
#SBATCH --ntasks=32
#SBATCH --job-name=alf_prep
#SBATCH --nodes=1
# Required arguments: project, period, reps.
# Optional arguments: in_dir, out_dir.
# Example usage: project='JFSP', period='historical', reps=1:32.
# Defaults: in_dir and out_dir use alfresco package defaults from alfdef().
Rscript /atlas_scratch/mfleonawicz/alfresco/slurm_jobs/alf_prep.R project=\'JFSP\' reps=1:32 $1 $2 $3
library(alfresco)
library(parallel)
cargs <- (commandArgs(TRUE))
if (!length(cargs)) q("no") else for (z in 1:length(cargs)) eval(parse(text = cargs[[z]]))
prep_alf_stops()
if (!exists("in_dir")) in_dir <- file.path(alfdef()$alf_extract_dir, project,
"extractions")
if (!exists("out_dir")) out_dir <- file.path(snapprep::snapdef()$dist_dir, "alfresco",
project)
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
inputs <- alf_dist_inputs(project)
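# Number of unique location group/location combinations for each variable type (fsv, veg, age).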
n <- purrr::map_int(c("fsv", "veg", "age"), ~{
x <- dplyr::filter(inputs, Var == .x)
length(unique(paste(x$LocGroup, x$Location)))
})
mc.cores <- 32
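# Estimate and write the distribution tables in parallel, one location per task, for each variable type.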
mclapply(1:n[1], alf_dist, in_dir = file.path(in_dir, "fsv"), out_dir = out_dir,
period = period, reps = reps, mc.cores = mc.cores)
mclapply(1:n[2], alf_dist, in_dir = file.path(in_dir, "veg"), out_dir = out_dir,
period = period, reps = reps, mc.cores = mc.cores)
mclapply(1:n[3], alf_dist, in_dir = file.path(in_dir, "age"), out_dir = out_dir,
period = period, reps = reps, mc.cores = mc.cores)
Then, for example, execute on the historical period extractions after exiting R. Remember that string arguments need to be escaped, as shown below.
sbatch alf_prep.slurm period=\'historical\'
By default, the .rds files containing the probability distribution data frames are saved to:
snapprep::snapdef()$dist_dir
#> [1] "/workspace/UA/mfleonawicz/data"
Note that you won’t have access to this directory and some others on the Atlas file system or the SNAP network, so make sure to specify alternate in_dir and out_dir paths as necessary.
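For example, non-default locations can be passed to the slurm script as escaped string arguments; the paths below are placeholders.
sbatch alf_prep.slurm period=\'historical\' in_dir=\'/path/to/extractions\' out_dir=\'/path/to/output\'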
These files represent the final output of the ALFRESCO output data processing pipeline. The .rds files containing the data frames of empirically estimated probability densities of random variables for a range of geographic regions, climate models, emissions scenarios and time periods are utilized extensively in a number of downstream SNAP projects. Some of these data sets are also incorporated into a variety of SNAPverse data packages for use by a broader audience.