Working with SNAP data: ALFRESCO

The three A's

Access
Assemble
Analyze

Data extraction, data prep, and statistical analysis

Abstraction

Data prep should be abstracted from any analysis
"Prep" implies targeted usage, but can be generalized with careful planning
Extraction is treated separately from subsequent data prep and is most important to abstract from analysis

Extract once, analyze freely

A series of 100 procedures might consist of 50 heavy duty extraction tasks, each followed by some kind of exploration of the extracted data.

It should consist of one extraction task followed by 99 unique data analyses.

Example:

Extract data from 500,000 spatially explicit geotiffs

Data are tabled and stored efficiently for subsequent access.
No need to return repeatedly to raw maps to perform redundant extractions.
Data describe distributions of random variables rather than specific statistics

Organize data for easy analysis

Final data formats are standardized tables
General enough to work across multiple types of RV data sets, allowing easy merging and comparison of multiple data sets

Load data into a specific project to analyze

Statistical modeling: A closer look at data prep

Probability distributions

Let \(X\) be random variable, e.g., stand age, burn area, precipitation, etc.

Model the probability density function (pdf) of \(X\), \(f(X)\)

Achieve substantial data reduction without effective loss of information
Any statistic based on the RV can be calculated from its pdf.
No need to return to raw maps each time a new summary statistic is requested; compute it with the distribution model

The right tools for the job

Data manipulation

Standardized data table formats
Easy to work with, easy to read
Fast operations
Simple tables can store complex data structures and models, which are easily indexed, extracted or unnested

Example data table

##         Scenario       Model   Vegetation Year       Val         Prob
##      1: SRES A1B CCCMAcgcm31 Black Spruce 2010  76193.85 2.447226e-20
##      2: SRES A1B CCCMAcgcm31 Black Spruce 2010  76219.42 1.328957e-20
##      3: SRES A1B CCCMAcgcm31 Black Spruce 2010  76244.99 1.791705e-20
##      4: SRES A1B CCCMAcgcm31 Black Spruce 2010  76270.56 2.320447e-20
##      5: SRES A1B CCCMAcgcm31 Black Spruce 2010  76296.13 1.531966e-20
##     ---                                                              
## 299996:  SRES B1  ukmoHADcm3 White Spruce 2019 110451.77 1.402891e-20
## 299997:  SRES B1  ukmoHADcm3 White Spruce 2019 110469.13 4.138351e-21
## 299998:  SRES B1  ukmoHADcm3 White Spruce 2019 110486.48 1.690414e-20
## 299999:  SRES B1  ukmoHADcm3 White Spruce 2019 110503.84 2.569890e-20
## 300000:  SRES B1  ukmoHADcm3 White Spruce 2019 110521.20 1.547481e-20

Standardized across multiple variables
Easy to combine and compare

Straightforward manipulation, clear code

data %>% filter(Model=="CCCMAcgcm31") %>% group_by(Scenario, Model, Vegetation) %>%
    summarise(Decade="2010s", Avg=mean(Val)) %>% data.table

##    Scenario       Model   Vegetation Decade       Avg
## 1: SRES A1B CCCMAcgcm31 Black Spruce  2010s  89169.60
## 2: SRES A1B CCCMAcgcm31 White Spruce  2010s  96849.50
## 3:  SRES A2 CCCMAcgcm31 Black Spruce  2010s  91796.50
## 4:  SRES A2 CCCMAcgcm31 White Spruce  2010s 100069.50
## 5:  SRES B1 CCCMAcgcm31 Black Spruce  2010s  91327.85
## 6:  SRES B1 CCCMAcgcm31 White Spruce  2010s 100125.70

Intuitive verbs
Read left-to-right
Ideal for standard, human-guided interactive data analysis as well as non-interactive processing

Fast operations

Data table manipulations, restructuring, summarizing and statistical computations are fast
Comparable to matrix arithmetic with slight overhead
Core functionality optimized in C++, accessed via Rcpp

Organize complex data structures and statistical models in small tables

Visually compress large data tables to see the big picture at a glance

table1 <- data %>% filter(Model=="CCCMAcgcm31" & Year==2010) %>%
    group_by(Scenario, Vegetation) %>%
    do(Density=data.table(Val=Val, Prob=Prob), # column collapse
       Stats=data.frame(as.list(summary(Val))), # summary stats
       Linear_model=lm(Val~I(rnorm(length(Val)))) # complex object
    ) %>% data.table
table1

##    Scenario   Vegetation      Density        Stats Linear_model
## 1: SRES A1B Black Spruce <data.table> <data.frame>         <lm>
## 2: SRES A1B White Spruce <data.table> <data.frame>         <lm>
## 3:  SRES A2 Black Spruce <data.table> <data.frame>         <lm>
## 4:  SRES A2 White Spruce <data.table> <data.frame>         <lm>
## 5:  SRES B1 Black Spruce <data.table> <data.frame>         <lm>
## 6:  SRES B1 White Spruce <data.table> <data.frame>         <lm>

Access arbitrary objects from data table cells

table1$Linear_model[[1]] # Index and extract a specific regression model

## 
## Call:
## lm(formula = Val ~ I(rnorm(length(Val))))
## 
## Coefficients:
##           (Intercept)  I(rnorm(length(Val)))  
##               88974.3                 -284.7

Unnest the Summary column

table1 %>% select(Scenario, Vegetation, Stats) %>% unnest %>% data.table

##    Scenario   Vegetation  Min. X1st.Qu. Median  Mean X3rd.Qu.   Max.
## 1: SRES A1B Black Spruce 76190    82580  88970 88970    95350 101700
## 2: SRES A1B White Spruce 88930    92480  96030 96030    99580 103100
## 3:  SRES A2 Black Spruce 76430    83140  89850 89850    96560 103300
## 4:  SRES A2 White Spruce 90350    93840  97330 97330   100800 104300
## 5:  SRES B1 Black Spruce 76600    83200  89800 89800    96390 103000
## 6:  SRES B1 White Spruce 90650    94080  97500 97500   100900 104400

Unnest the Density column to recover the original table values

table1 %>% select(-Stats, -Linear_model) %>% unnest %>% data.table

##       Scenario   Vegetation       Val         Prob
##    1: SRES A1B Black Spruce  76193.85 2.447226e-20
##    2: SRES A1B Black Spruce  76219.42 1.328957e-20
##    3: SRES A1B Black Spruce  76244.99 1.791705e-20
##    4: SRES A1B Black Spruce  76270.56 2.320447e-20
##    5: SRES A1B Black Spruce  76296.13 1.531966e-20
##   ---                                             
## 5996:  SRES B1 White Spruce 104299.08 1.599720e-18
## 5997:  SRES B1 White Spruce 104312.79 3.402925e-19
## 5998:  SRES B1 White Spruce 104326.51 4.896929e-20
## 5999:  SRES B1 White Spruce 104340.23 2.233027e-21
## 6000:  SRES B1 White Spruce 104353.95 2.369216e-21

Distributions of random variables

Working with probability distributions

Joint distributions: \(f_{X_1,X_2,...,X_n}(X_1,X_2,...,X_n)\)

Multivariate distributions, indirectly represented by the table as a whole
Relatively trivial in this context since most factors are fixed effects

Conditional distributions: \(f_{X,Y}(X|Y=y)\)

Tables are a series of conditional distributions of \(X\)
Accessing a conditional distribution simply requires row filtering

Marginal distributions: \(f_{X,Y}(X)\)

Accessible by integrating over levels of \(Y\)

Using distribution tables: Layer of abstraction between user and detailed code

Ex: Condition, marginalize, sample

ws2014 <- data %>% filter(Vegetation=="White Spruce" & Year==2014)
values <- ws2014 %>% marginalize(margin="Model") %>%
    sample_densities
values2 <- ws2014 %>% marginalize(margin=c("Scenario", "Model")) %>%
    sample_densities

Colored lines: \(f_{X,Model}(X|Scenario)\)

Conditional distributions of \(X\) given scenario, marginalized over Model

Black line: \(f_{X,Scenario,Model}(X)\)

Marginal distribution of \(X\) with respect to Scenario and Model

Compare distributions of \(X\) defined over an increasing set of interacting factors

XgivenSM <- ws2014 %>% filter(Scenario=="SRES B1" & Model=="ukmoHADcm3") # condition on specific scenario and model
XgivenS <- ws2014 %>% filter(Scenario=="SRES B1") %>% marginalize("Model") # condition on a scenario, marginalize over models
X <- ws2014 %>% marginalize(c("Scenario", "Model")) # marginalize over scenarios and models

Marginal distributions tend to be wider

Conditioning on fixed levels of variables tends to narrow a distribution

The distribution of \(X\) is not independent of the distribution of \(Y\) (e.g., Scenario, Model) or the distribution of \(X\) would be the same regardless of the other variables' levels.

Sanity check: Does a distribution "wash out" if manipulated too many times?

Sometimes several iterations of distribution merging may be required
Each merge requires a cycle of sampling from the densities to be merged and estimating a new density from the pooled sample
If sample is sufficiently large and density sufficiently flexible, no effective information lost

Use bootDenCycle() test function to perform 10 sequential cycles of bootstrap resampling and density re-estimation

Summarizing uncertainty in a random variable

Uncertainty can be summarized with a certainty interval
Uncertainty in \(X\) compounds across multiple factors, widening the certainty interval
Individual uncertainty components by factor
Proportional uncertainty by factor
Hierarchical stepwise addition of conditional uncertainty by factor
Pairwise comparison of uncertainty from a factor due to specific factor levels
Constraints on maximum uncertainty/minimum certainty

Functions to summarise uncertainty include uc_table(), uc_components() and uc_stepwise()

uc_table provides quick access to certainty bounds
Bounds are based on a flexible definition, default- or user-provided

## Source: local data table [300 x 9]
## Groups: Scenario, Model, Vegetation, Year
## 
##    Scenario       Model   Vegetation  Year       LB     Mean        UB
##      (fctr)       (chr)       (fctr) (int)    (dbl)    (dbl)     (dbl)
## 1  SRES A1B CCCMAcgcm31 Black Spruce  2010 84010.98 90883.17  98537.57
## 2  SRES A1B CCCMAcgcm31 White Spruce  2010 90899.21 95488.15 100286.71
## 3  SRES A1B CCCMAcgcm31 Black Spruce  2011 84139.60 90654.30  97475.67
## 4  SRES A1B CCCMAcgcm31 White Spruce  2011 90300.65 95353.42 100440.58
## 5  SRES A1B CCCMAcgcm31 Black Spruce  2012 84975.92 91438.53  98349.17
## 6  SRES A1B CCCMAcgcm31 White Spruce  2012 90970.05 96078.47 101142.78
## 7  SRES A1B CCCMAcgcm31 Black Spruce  2013 81972.41 89166.18  96106.33
## 8  SRES A1B CCCMAcgcm31 White Spruce  2013 89656.30 94066.03  98652.40
## 9  SRES A1B CCCMAcgcm31 Black Spruce  2014 82268.22 89405.03  96110.89
## 10 SRES A1B CCCMAcgcm31 White Spruce  2014 89835.90 94301.33  98896.68
## ..      ...         ...          ...   ...      ...      ...       ...
## Variables not shown: Magnitude (dbl), Type (chr)

Uncertainty examples

The right data for the job

Analysis driven by specific research questions

Data exploration and statistical analysis become easier to perform
Questions become easier to address

Inverse probability: from \(f(X|Y=y)\) to \(f(Y|X=x)\)

Example: First look at a joint distribution of \(X\) and \(Y\), \(f_{X,Y}(X,Y)\)

\(X\) is total regional black spruce area in km\(^2\). \(Y\) is Scenario. Other factors fixed.

##       Scenario   Vegetation       Val         Prob
##    1: SRES A1B Black Spruce  76193.85 2.447226e-20
##    2: SRES A1B Black Spruce  76219.42 1.328957e-20
##    3: SRES A1B Black Spruce  76244.99 1.791705e-20
##    4: SRES A1B Black Spruce  76270.56 2.320447e-20
##    5: SRES A1B Black Spruce  76296.13 1.531966e-20
##   ---                                             
## 2996:  SRES B1 Black Spruce 102883.70 2.508392e-21
## 2997:  SRES B1 Black Spruce 102910.11 9.755743e-21
## 2998:  SRES B1 Black Spruce 102936.53 7.272612e-21
## 2999:  SRES B1 Black Spruce 102962.94 2.608479e-21
## 3000:  SRES B1 Black Spruce 102989.35 1.521653e-21

The table can be viewed as an indirect representation of the joint distribution of multiple random variables.
Directly, it is a sequence of conditional distributions of \(X\) given unique combinations of factor levels of other variables.
Marginal distributions of \(X\) with respect to a factor, e.g., Scenario or Model, require integrating out the factor.

To simplify, classify \(X\) into three categories. Now \(X\) and \(Y\) are both categorical.

Joint pmf Spruce Cover & Scenario
	Small	Medium	Large	Margin
SRES B1	0.0085	0.3012	0.0237	0.3333
SRES A1B	0.0134	0.3060	0.0139	0.3333
SRES A2	0.0076	0.3020	0.0237	0.3333
Margin	0.0295	0.9092	0.0613	1.0000

\[f(X,Y)/f(Y)=f(X|Y=y)\]

\[f(X,Y)/f(X)=f(Y|X=x)\]

f(X | Y = SRES B1)
Small	0.0254
Medium	0.9035
Large	0.0711

f(Y | X = Small)
SRES B1	0.2873
SRES A1B	0.4559
SRES A2	0.2568

To simplify, let's look at a more basic joint probability distribution we can construct from this table.

It is displayed as a 3 x 3 contingency table, a simple representation of a joint probability mass function where the two variables each have three discrete levels.
The nine cells in the table are non-negative and sum to one. They represent probabilities of each combination of the variables' values.
The margins also sum to one. This is where the term marginal distribution comes from; the margins of a contingency table which express the probabilities of levels of a random variable regardless of the level of the other variable.

While area was classified for simplicity, research questions also typically lead to classifying continuous random variables at some point in an analysis in order to make a statement about values in some range of interest.

We may have already used the full table to make statements about the probability that stand age is greater than x, given a specific scenario, or marginalized over all scenarios. Here the single cell marking that combination of age range and scenario level gives the joint probability.

The conditional probability of this age range given the specific scenario is the joint probability divided by the marginal probability of the specific scenario. Each scenario has a trivial probability of 1/3 since scenario is a fixed effect.

While we know the probability of a specific scenario is 1/3, uninteresting for a fixed effect, we can see how it can be derived as the joint probability divided by the conditional probability.

Why is this worth mentioning? Knowing this, and looking at the joint probability distribution of X and scenario in a contingency table, it's intuitive to see that we can work in the opposite direction.

While sometimes it is of interest to know the probability that X is greater than x given a specific scenario, sometimes research questions translate into asking the opposite: what is the probability that we are working with or looking at output from a specific scenario, given our interest in when X is greater than x?

This upper tail of the distribution of X may more likely be seen in some scenarios than others.

Conditional on the values of interest for the random variable X, we have a probability mass function for scenario. This can be applied to any combination of variables.

For example, given stand age of a certain vegetation class in a region is less than 10 years old, or given the area burned in a region exceeds some upper threshold of interest, what is the probability this can be seen under the input conditions of a specific GCM?

Example: "Bounding models"

Be specific: Bounding what?
Be objective: Define conditions. Use math.
Be rigorous: Formalize methods. Calculate probabilities.

Ad hoc decision-making from purely exploratory graphs by "eyeballing it" lacks mathematical rigor and scientific objectivity.

The graph here is the opposite: a visual representation of an objective decision, based on a formalized methodological framework.

My favorite: Explain what makes your models "hot and cold" models. Can you articulate it simply and unambiguously right away, or do you have to really think about it for the first time?

Five climate models, but for a project we may only want to work with two of them. Which two do we use? The ambiguous answer is the same as always: "the hot one and the cold one". But the time has passed for applying the unsophisticated guesswork that ambiguity demands. We have to ask what that means. What defines hot and cold? Are we more concerned about climate inputs, and if so, hot vs. cold or dry vs. wet, or are we more concerned about fire outputs? For some projects it may be more relevant to consider area cover by specific vegetation types.

Boundary conditions should be well defined. When they are, it becomes easy to use those conditions to obtain the probabilities associated with levels of a factor in question, e.g., GCM, and the two GCMs with the lowest and highest probabilities given the conditions, respectively, are the bounding GCMs. Trying to distill the choice of models in a case like this from staring an assortment of plots lacks rigor but is also unnecessary when more formal tools are available.

The end

Thank you