The three A's

  • Access
  • Assemble
  • Analyze

Data extraction, data prep, and statistical analysis

Abstraction

  • Data prep should be abstracted from any analysis
  • "Prep" implies targeted usage, but can be generalized with careful planning
  • Extraction is treated separately from subsequent data prep and is most important to abstract from analysis

Extract once, analyze freely

A series of 100 procedures might consist of 50 heavy duty extraction tasks, each followed by some kind of exploration of the extracted data.

It should consist of one extraction task followed by 99 unique data analyses.

Example:

Extract data from 500,000 spatially explicit geotiffs

  • Data are tabled and stored efficiently for subsequent access.
  • No need to return repeatedly to raw maps to perform redundant extractions.
  • Data describe distributions of random variables rather than specific statistics

Organize data for easy analysis

  • Final data formats are standardized tables
  • General enough to work across multiple types of RV data sets, allowing easy merging and comparison of multiple data sets

Load data into a specific project to analyze

Statistical modeling: A closer look at data prep

Probability distributions

Let \(X\) be random variable, e.g., stand age, burn area, precipitation, etc.

Model the probability density function (pdf) of \(X\), \(f(X)\)

  • Achieve substantial data reduction without effective loss of information
  • Any statistic based on the RV can be calculated from its pdf.
  • No need to return to raw maps each time a new summary statistic is requested; compute it with the distribution model

The right tools for the job

Data manipulation

  • Standardized data table formats
  • Easy to work with, easy to read
  • Fast operations
  • Simple tables can store complex data structures and models, which are easily indexed, extracted or unnested

Example data table

##         Scenario       Model   Vegetation Year       Val         Prob
##      1: SRES A1B CCCMAcgcm31 Black Spruce 2010  76193.85 2.447226e-20
##      2: SRES A1B CCCMAcgcm31 Black Spruce 2010  76219.42 1.328957e-20
##      3: SRES A1B CCCMAcgcm31 Black Spruce 2010  76244.99 1.791705e-20
##      4: SRES A1B CCCMAcgcm31 Black Spruce 2010  76270.56 2.320447e-20
##      5: SRES A1B CCCMAcgcm31 Black Spruce 2010  76296.13 1.531966e-20
##     ---                                                              
## 299996:  SRES B1  ukmoHADcm3 White Spruce 2019 110451.77 1.402891e-20
## 299997:  SRES B1  ukmoHADcm3 White Spruce 2019 110469.13 4.138351e-21
## 299998:  SRES B1  ukmoHADcm3 White Spruce 2019 110486.48 1.690414e-20
## 299999:  SRES B1  ukmoHADcm3 White Spruce 2019 110503.84 2.569890e-20
## 300000:  SRES B1  ukmoHADcm3 White Spruce 2019 110521.20 1.547481e-20
  • Standardized across multiple variables
  • Easy to combine and compare

Straightforward manipulation, clear code

data %>% filter(Model=="CCCMAcgcm31") %>% group_by(Scenario, Model, Vegetation) %>%
    summarise(Decade="2010s", Avg=mean(Val)) %>% data.table
##    Scenario       Model   Vegetation Decade       Avg
## 1: SRES A1B CCCMAcgcm31 Black Spruce  2010s  89169.60
## 2: SRES A1B CCCMAcgcm31 White Spruce  2010s  96849.50
## 3:  SRES A2 CCCMAcgcm31 Black Spruce  2010s  91796.50
## 4:  SRES A2 CCCMAcgcm31 White Spruce  2010s 100069.50
## 5:  SRES B1 CCCMAcgcm31 Black Spruce  2010s  91327.85
## 6:  SRES B1 CCCMAcgcm31 White Spruce  2010s 100125.70
  • Intuitive verbs
  • Read left-to-right
  • Ideal for standard, human-guided interactive data analysis as well as non-interactive processing

Fast operations

  • Data table manipulations, restructuring, summarizing and statistical computations are fast
  • Comparable to matrix arithmetic with slight overhead
  • Core functionality optimized in C++, accessed via Rcpp

Organize complex data structures and statistical models in small tables

  • Visually compress large data tables to see the big picture at a glance
table1 <- data %>% filter(Model=="CCCMAcgcm31" & Year==2010) %>%
    group_by(Scenario, Vegetation) %>%
    do(Density=data.table(Val=Val, Prob=Prob), # column collapse
       Stats=data.frame(as.list(summary(Val))), # summary stats
       Linear_model=lm(Val~I(rnorm(length(Val)))) # complex object
    ) %>% data.table
table1
##    Scenario   Vegetation      Density        Stats Linear_model
## 1: SRES A1B Black Spruce <data.table> <data.frame>         <lm>
## 2: SRES A1B White Spruce <data.table> <data.frame>         <lm>
## 3:  SRES A2 Black Spruce <data.table> <data.frame>         <lm>
## 4:  SRES A2 White Spruce <data.table> <data.frame>         <lm>
## 5:  SRES B1 Black Spruce <data.table> <data.frame>         <lm>
## 6:  SRES B1 White Spruce <data.table> <data.frame>         <lm>
  • Access arbitrary objects from data table cells
table1$Linear_model[[1]] # Index and extract a specific regression model
## 
## Call:
## lm(formula = Val ~ I(rnorm(length(Val))))
## 
## Coefficients:
##           (Intercept)  I(rnorm(length(Val)))  
##               88974.3                 -284.7

Unnest the Summary column

table1 %>% select(Scenario, Vegetation, Stats) %>% unnest %>% data.table
##    Scenario   Vegetation  Min. X1st.Qu. Median  Mean X3rd.Qu.   Max.
## 1: SRES A1B Black Spruce 76190    82580  88970 88970    95350 101700
## 2: SRES A1B White Spruce 88930    92480  96030 96030    99580 103100
## 3:  SRES A2 Black Spruce 76430    83140  89850 89850    96560 103300
## 4:  SRES A2 White Spruce 90350    93840  97330 97330   100800 104300
## 5:  SRES B1 Black Spruce 76600    83200  89800 89800    96390 103000
## 6:  SRES B1 White Spruce 90650    94080  97500 97500   100900 104400

Unnest the Density column to recover the original table values

table1 %>% select(-Stats, -Linear_model) %>% unnest %>% data.table
##       Scenario   Vegetation       Val         Prob
##    1: SRES A1B Black Spruce  76193.85 2.447226e-20
##    2: SRES A1B Black Spruce  76219.42 1.328957e-20
##    3: SRES A1B Black Spruce  76244.99 1.791705e-20
##    4: SRES A1B Black Spruce  76270.56 2.320447e-20
##    5: SRES A1B Black Spruce  76296.13 1.531966e-20
##   ---                                             
## 5996:  SRES B1 White Spruce 104299.08 1.599720e-18
## 5997:  SRES B1 White Spruce 104312.79 3.402925e-19
## 5998:  SRES B1 White Spruce 104326.51 4.896929e-20
## 5999:  SRES B1 White Spruce 104340.23 2.233027e-21
## 6000:  SRES B1 White Spruce 104353.95 2.369216e-21

Distributions of random variables

Working with probability distributions

Joint distributions: \(f_{X_1,X_2,...,X_n}(X_1,X_2,...,X_n)\)

  • Multivariate distributions, indirectly represented by the table as a whole
  • Relatively trivial in this context since most factors are fixed effects

Conditional distributions: \(f_{X,Y}(X|Y=y)\)

  • Tables are a series of conditional distributions of \(X\)
  • Accessing a conditional distribution simply requires row filtering

Marginal distributions: \(f_{X,Y}(X)\)

  • Accessible by integrating over levels of \(Y\)

Using distribution tables: Layer of abstraction between user and detailed code

Ex: Condition, marginalize, sample

ws2014 <- data %>% filter(Vegetation=="White Spruce" & Year==2014)
values <- ws2014 %>% marginalize(margin="Model") %>%
    sample_densities
values2 <- ws2014 %>% marginalize(margin=c("Scenario", "Model")) %>%
    sample_densities

Colored lines: \(f_{X,Model}(X|Scenario)\)

Conditional distributions of \(X\) given scenario, marginalized over Model

Black line: \(f_{X,Scenario,Model}(X)\)

Marginal distribution of \(X\) with respect to Scenario and Model

Compare distributions of \(X\) defined over an increasing set of interacting factors

XgivenSM <- ws2014 %>% filter(Scenario=="SRES B1" & Model=="ukmoHADcm3") # condition on specific scenario and model
XgivenS <- ws2014 %>% filter(Scenario=="SRES B1") %>% marginalize("Model") # condition on a scenario, marginalize over models
X <- ws2014 %>% marginalize(c("Scenario", "Model")) # marginalize over scenarios and models

Marginal distributions tend to be wider

Conditioning on fixed levels of variables tends to narrow a distribution

The distribution of \(X\) is not independent of the distribution of \(Y\) (e.g., Scenario, Model) or the distribution of \(X\) would be the same regardless of the other variables' levels.

Sanity check: Does a distribution "wash out" if manipulated too many times?

  • Sometimes several iterations of distribution merging may be required
  • Each merge requires a cycle of sampling from the densities to be merged and estimating a new density from the pooled sample
  • If sample is sufficiently large and density sufficiently flexible, no effective information lost
  • Use bootDenCycle() test function to perform 10 sequential cycles of bootstrap resampling and density re-estimation

Summarizing uncertainty in a random variable

  • Uncertainty can be summarized with a certainty interval
  • Uncertainty in \(X\) compounds across multiple factors, widening the certainty interval
  • Individual uncertainty components by factor
  • Proportional uncertainty by factor
  • Hierarchical stepwise addition of conditional uncertainty by factor
  • Pairwise comparison of uncertainty from a factor due to specific factor levels
  • Constraints on maximum uncertainty/minimum certainty

Functions to summarise uncertainty include uc_table(), uc_components() and uc_stepwise()

  • uc_table provides quick access to certainty bounds
  • Bounds are based on a flexible definition, default- or user-provided
## Source: local data table [300 x 9]
## Groups: Scenario, Model, Vegetation, Year
## 
##    Scenario       Model   Vegetation  Year       LB     Mean        UB
##      (fctr)       (chr)       (fctr) (int)    (dbl)    (dbl)     (dbl)
## 1  SRES A1B CCCMAcgcm31 Black Spruce  2010 84010.98 90883.17  98537.57
## 2  SRES A1B CCCMAcgcm31 White Spruce  2010 90899.21 95488.15 100286.71
## 3  SRES A1B CCCMAcgcm31 Black Spruce  2011 84139.60 90654.30  97475.67
## 4  SRES A1B CCCMAcgcm31 White Spruce  2011 90300.65 95353.42 100440.58
## 5  SRES A1B CCCMAcgcm31 Black Spruce  2012 84975.92 91438.53  98349.17
## 6  SRES A1B CCCMAcgcm31 White Spruce  2012 90970.05 96078.47 101142.78
## 7  SRES A1B CCCMAcgcm31 Black Spruce  2013 81972.41 89166.18  96106.33
## 8  SRES A1B CCCMAcgcm31 White Spruce  2013 89656.30 94066.03  98652.40
## 9  SRES A1B CCCMAcgcm31 Black Spruce  2014 82268.22 89405.03  96110.89
## 10 SRES A1B CCCMAcgcm31 White Spruce  2014 89835.90 94301.33  98896.68
## ..      ...         ...          ...   ...      ...      ...       ...
## Variables not shown: Magnitude (dbl), Type (chr)

Uncertainty examples

The right data for the job

Analysis driven by specific research questions

  • Data exploration and statistical analysis become easier to perform
  • Questions become easier to address

Inverse probability: from \(f(X|Y=y)\) to \(f(Y|X=x)\)

Example: First look at a joint distribution of \(X\) and \(Y\), \(f_{X,Y}(X,Y)\)

\(X\) is total regional black spruce area in km\(^2\). \(Y\) is Scenario. Other factors fixed.

##       Scenario   Vegetation       Val         Prob
##    1: SRES A1B Black Spruce  76193.85 2.447226e-20
##    2: SRES A1B Black Spruce  76219.42 1.328957e-20
##    3: SRES A1B Black Spruce  76244.99 1.791705e-20
##    4: SRES A1B Black Spruce  76270.56 2.320447e-20
##    5: SRES A1B Black Spruce  76296.13 1.531966e-20
##   ---                                             
## 2996:  SRES B1 Black Spruce 102883.70 2.508392e-21
## 2997:  SRES B1 Black Spruce 102910.11 9.755743e-21
## 2998:  SRES B1 Black Spruce 102936.53 7.272612e-21
## 2999:  SRES B1 Black Spruce 102962.94 2.608479e-21
## 3000:  SRES B1 Black Spruce 102989.35 1.521653e-21
  • The table can be viewed as an indirect representation of the joint distribution of multiple random variables.
  • Directly, it is a sequence of conditional distributions of \(X\) given unique combinations of factor levels of other variables.
  • Marginal distributions of \(X\) with respect to a factor, e.g., Scenario or Model, require integrating out the factor.

To simplify, classify \(X\) into three categories. Now \(X\) and \(Y\) are both categorical.

Joint pmf Spruce Cover & Scenario
Small Medium Large Margin
SRES B1 0.0085 0.3012 0.0237 0.3333
SRES A1B 0.0134 0.3060 0.0139 0.3333
SRES A2 0.0076 0.3020 0.0237 0.3333
Margin 0.0295 0.9092 0.0613 1.0000

\[f(X,Y)/f(Y)=f(X|Y=y)\]

\[f(X,Y)/f(X)=f(Y|X=x)\]

f(X | Y = SRES B1)
Small 0.0254
Medium 0.9035
Large 0.0711
f(Y | X = Small)
SRES B1 0.2873
SRES A1B 0.4559
SRES A2 0.2568

Example: "Bounding models"

  • Be specific: Bounding what?
  • Be objective: Define conditions. Use math.
  • Be rigorous: Formalize methods. Calculate probabilities.

Ad hoc decision-making from purely exploratory graphs by "eyeballing it" lacks mathematical rigor and scientific objectivity.

The graph here is the opposite: a visual representation of an objective decision, based on a formalized methodological framework.

My favorite: Explain what makes your models "hot and cold" models. Can you articulate it simply and unambiguously right away, or do you have to really think about it for the first time?

The end

Thank you