- Access
- Assemble
- Analyze
Data extraction, data prep, and statistical analysis
Data extraction, data prep, and statistical analysis
A series of 100 procedures might consist of 50 heavy duty extraction tasks, each followed by some kind of exploration of the extracted data.
It should consist of one extraction task followed by 99 unique data analyses.
Example:
Extract data from 500,000 spatially explicit geotiffs
Organize data for easy analysis
Load data into a specific project to analyze
Probability distributions
Let \(X\) be random variable, e.g., stand age, burn area, precipitation, etc.
Model the probability density function (pdf) of \(X\), \(f(X)\)
Example data table
## Scenario Model Vegetation Year Val Prob ## 1: SRES A1B CCCMAcgcm31 Black Spruce 2010 76193.85 2.447226e-20 ## 2: SRES A1B CCCMAcgcm31 Black Spruce 2010 76219.42 1.328957e-20 ## 3: SRES A1B CCCMAcgcm31 Black Spruce 2010 76244.99 1.791705e-20 ## 4: SRES A1B CCCMAcgcm31 Black Spruce 2010 76270.56 2.320447e-20 ## 5: SRES A1B CCCMAcgcm31 Black Spruce 2010 76296.13 1.531966e-20 ## --- ## 299996: SRES B1 ukmoHADcm3 White Spruce 2019 110451.77 1.402891e-20 ## 299997: SRES B1 ukmoHADcm3 White Spruce 2019 110469.13 4.138351e-21 ## 299998: SRES B1 ukmoHADcm3 White Spruce 2019 110486.48 1.690414e-20 ## 299999: SRES B1 ukmoHADcm3 White Spruce 2019 110503.84 2.569890e-20 ## 300000: SRES B1 ukmoHADcm3 White Spruce 2019 110521.20 1.547481e-20
Straightforward manipulation, clear code
data %>% filter(Model=="CCCMAcgcm31") %>% group_by(Scenario, Model, Vegetation) %>% summarise(Decade="2010s", Avg=mean(Val)) %>% data.table
## Scenario Model Vegetation Decade Avg ## 1: SRES A1B CCCMAcgcm31 Black Spruce 2010s 89169.60 ## 2: SRES A1B CCCMAcgcm31 White Spruce 2010s 96849.50 ## 3: SRES A2 CCCMAcgcm31 Black Spruce 2010s 91796.50 ## 4: SRES A2 CCCMAcgcm31 White Spruce 2010s 100069.50 ## 5: SRES B1 CCCMAcgcm31 Black Spruce 2010s 91327.85 ## 6: SRES B1 CCCMAcgcm31 White Spruce 2010s 100125.70
Fast operations
Organize complex data structures and statistical models in small tables
table1 <- data %>% filter(Model=="CCCMAcgcm31" & Year==2010) %>% group_by(Scenario, Vegetation) %>% do(Density=data.table(Val=Val, Prob=Prob), # column collapse Stats=data.frame(as.list(summary(Val))), # summary stats Linear_model=lm(Val~I(rnorm(length(Val)))) # complex object ) %>% data.table table1
## Scenario Vegetation Density Stats Linear_model ## 1: SRES A1B Black Spruce <data.table> <data.frame> <lm> ## 2: SRES A1B White Spruce <data.table> <data.frame> <lm> ## 3: SRES A2 Black Spruce <data.table> <data.frame> <lm> ## 4: SRES A2 White Spruce <data.table> <data.frame> <lm> ## 5: SRES B1 Black Spruce <data.table> <data.frame> <lm> ## 6: SRES B1 White Spruce <data.table> <data.frame> <lm>
table1$Linear_model[[1]] # Index and extract a specific regression model
## ## Call: ## lm(formula = Val ~ I(rnorm(length(Val)))) ## ## Coefficients: ## (Intercept) I(rnorm(length(Val))) ## 88974.3 -284.7
Unnest the Summary column
table1 %>% select(Scenario, Vegetation, Stats) %>% unnest %>% data.table
## Scenario Vegetation Min. X1st.Qu. Median Mean X3rd.Qu. Max. ## 1: SRES A1B Black Spruce 76190 82580 88970 88970 95350 101700 ## 2: SRES A1B White Spruce 88930 92480 96030 96030 99580 103100 ## 3: SRES A2 Black Spruce 76430 83140 89850 89850 96560 103300 ## 4: SRES A2 White Spruce 90350 93840 97330 97330 100800 104300 ## 5: SRES B1 Black Spruce 76600 83200 89800 89800 96390 103000 ## 6: SRES B1 White Spruce 90650 94080 97500 97500 100900 104400
Unnest the Density column to recover the original table values
table1 %>% select(-Stats, -Linear_model) %>% unnest %>% data.table
## Scenario Vegetation Val Prob ## 1: SRES A1B Black Spruce 76193.85 2.447226e-20 ## 2: SRES A1B Black Spruce 76219.42 1.328957e-20 ## 3: SRES A1B Black Spruce 76244.99 1.791705e-20 ## 4: SRES A1B Black Spruce 76270.56 2.320447e-20 ## 5: SRES A1B Black Spruce 76296.13 1.531966e-20 ## --- ## 5996: SRES B1 White Spruce 104299.08 1.599720e-18 ## 5997: SRES B1 White Spruce 104312.79 3.402925e-19 ## 5998: SRES B1 White Spruce 104326.51 4.896929e-20 ## 5999: SRES B1 White Spruce 104340.23 2.233027e-21 ## 6000: SRES B1 White Spruce 104353.95 2.369216e-21
Joint distributions: \(f_{X_1,X_2,...,X_n}(X_1,X_2,...,X_n)\)
Conditional distributions: \(f_{X,Y}(X|Y=y)\)
Marginal distributions: \(f_{X,Y}(X)\)
Using distribution tables: Layer of abstraction between user and detailed code
Ex: Condition, marginalize, sample
ws2014 <- data %>% filter(Vegetation=="White Spruce" & Year==2014) values <- ws2014 %>% marginalize(margin="Model") %>% sample_densities values2 <- ws2014 %>% marginalize(margin=c("Scenario", "Model")) %>% sample_densities
Colored lines: \(f_{X,Model}(X|Scenario)\)
Conditional distributions of \(X\) given scenario, marginalized over Model
Black line: \(f_{X,Scenario,Model}(X)\)
Marginal distribution of \(X\) with respect to Scenario and Model
Compare distributions of \(X\) defined over an increasing set of interacting factors
XgivenSM <- ws2014 %>% filter(Scenario=="SRES B1" & Model=="ukmoHADcm3") # condition on specific scenario and model XgivenS <- ws2014 %>% filter(Scenario=="SRES B1") %>% marginalize("Model") # condition on a scenario, marginalize over models X <- ws2014 %>% marginalize(c("Scenario", "Model")) # marginalize over scenarios and models
Marginal distributions tend to be wider
Conditioning on fixed levels of variables tends to narrow a distribution
The distribution of \(X\) is not independent of the distribution of \(Y\) (e.g., Scenario, Model) or the distribution of \(X\) would be the same regardless of the other variables' levels.
Sanity check: Does a distribution "wash out" if manipulated too many times?
Functions to summarise uncertainty include uc_table(), uc_components() and uc_stepwise()
## Source: local data table [300 x 9] ## Groups: Scenario, Model, Vegetation, Year ## ## Scenario Model Vegetation Year LB Mean UB ## (fctr) (chr) (fctr) (int) (dbl) (dbl) (dbl) ## 1 SRES A1B CCCMAcgcm31 Black Spruce 2010 84010.98 90883.17 98537.57 ## 2 SRES A1B CCCMAcgcm31 White Spruce 2010 90899.21 95488.15 100286.71 ## 3 SRES A1B CCCMAcgcm31 Black Spruce 2011 84139.60 90654.30 97475.67 ## 4 SRES A1B CCCMAcgcm31 White Spruce 2011 90300.65 95353.42 100440.58 ## 5 SRES A1B CCCMAcgcm31 Black Spruce 2012 84975.92 91438.53 98349.17 ## 6 SRES A1B CCCMAcgcm31 White Spruce 2012 90970.05 96078.47 101142.78 ## 7 SRES A1B CCCMAcgcm31 Black Spruce 2013 81972.41 89166.18 96106.33 ## 8 SRES A1B CCCMAcgcm31 White Spruce 2013 89656.30 94066.03 98652.40 ## 9 SRES A1B CCCMAcgcm31 Black Spruce 2014 82268.22 89405.03 96110.89 ## 10 SRES A1B CCCMAcgcm31 White Spruce 2014 89835.90 94301.33 98896.68 ## .. ... ... ... ... ... ... ... ## Variables not shown: Magnitude (dbl), Type (chr)
Inverse probability: from \(f(X|Y=y)\) to \(f(Y|X=x)\)
Example: First look at a joint distribution of \(X\) and \(Y\), \(f_{X,Y}(X,Y)\)
\(X\) is total regional black spruce area in km\(^2\). \(Y\) is Scenario. Other factors fixed.
## Scenario Vegetation Val Prob ## 1: SRES A1B Black Spruce 76193.85 2.447226e-20 ## 2: SRES A1B Black Spruce 76219.42 1.328957e-20 ## 3: SRES A1B Black Spruce 76244.99 1.791705e-20 ## 4: SRES A1B Black Spruce 76270.56 2.320447e-20 ## 5: SRES A1B Black Spruce 76296.13 1.531966e-20 ## --- ## 2996: SRES B1 Black Spruce 102883.70 2.508392e-21 ## 2997: SRES B1 Black Spruce 102910.11 9.755743e-21 ## 2998: SRES B1 Black Spruce 102936.53 7.272612e-21 ## 2999: SRES B1 Black Spruce 102962.94 2.608479e-21 ## 3000: SRES B1 Black Spruce 102989.35 1.521653e-21
To simplify, classify \(X\) into three categories. Now \(X\) and \(Y\) are both categorical.
Small | Medium | Large | Margin | |
---|---|---|---|---|
SRES B1 | 0.0085 | 0.3012 | 0.0237 | 0.3333 |
SRES A1B | 0.0134 | 0.3060 | 0.0139 | 0.3333 |
SRES A2 | 0.0076 | 0.3020 | 0.0237 | 0.3333 |
Margin | 0.0295 | 0.9092 | 0.0613 | 1.0000 |
\[f(X,Y)/f(Y)=f(X|Y=y)\]
\[f(X,Y)/f(X)=f(Y|X=x)\]
Small | 0.0254 |
Medium | 0.9035 |
Large | 0.0711 |
SRES B1 | 0.2873 |
SRES A1B | 0.4559 |
SRES A2 | 0.2568 |
Example: "Bounding models"
Ad hoc decision-making from purely exploratory graphs by "eyeballing it" lacks mathematical rigor and scientific objectivity.
The graph here is the opposite: a visual representation of an objective decision, based on a formalized methodological framework.
My favorite: Explain what makes your models "hot and cold" models. Can you articulate it simply and unambiguously right away, or do you have to really think about it for the first time?Thank you