SNAPverse data package: snapclim

Available climate data

To begin using snapclim effectively, take a look at the climate data collections available in your current version of the package.

library(snapclim)
climate_collections()
#> # A tibble: 1 x 11
#>   id    description regions points start   end daily monthly seasonal
#>   <chr> <chr>       <lgl>   <lgl>  <dbl> <dbl> <lgl> <lgl>   <lgl>   
#> 1 ar5s~ AR5/CMIP5 ~ TRUE    TRUE    1901  2100 FALSE TRUE    TRUE    
#> # ... with 2 more variables: annual <lgl>, decadal <lgl>

This prints a table of available data sets, one per row. The columns provide useful metadata regarding each data set.

Available locations

For most data sets, a request returns data for a specific location. Given that there are 86 defined climate regions and 3,867 point locations across Alaska and western Canada in the ar5stats collection, it is helpful to consult a table of available locations.

climate_locations()
#> # A tibble: 3,953 x 2
#>    Location                                Group               
#>    <chr>                                   <chr>               
#>  1 AK-CAN                                  AK-CAN              
#>  2 Arctic LCC                              AK LCC regions      
#>  3 North Pacific LCC                       AK LCC regions      
#>  4 Northwestern Interior Forest  South LCC AK LCC regions      
#>  5 Northwestern Interior Forest North LCC  AK LCC regions      
#>  6 Western Alaska LCC                      AK LCC regions      
#>  7 Boreal                                  Alaska L1 Ecoregions
#>  8 Maritime                                Alaska L1 Ecoregions
#>  9 Polar                                   Alaska L1 Ecoregions
#> 10 Alaska Range Transition                 Alaska L2 Ecoregions
#> # ... with 3,943 more rows

The table is truncated here but contains nearly 4,000 entries in the Location column. The Group column gives the set that each location belongs to; every location belongs to a group. By default the table lists all regions and point locations. Use the type argument of climate_locations to list only regions or only points.

climate_locations(type = "region")
#> # A tibble: 86 x 2
#>    Location                                Group               
#>    <chr>                                   <chr>               
#>  1 AK-CAN                                  AK-CAN              
#>  2 Arctic LCC                              AK LCC regions      
#>  3 North Pacific LCC                       AK LCC regions      
#>  4 Northwestern Interior Forest  South LCC AK LCC regions      
#>  5 Northwestern Interior Forest North LCC  AK LCC regions      
#>  6 Western Alaska LCC                      AK LCC regions      
#>  7 Boreal                                  Alaska L1 Ecoregions
#>  8 Maritime                                Alaska L1 Ecoregions
#>  9 Polar                                   Alaska L1 Ecoregions
#> 10 Alaska Range Transition                 Alaska L2 Ecoregions
#> # ... with 76 more rows
climate_locations(type = "point")
#> # A tibble: 3,867 x 2
#>    Location      Group 
#>    <chr>         <chr> 
#>  1 Adak Station  Alaska
#>  2 Afognak       Alaska
#>  3 Akhiok        Alaska
#>  4 Akiachak      Alaska
#>  5 Akiak         Alaska
#>  6 Akutan        Alaska
#>  7 Alakanuk      Alaska
#>  8 Alatna        Alaska
#>  9 Aleknagik     Alaska
#> 10 Aleut Village Alaska
#> # ... with 3,857 more rows

Some locations share the same name. For example, there is Galena, Alaska and Galena, British Columbia.

library(dplyr)
climate_locations() %>% filter(Location == "Galena")
#> # A tibble: 2 x 2
#>   Location Group           
#>   <chr>    <chr>           
#> 1 Galena   Alaska          
#> 2 Galena   British Columbia

It is good practice to avoid this ambiguity when requesting data by also specifying the group. Ambiguous requests are still permitted, since they keep calls simpler when you are already familiar with the locations you are interested in.
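
One way to spot ambiguous names before making a request is to look for repeated values in the Location column. The following is a minimal sketch in base R. It uses a small stand-in data frame built from rows shown in the tables above rather than calling climate_locations, so that it runs without the package; the object names locs, is_dup and ambiguous are illustrative only.

```r
# Stand-in for climate_locations(): a few rows copied from the tables above.
locs <- data.frame(
  Location = c("Adak Station", "Afognak", "Galena", "Galena"),
  Group = c("Alaska", "Alaska", "Alaska", "British Columbia"),
  stringsAsFactors = FALSE
)

# Flag every row whose location name occurs more than once.
is_dup <- locs$Location %in% locs$Location[duplicated(locs$Location)]
ambiguous <- locs[is_dup, ]
print(ambiguous)
```

Applied to the full table, the same two lines should list every name that needs a group to be identified uniquely.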

Data requests

snapclim provides access to a large amount of SNAP climate data, far more than would be stored locally within an R package. The data collections are stored on Amazon Web Services (AWS). snapclim interfaces with AWS to bring the specific data you need into your R session, as if it were a native package data set.

SNAP climate data sets are accessed with climdata. If at any time you get stuck using climdata, see the function documentation, which describes the available arguments and their usage in detail. The first argument, id, specifies a unique data collection; see the id column in the climate_collections output above. Next, a location is specified. A simple call to climdata for SNAP 2-km downscaled AR5/CMIP5 climate data summary statistics for Anchorage, Alaska looks like the following.

climdata("ar5stats", "Anchorage")
#> # A tibble: 101,400 x 8
#>    RCP        Model   Var   Group  Location   Year Month  Mean
#>    <fct>      <fct>   <fct> <chr>  <chr>     <int> <fct> <dbl>
#>  1 Historical CRU 4.0 pr    Alaska Anchorage  1901 Jan       7
#>  2 Historical CRU 4.0 pr    Alaska Anchorage  1901 Feb       1
#>  3 Historical CRU 4.0 pr    Alaska Anchorage  1901 Mar      14
#>  4 Historical CRU 4.0 pr    Alaska Anchorage  1901 Apr      16
#>  5 Historical CRU 4.0 pr    Alaska Anchorage  1901 May      13
#>  6 Historical CRU 4.0 pr    Alaska Anchorage  1901 Jun      24
#>  7 Historical CRU 4.0 pr    Alaska Anchorage  1901 Jul      40
#>  8 Historical CRU 4.0 pr    Alaska Anchorage  1901 Aug      70
#>  9 Historical CRU 4.0 pr    Alaska Anchorage  1901 Sep      58
#> 10 Historical CRU 4.0 pr    Alaska Anchorage  1901 Oct      69
#> # ... with 101,390 more rows

In subsequent examples, arguments are named for additional clarity.

The data include SNAP’s downscaled historical, observation-based Climatic Research Unit (CRU) 4.0 data and both downscaled historical and projected climate model outputs for all five of the General Circulation Models (GCMs) utilized by SNAP. All three CMIP5 emissions scenarios, or Representative Concentration Pathways (RCPs), are included. The data cover the entire available time period at a monthly time step.

By default, all available climate variables are returned: precipitation and mean, minimum and maximum temperature. These refer to monthly precipitation totals and monthly means of mean, minimum and maximum daily temperatures. This can be reduced to a specific variable in the initial call to climdata with variable = "pr" for example, or the table can be filtered subsequently.
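
Filtering after the fact can be sketched as follows. This is an illustration under stated assumptions, not output from the package: x is a hand-built stand-in for a climdata result with only a few columns, the two pr values are the Anchorage rows printed above, and the tas rows carry NA placeholders rather than real values. Base R subset is used so the block runs on its own; with dplyr loaded, filter(x, Var == "pr") would do the same job.

```r
# Stand-in for a climdata() result; only a few columns and rows are included.
# The pr values are the Anchorage rows printed above. The tas values are NA
# placeholders, not real data.
x <- data.frame(
  Var = c("pr", "pr", "tas", "tas"),
  Location = "Anchorage",
  Year = 1901L,
  Month = c("Jan", "Feb", "Jan", "Feb"),
  Mean = c(7, 1, NA, NA),
  stringsAsFactors = FALSE
)

# Keep precipitation rows only.
pr_only <- subset(x, Var == "pr")
print(pr_only)
```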

What if the location is not unique, like Galena? climdata will throw a warning and let you know it is assuming the first group found in the list of available locations.

climdata(id = "ar5stats", area = "Galena")
#> Warning in .check_area(area, set): `area` not unique and `set` not
#> provided. Assuming 'Alaska'. Please provide `set`.
#> # A tibble: 101,400 x 8
#>    RCP        Model   Var   Group  Location  Year Month  Mean
#>    <fct>      <fct>   <fct> <chr>  <chr>    <int> <fct> <dbl>
#>  1 Historical CRU 4.0 pr    Alaska Galena    1901 Jan      18
#>  2 Historical CRU 4.0 pr    Alaska Galena    1901 Feb      18
#>  3 Historical CRU 4.0 pr    Alaska Galena    1901 Mar      18
#>  4 Historical CRU 4.0 pr    Alaska Galena    1901 Apr      16
#>  5 Historical CRU 4.0 pr    Alaska Galena    1901 May      15
#>  6 Historical CRU 4.0 pr    Alaska Galena    1901 Jun      33
#>  7 Historical CRU 4.0 pr    Alaska Galena    1901 Jul      48
#>  8 Historical CRU 4.0 pr    Alaska Galena    1901 Aug      61
#>  9 Historical CRU 4.0 pr    Alaska Galena    1901 Sep      42
#> 10 Historical CRU 4.0 pr    Alaska Galena    1901 Oct      30
#> # ... with 101,390 more rows

The following example avoids the ambiguity, hence no warning.

climdata(id = "ar5stats", area = "Galena", set = "British Columbia")
#> # A tibble: 101,400 x 8
#>    RCP        Model   Var   Group            Location  Year Month  Mean
#>    <fct>      <fct>   <fct> <chr>            <chr>    <int> <fct> <dbl>
#>  1 Historical CRU 4.0 pr    British Columbia Galena    1901 Jan      87
#>  2 Historical CRU 4.0 pr    British Columbia Galena    1901 Feb      62
#>  3 Historical CRU 4.0 pr    British Columbia Galena    1901 Mar      26
#>  4 Historical CRU 4.0 pr    British Columbia Galena    1901 Apr      34
#>  5 Historical CRU 4.0 pr    British Columbia Galena    1901 May      27
#>  6 Historical CRU 4.0 pr    British Columbia Galena    1901 Jun      87
#>  7 Historical CRU 4.0 pr    British Columbia Galena    1901 Jul      36
#>  8 Historical CRU 4.0 pr    British Columbia Galena    1901 Aug       6
#>  9 Historical CRU 4.0 pr    British Columbia Galena    1901 Sep      69
#> 10 Historical CRU 4.0 pr    British Columbia Galena    1901 Oct      21
#> # ... with 101,390 more rows

Regional statistics

The climate variable values given in the tables obtained so far have all pertained to specific points in space. For regional climate data, a broad set of statistics is available that summarizes the distribution of climate values over a spatial domain defined by a polygon. For example, the table for the Arctic Tundra contains the mean, standard deviation, minimum, maximum and a set of distribution quantiles. Since the columns are truncated when printed in the display below, the first seven columns of ID variables are dropped in this example using select in order to show more of the additional statistics.

x <- climdata(id = "ar5stats", area = "Arctic Tundra")
select(x, -c(1:7))
#> # A tibble: 101,400 x 13
#>     Mean    SD   Min   Max Pct_025 Pct_05 Pct_10 Pct_25 Pct_50 Pct_75
#>    <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1  19.3   6.7   5.8  56.2     9.1   10     11.1   14.8   18.3   23  
#>  2  15.4   5.6   4.1  46.1     6.9    8.1    9     11.1   14     18.9
#>  3  14.2   5.6   2.9  38.1     6.9    8.1    9      9.9   13     17.1
#>  4  15.5   6.5   4    49.7     7.9    8.1    9     10.8   13.9   19.1
#>  5  15.1   7.8   0.7  53.1     7      7.8    8      8.9   12.2   19.9
#>  6  29.9  16.6  10.5 126.     12.6   13.5   14.2   17.3   23.9   39.8
#>  7  49.7  22    12.5 144.     17.9   21.3   25.1   31.3   46     64.5
#>  8  63.7  25.6  22.4 167.     28.8   30.6   32.8   41.5   61.5   80.1
#>  9  43.4  24.3  14.6 144      17.9   19     20.5   23.1   36.2   57  
#> 10  33.3  11.2  17.8  80.4    20.1   20.9   21.8   23.9   30.7   39.8
#> # ... with 101,390 more rows, and 3 more variables: Pct_90 <dbl>,
#> #   Pct_95 <dbl>, Pct_975 <dbl>

Note that these statistics summarize values across space. They do not also summarize values over months or years, or across climate models and scenarios. Distributional information is available for each point in time and under each combination of other available factors.

Seasonal and annual data

More highly aggregated data sets are available as well. The previous data sets were returned using the default argument time_scale = "monthly". If you change this to "seasonal" or "annual", climdata returns the respective data set. In the example below, the first three columns are dropped.

x <- climdata(id = "ar5stats", area = "Arctic Tundra", time_scale = "seasonal")
select(x, -c(1:3))
#> # A tibble: 33,800 x 17
#>    Group Region  Year Season  Mean    SD   Min   Max Pct_025 Pct_05 Pct_10
#>    <chr> <chr>  <int> <fct>  <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>
#>  1 Alas~ Arcti~  1901 Winter  18.9   7.8   1.5  65.3     8.1    9.1   10.1
#>  2 Alas~ Arcti~  1901 Spring  15     6.7   0.7  51.2     7.3    7.9    8.2
#>  3 Alas~ Arcti~  1901 Summer  47.4  25.5  10.1 157.     13.6   14.8   18.2
#>  4 Alas~ Arcti~  1901 Autumn  33.4  18.3   6.3 144.     12.2   13.9   17.6
#>  5 Alas~ Arcti~  1902 Winter  19.1   8.1   1.6  66.9     8      9     10  
#>  6 Alas~ Arcti~  1902 Spring  15     6.7   0.5  53       7.5    7.9    8.3
#>  7 Alas~ Arcti~  1902 Summer  47.3  25.4  10.5 168      13.7   14.7   18  
#>  8 Alas~ Arcti~  1902 Autumn  32.9  17.9   6.7 143      12     13.8   17.6
#>  9 Alas~ Arcti~  1903 Winter  19.1   8.4   1.6  61.2     7.9    8.9   10  
#> 10 Alas~ Arcti~  1903 Spring  14.6   9.9   0.8  74.7     4.5    5.7    7  
#> # ... with 33,790 more rows, and 6 more variables: Pct_25 <dbl>,
#> #   Pct_50 <dbl>, Pct_75 <dbl>, Pct_90 <dbl>, Pct_95 <dbl>, Pct_975 <dbl>

It is important to note that seasonal and annual aggregate statistics are not simple means of monthly statistics. Each of these collections is independently derived from climate variable spatial probability distributions at their respective temporal resolutions.

This means that, for example, monthly temperature quantiles for the Arctic Tundra are calculated from monthly spatial temperature distributions and winter temperature quantiles are calculated from the applicable 3-month period spatial temperature distributions. While the mean is invariant to this difference, other statistics are not. The 95th percentile winter temperature across space during the three month period does not result from taking the average of three monthly 95th percentile values.
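
The point about means and quantiles can be checked with a small simulation. This is only a sketch with random numbers, not SNAP data. It assumes that the seasonal value at each grid cell is the average of that cell's three monthly values, and it treats December to February as winter; both are assumptions made for illustration.

```r
set.seed(1)
n_cells <- 1000

# Simulated monthly mean temperatures (degrees C) for three winter months.
# Rows are grid cells, columns are months. Values are random, not SNAP data.
monthly <- cbind(
  Dec = rnorm(n_cells, mean = -20, sd = 3),
  Jan = rnorm(n_cells, mean = -25, sd = 3),
  Feb = rnorm(n_cells, mean = -22, sd = 3)
)

# Assumed seasonal value per grid cell: the average of its three months.
seasonal <- rowMeans(monthly)

# The spatial mean is the same whichever way it is computed.
mean_seasonal <- mean(seasonal)
mean_of_monthly_means <- mean(colMeans(monthly))

# The 95th percentile across space is not.
q95_seasonal <- unname(quantile(seasonal, 0.95))
mean_of_monthly_q95 <- mean(apply(monthly, 2, quantile, probs = 0.95))

print(c(mean_seasonal = mean_seasonal,
        mean_of_monthly_means = mean_of_monthly_means))
print(c(q95_seasonal = q95_seasonal,
        mean_of_monthly_q95 = mean_of_monthly_q95))
```

The two means agree, while the seasonal 95th percentile is lower than the average of the monthly 95th percentiles because averaging across months narrows the spatial distribution in this simulation.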

A final note on seasonal and annual statistics is that, like monthly statistics, these remain period totals for precipitation and period averages for temperature variables.

Decadal data

In contrast to monthly, seasonal and annual resolution statistics, all three of which are computed across space at their respective temporal resolutions, decadal statistics are simple decadal averages of monthly, seasonal and annual data. For example, the decadal mean of the 95th percentile monthly temperature across a region is just that: for a given month, the mean of the ten annual 95th percentile values in a decade.

For this reason, decadal data is not requested with climdata by specifying it with time_scale, which always pertains to annual and intra-annual (monthly or seasonal) time steps. Instead, use decavg = TRUE. This is FALSE by default so it did not previously need to be specified. When requesting decadal averages, there is still the choice of whether those averages should be of monthly, seasonal or annual resolution statistics.

x <- climdata(id = "ar5stats", area = "Arctic Tundra", time_scale = "seasonal", 
    decavg = TRUE)
select(x, -c(1:3))
#> # A tibble: 3,776 x 17
#>    Group Region Decade Season  Mean    SD   Min   Max Pct_025 Pct_05 Pct_10
#>    <chr> <chr>   <int> <fct>  <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>
#>  1 Alas~ Arcti~   1900 Winter  18.1   8.8   1.6  71.1     6.6    7.6    8.8
#>  2 Alas~ Arcti~   1900 Spring  14.2   8.7   0.5  70.6     3.7    4.7    6.5
#>  3 Alas~ Arcti~   1900 Summer  46.3  27.6   7.8 188.     12.4   13.9   16.6
#>  4 Alas~ Arcti~   1900 Autumn  30.3  18.2   4.1 144.      8.6   10.1   13.3
#>  5 Alas~ Arcti~   1910 Winter  20.1  13.8   1   107.      5.2    6.5    8.5
#>  6 Alas~ Arcti~   1910 Spring  16.5  12.3   0.6  98.4     4.1    5      6.7
#>  7 Alas~ Arcti~   1910 Summer  47    29     6.9 198.     11.8   13.2   16.3
#>  8 Alas~ Arcti~   1910 Autumn  32.5  20.9   3.6 162.      8.7   11     14.3
#>  9 Alas~ Arcti~   1920 Winter  22.7  15.1   0.9 102       5.1    6.2    7.6
#> 10 Alas~ Arcti~   1920 Spring  16.1  12.2   0.4  84.7     2.3    2.9    4  
#> # ... with 3,766 more rows, and 6 more variables: Pct_25 <dbl>,
#> #   Pct_50 <dbl>, Pct_75 <dbl>, Pct_90 <dbl>, Pct_95 <dbl>, Pct_975 <dbl>
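
The decadal averaging described in this section amounts to a grouped mean over years. The following base R sketch shows the idea with made-up numbers. The annual data frame and its values are invented for illustration, and the decade labelling (each year rounded down to a multiple of ten) is an assumption chosen to match the Decade column printed above.

```r
# Made-up annual values of one statistic (for example Pct_95), 2010 to 2029.
annual <- data.frame(
  Year = 2010:2029,
  Pct_95 = as.numeric(10:29)
)

# Label each year with its decade, in the style of the Decade column
# shown above (1900, 1910, 1920, ...).
annual$Decade <- (annual$Year %/% 10L) * 10L

# Decadal value: the plain mean of the annual values within each decade.
decadal <- aggregate(Pct_95 ~ Decade, data = annual, FUN = mean)
print(decadal)
```

No spatial distribution is involved at this step; each decadal value is an average of statistics that were already computed at the finer time scale.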

Multiple locations

It is possible to obtain climate data sets that include multiple locations using climdata, but this is only available for smaller data sets where it would not lead to a cumbersome data download. Currently, the only data set for which multiple locations can be returned at once is the decadal averages data set in the ar5stats collection. By specifying area = "points" rather than a specific point location, a table is returned containing data for all 3,867 point locations.

There are two requirements that help to ensure this does not lead to an excessive download size or waiting time. First, as mentioned, this is only available for the highly aggregated decadal data. Without setting decavg = TRUE, attempting to specify area = "points" will throw an error. Second, only a single climate variable is returned, so you should also specify the variable argument. If you do not, mean temperature (tas) is assumed. You can always call climdata multiple times for additional variables if desired.

x <- climdata(id = "ar5stats", area = "points", time_scale = "annual", decavg = TRUE, 
    variable = "tas")

To show the result more effectively, filter the table to a specific combination of other factors.

filter(x, RCP == "6.0" & Model == "GFDL-CM3" & Decade == 2050)
#> # A tibble: 3,867 x 8
#>    RCP   Model    Var   Group  Location      Decade Season  Mean
#>    <fct> <fct>    <fct> <chr>  <chr>          <int> <fct>  <dbl>
#>  1 6.0   GFDL-CM3 tas   Alaska Adak Station    2050 Annual   7.2
#>  2 6.0   GFDL-CM3 tas   Alaska Afognak         2050 Annual   8.6
#>  3 6.0   GFDL-CM3 tas   Alaska Akhiok          2050 Annual   8.4
#>  4 6.0   GFDL-CM3 tas   Alaska Akiachak        2050 Annual   3.4
#>  5 6.0   GFDL-CM3 tas   Alaska Akiak           2050 Annual   3.3
#>  6 6.0   GFDL-CM3 tas   Alaska Akutan          2050 Annual   7.6
#>  7 6.0   GFDL-CM3 tas   Alaska Alakanuk        2050 Annual   3.8
#>  8 6.0   GFDL-CM3 tas   Alaska Alatna          2050 Annual  -1.7
#>  9 6.0   GFDL-CM3 tas   Alaska Aleknagik       2050 Annual   5.4
#> 10 6.0   GFDL-CM3 tas   Alaska Aleut Village   2050 Annual   8.6
#> # ... with 3,857 more rows

Analyzing SNAP climate data

The snapclim package is essentially a data package. It provides a simplified and convenient interface in R enabling easy access to a large amount of SNAP climate data spread over multiple collections, stemming from different sources and existing for different purposes. It does not provide functionality for performing statistical analysis and graphing, which is provided by R in general. Useful stock functions pertaining specifically to analyzing SNAP data, including climate data accessed with snapclim, are available in the snapstat package (under development).

Climate distributions

For more information on the climate probability distributions from which regional climate statistics are calculated, see the snapdist package (under development) or SNAP’s Climate Analytics Shiny app for working examples.