The rvtable package provides a special type of data frame subclass with associated functionality for relatively convenient storage and manipulation of random variables. The emphasis is on distributions of continuous random variables derived from relatively large samples, though discrete random variables are also handled.

An rvtable contains a column defining values representing or sampled from a probability distribution and, usually (see below for exception), a column of respective probabilities. Additional columns are treated as ID columns and represent factors or categorical variables.

Motivation

While this package can be used for organizing small samples, there is not much point. The main motivation for rvtable is relatively seamless storage and manipulation of empirically estimated continuous probability distributions deriving from relatively large samples or data sets.

By relatively large, I mean cases where it is both substantially more statistically and computationally efficient to store and subsequently work with estimated probability distributions that are derived from and sufficiently representative of the source data than to work directly on the source data itself.

For an example scenario, see the use case example further below in the vignette.

Note that it is of no benefit to known distributions with closed mathematical form expressions because there is never a need to lug a ton of such data around in the first place. For example, a random normal distribution can be sampled with rnorm at any time. rvtable is helpful for empirical samples which are large and messy, of a complicated form or mixed distributions, which cannot be reduced to a known or simple combination of known distributions, where an efficient snapshot of the distribution is helpful to avoid juggling excessive amounts of data from one analysis stage to the next while retaining sufficient distributional information.

Linking code and concept

It is helpful to make the connection between operations performed by code and the probability concepts they parallel. The entirety of an rvtable as described above represents a joint probability distribution of the primary continuous or discrete random variable whose values and probabilities are represented by the values and probabilities columns and any categorical variables whose levels propagate any ID columns.

Filtering rows of an rvtable to specific combinations of ID variable levels yields a subset of the original table that represents a conditional distribution of the primary random variable. Similarly, integrating out, marginalizing over, or collapsing the table over the unique levels of one or multiple ID variables is analogous to calculating a marginal distribution of the primary random variable.

Package functionality

This package offers a collection of functions that assist with manipulating tables of random variables in these and other ways. There is a constructor function, rvtable, for generating rvtables from different kinds of input data. An rvtable can be in probability distribution or sample form. This means that an rvtable will contain two columns, one of values and one of associated probabilities, when representing a distribution, but only the values column once sampled from. Form is tracked by object attributes.

Other functions available in the package offer seamless transitions between forms, carrying out various operations such as sampling (sample_rvtable) from densities and marginalizing over levels of categorical variables in an rvtable (merge_rvtable or marginalize). Users can maintain control over the resolution of the data via function arguments that correspond to any sampling or empirical density estimation steps that a function may perform. Probability mass functions can be computed and stored in a new rvtable (inverse_pmf) by inverting another rvtable, yielding the pmf of an ID variable conditional on a range of values of a continuous primary variable and any other ID variables.

Usage

This introduction covers the following topics:

  • Create various rvtable objects.
  • Examine rvtable class attributes.
  • Sample from a distribution-type rvtable, yielding a new, sample-type rvtable.
  • Compute marginal distributions on an rvtable.
  • Compute an inverse empirical probability mass function from an rvtable.
  • Perform repeated cycles of distribution estimation and resampling.
  • Explore a toy example of a realistic use case.

Creating rvtable objects

There are several ways to create an rvtable from the constructor function, rvtable. Below are examples of continuous and discrete random variables stored in rvtables. In the first example, the input, x, is simply a numeric vector representing a sample. If density.args is not specified, the defaults are those used by density. See the help documentation for density for details. The second call below limits the sample from n=512 to n=50 and smooths the distribution by setting adjust=2. We can also check to see if an object has the rvtable class.

library(rvtable)
# basic samples from continuous and discrete RVs
x <- rnorm(100)
rvtable(x)
rvtable(x, density.args = list(n = 50, adjust = 2))
is.rvtable(rvtable(x))

This reveals that when working with “continuous” random variables, the distribution is always discretized to some degree. When high fidelity is required, increase n. This is increasingly important when working with a long data manipulation chain requiring many iterations of density estimation or sampling, such as marginalizing over multiple categorical variables or if it is critical to maintain the shape of a multimodal distribution with high precision.

The default is to assume x is continuous. In the next case, it is more ambiguous. While all values in this example are clearly discrete integers, they are still treated as samples from a continuous probability distribution and will be modeled internally by density as such unless discrete=TRUE.

For discrete random variables, probabilities associated with values of x can be passed to y. Alternatively, x can be passed without y for the same purpose if x has the attribute probabilities. Note that in either case, it is no longer necessary to explicitly pass discrete = TRUE. The discrete nature of x is inferred either from its attributes or from y. When x is a numeric vector, discrete = TRUE is only required when probabilities are not passed by y or by x attributes and x is a basic unweighted sample (but possibly with repeating values). There are a number of convenient functions for accessing attributes of an rvtable. See ?helpers as well as the next section for details.

x <- sample(1:10, size = 10000, replace = TRUE, prob = sqrt(10:1))
rvtable(x, discrete = TRUE)  # required
x <- 1:5
probs <- c(0.1, 0.2, 0.3, 0.15, 0.25)
rvtable(x, y = probs)  # discreteness inferred from y
attr(x, "probabilities") <- probs
rv <- rvtable(x)  # discreteness inferred from attributes
rv
rvattr(rv)

Also note above that probabilities need not sum to one in the discrete case. In this case relative weights suffice and will be rescaled internally by sample rather than passed to density as is the case with continuous random variables.

Increasingly the complexity of the input to rvtable, we arrive at the more common ways to construct an rvtable: from and existing data frame. First is a data frame of values with equal probability.

Note that by leaving out discrete=TRUE in the call to rvtable below, as with a numeric input, x, the data frame input is also assumed to represent a continuous random variable sample. It will be estimated as if a tiny sample from a uniform pdf rather than a fully represented uniform pmf.

x <- data.frame(Val = 1:10, Prob = 0.1)
rv1 <- rvtable(x)
rv1
rvattr(rv1)
rv2 <- rvtable(x, discrete = TRUE)
rv2
rvattr(rv2)

Lastly, we have the addition of an ID column, id, representing a categorical variable. The column names referring to values and their associated probabilities can also be changed.

x <- data.frame(id = rep(LETTERS[1:2], each = 10), v1 = rep(1:10, 2), p1 = c(c(10:1)^2, 
    sqrt(1:10)))
rv <- rvtable(x, Val = "v1", Prob = "p1")
rv
rvattr(rv)

Factors or ID columns like id above, if present, need not be factor in the literal R sense. They need not even be character, but they are columns in an rvtable that will be treated as some kind of discrete variable that can be meaningfully grouped by, for example.

The general assumption is that an rvtable is intended to have a column of values and a column of probabilities, which may describe a continuous or discrete random variable, and optionally some ID variables in other columns that essentially help to define conditional distributions.

In concluding, here are three important properties of rvtable worth being clear on:

  • An rvtable can be created from a data frame that contains a values column but no corresponding probabilities column, analogous to when x is a basic numeric samples vector.

  • rvtable always creates an rvtable in distribution form regardless of input, unless provided a basic sample with no probabilities and force.dist=FALSE.

  • rvtable ignores grouping on purpose. See below.

A note of caution and clarification

Do not include other columns of continuous data or anything you do not want to use as an explicit ID variable alongside the values and probabilities columns; rvtables and the functions that manipulate them work with one implicit random variable defined by one pair of values and probabilities columns. While ID columns add more variables to an rvtable, the focus remains on the main random variable. Any column in an rvtable not specified as a values or probabilities column will be treated as an ID variable.

This is the initial stopgap to using rvtables as not intended. If you pass a data frame whose rows have been grouped by any categorical variables using dplyr::group_by, this grouping information is ignored, but will be passed through. rvtable forcibly groups by all categorical variables present in the input, regardless; this means all columns present that are not one of the two values or probabilities columns.

Subsequent functions honor grouping information, but the assumption and intent here is that any data provided to rvtable to initially construct and rvtable object is relevant, either the key variable or an important ID variable. Furthermore, it is reasonable to assume that any present ID variables are already distinguishing different sets of values and probabilities in a table and it is not intended for the rvtable constructor to also apply any marginalizing operations if it is passed a data frame that has no or only partial grouping.

rvtable always separates all apparent combinations of ID variables’ levels and groups by all of them. When returning an rvtable object, it will reassign any grouping that might have been part of the input rather than strip grouping or retain full grouping.

Examine rvtable attributes

The rvtable class is a class added to a data frame. rvtables are also tibbles (class tbl_df) from dplyr, specifically. The key addition is a more restricted, contextualized format and attachment of a number of class attributes that assist in describing and manipulating rvtables.

[More needed…]

Sampling rvtables

Sampling on an rvtable requires a density-type rvtable. This is the more common form. Sampling results in a new rvtable object that is of sampling-type, meaning its values column represents raw sample values and there is no corresponding column of probabilities tied to those values. This is the less common rvtable type and tends to be explicitly created and used at the end of a chain of data manipulation steps when it finally comes time to do something with the data such as plot it or pass samples to an analysis.

Sampling is done using sample_rvtable. Below, small samples are drawn from rvtables containing continuous and discrete random variables, respectively.

# continuous RV
x <- rvtable(rnorm(1000))
x
rvattr(x)$tabletype
y <- sample_rvtable(x, n = 10)
y
rvattr(y)$tabletype

# discrete RV
x <- rvtable(sample(1:100, 50), discrete = TRUE)
y <- sample_rvtable(x, n = 10)
sample_rvtable(y, n = 8, resample = TRUE)

In the above example, the tabletype entry in the attributes of y is different from that in x. After sampling from the density-type rvtable, the resultant rvtable is in sample form. In the discrete example, sample_rvtable is called on an rvtable that is already in sample form. In this case samples are taken directly from the rvtable values column unless resample=TRUE, which forces the additional step of re-estimation of the pmf or pdf prior to resampling. This is especially useful in the continuous case when the user requires more control over the sampling distribution via density.args prior to resampling.

Returning to the data frame from earlier, sampling is performed by group.

x <- data.frame(id = rep(LETTERS[1:2], each = 10), v1 = rep(1:10, 2), p1 = c(c(10:1)^2, 
    sqrt(1:10)))
rv <- rvtable(x, Val = "v1", Prob = "p1")
sample_rvtable(rv, n = 5)

Aside from x, other arguments to sample_rvtable include:

  • resample as described.
  • n, the sample size.

The sample size defaults to 10,000 because the typical context for the rvtable package is empirical estimation of continuous probability distributions with high-fidelity/flexibility. While sampling comes after density estimation, it is also helpful to retain larger samples if densities are subsequently re-estimated.

  • interp, which defaults to TRUE, applies linear interpolation between samples prior to sampling.
  • n.interp, the sample size resulting from interp=TRUE, defaults to 100,000.
  • decimals is available for rounding samples.
  • density.args takes a list of arguments to pass to density to override defaults, just as in rvtable.

Like the constructor, rvtable, sample_rvtable also forcibly applies its sampling to each identifiable group and does not acknowledge any existing grouping information that has been added to a data frame. The function will not presume that the user intends to implicitly marginalize over other ID variables; computing marginal distributions is intended always to be performed explicitly. For one thing, ID variables might stem from weighted samples of random variables and the user must be aware of instances where it would be incorrect to automatically merge two ID variables with accounting for this. For singling out or collapsing over ID variables, we turn to the next section.

Computing marginal distributions on rvtables

There is not much to say about conditional distributions because this is relatively trivial. A conditional distribution of the random variable described by the values and probabilities columns given, say, an ID variable being equal to one or some subset of some of its levels, is obtained by simply subsetting the rvtable to those rows.

On the other hand, marginalizing over levels of an ID variable to yield a marginal distribution of the primary random variable is more complex. This is done with marginalize, which is a wrapper around merge_rvtable. While the latter is exported by the rvtable package and can make sense to use directly, it is typical to use marginalize, specifically.

Below, marginalize is used to collapse an rvtable by integrating out or marginalizing over categorical variables. The margin argument describes which variables to marginalize over. In the first example, marginalize collapses the rvtable over the levels of id1 and id2.

x <- data.frame(id1 = rep(LETTERS[1:5], each = 4), id2 = factor(c("low", "high")), 
    id3 = rep(1:2, each = 2), Val = rep(1:10, each = 20), Prob = rep(sqrt(1:10), 
        each = 20))
rv <- rvtable(x)
marginalize(rv, margin = c("id1", "id2"))

get_levels can be used to quickly check the unique levels of an ID variable. More informative is to call get_weights, which returns levels and their respective weights. Note that ID columns do not have to be of class factor to be treated as ID/categorical/factor variables, etc.

A reason weights are helpful is because when marginalizing over an ID variable, it is possible to pass a vector of weights or probabilities when integration should account for weighted samples. For example, an rvtable may have equal numbers of rows describing the probability distribution of a continuous random variable conditional on each of two ID variable levels. However, it may be known by the user that each of these levels do not contribute equally to the marginal distribution.

Weights are stored as an rvtable attribute and can get retrieved and set with get_weights and set_weights. Weights should not be confused with the probabilities column in an rvtable that describes the actual random variable probabilities. Weights can also be set for ID variables as part of the call to the rvtable constructor, which calls set_weights internally. By default, weights of each ID variable’s levels are equal (set to one, though what matters is only that they are equal). Unequal weights must be assigned explicitly.

get_levels(rv, "id1")
get_weights(rv, "id1")
wts <- data.frame(levels = LETTERS[1:5], weights = c(1, 1.5, 2, 4, 1))
x <- set_weigths(x, id = "id1", weights = wts)
marginalize(rv, "id1")

Computing an inverse pmf

An inverse pmf can be computed using inverse_pmf. Continuing with the previous rvtable, we obtain the pdf of id1 conditioned on a range of values of the continuous random variable described by the original rvtable as well as the other present factors. The values column is now categorical and takes on the levels associated with id1 in the input.

y1 <- inverse_pmf(rv, val.range = c(5, 8), "id1")
y1

library(dplyr)
x2 <- filter(rv, id2 == "low" & id3 == 1) %>% select(-id2, -id3) %>% rvtable
y2 <- inverse_pmf(x2, c(5, 8), "id1")
y2

The difference between the two calls to inverse_pmf above is that in the second call, the rvtable has first been subset to one combination of levels of id2 and id3. In both cases, the pmf of id1 is conditional on values of id2, id3 and the continuous random variable represented by values and probabilities in the original rvtable; it is merely not including all levels of id2 and id3 in the second example.

Here is another, highly simplified example where the primary variable is now discrete and, like the single additional ID variable, has only two unique values:

x2 <- data.frame(id1 = c("A", "B"), Val = c(1, 1, 2, 2), Prob = c(0.25, 0.5, 
    0.75, 0.5)) %>% rvtable(discrete = TRUE)
inverse_pmf(x2, 1, "id1")
inverse_pmf(x2, 2, "id1")
inverse_pmf(x2, 1:2, "id1")

For more information about the rvtable package, see the help documentation

Density re-estimation and resampling

It is important to be aware of the level of possible signal degradation that can occur with some combination of:

  • the density resolution being too low, leading to a more discretized density.
  • the bandwidth being too high, leading to a more highly smoothed, less flexible density.
  • the number of times that a density is sampled from and the size of the sample
  • the number of times a density is re-estimated from a sample, such as happens when computing marginal distributions, which requires sampling from multiple densities, combining weighted or unweighted samples, and estimating a density from the pooled sample.

In general, the fewer cycles of sampling and density estimation the better. However, if is not difficult to force retention of the original density estimation within acceptable tolerances for an analysis if the sampling size is large enough and the density estimation is fine-grained and flexible enough. The important thing is that what is “enough” is always something that should be directly assessed. What is enough will be different for different types of data and random variables as well as varying suitability for different purposes.

In the examples below, the same data sample is used with different combinations of parameters for density estimation and sampling. A number of resampling and re-estimation cycles are repeated in each case and signal degradation is plotted for select iterations. In typical use cases, re-estimation is restricted to just a few times; examples where there is a reason to perform this kind of cycling many more times are difficult to envision. Nonetheless, the number of iterations used here is relatively extreme.

Use case example

As an example, say we have a large collection of high resolution maps that each contain many millions of pixels. The entire set may contain billions or even trillions of pixels, not to mention that the degree of spatial autocorrelation is so extreme that from the outset we know that for many purposes the data set, while containing much data, contains relatively little information.

Various statistical analyses that may need to be performed with data may not require anywhere near this amount or resolution of data. There is much room for data reduction and the key is striking the right balance for the data and the type of analysis. However, merely accessing it all to get started can be a computational challenge, one that is ideally revisited as seldom as possible for subsequent analyses. It is much easier to revisit a smaller, derived data set that is still statistically sufficient for the analysis at hand.

At one extreme, someone with a programming background but no knowledge of probability and statistics might take a brute force approach, attempting, often in vain and not without wasting vast amounts of time, to use every pixel even if their goal is to simply calculate the mean all pixels when using a tiny sample would do just as well. When asked later for the standard deviation, they will repeat the brute force attack on the massive data set and obtain the next value.

At the other extreme, rather than applying judicious sampling and efficient data reduction methodologies, one might simply reduce everything in one go to statistics such as the mean and standard deviation. But like the previous instance, something new and unanticipated is often required later. Reducing all the data down to a set of statistics can preclude all kinds of analyses, requiring the analyst to revisit the burden of accessing the massive source data yet again, all the while wishing they could have foreseen the need last time for whatever they are accessing it for this time around.

rvtable bridges this gap for many use cases like this, where the happy medium, the Goldilocks zone, is to model and store empirical estimates of probability distributions from high resolution source data or massive sample or population data. Downstream analyses can depend on such a derived data set, not needing to go to the source to be able to compute arbitrary statistics from full distributional information, nor being hamstrung by overly aggregated data.

Specifically, imagine using the R function, density, to empirically estimate the continuous probability distribution of a set of observations. It returns a list with x and y vectors that together describe the estimated density curve. If signal fidelity is of paramount importance and any further data aggregation should be postponed until later in an analysis pipeline, one can model a higher resolution estimated density with greater flexibility by increasing the number of points and/or decreasing the bandwidth. Whereas doing so might necessitate a x and y vectors containing 1,000 values each and there might be 1,000 of these for 1,000 different data sets, resulting in carrying 2 million total values through the bulk of a processing chain and steps of an analysis, these estimated density curves could capture all the information needed from source data sets that are far larger.