Using Normal Prediction Intervals on Symmetric Beta-Distributed Data

Author

Shirley Toribio

Introduction

Symmetric beta distributions and normal distributions are very similar to each other in shape. Because of this, in this project I examine how a 95% prediction interval that assumes the data originate from a normal distribution fares, in terms of prediction accuracy and actual coverage probability, when the data are actually beta-distributed. I define a prediction interval with “good” prediction accuracy as one that covers the true mean of the data’s parent distribution, and a prediction interval with good coverage probability as one that covers approximately .95 of the parent beta distribution. All data are generated from beta distributions with mean .5.

Simulations

First, I tidy up.

library(tidyverse)

Then, I define a few functions that will help me in my task.

PI <- function(data, coverage_prob){
  # Generates a normal prediction interval with an intended coverage probability
  # of coverage_prob based on a vector of numeric data
  n <- length(data)
  lower_tscore <- qt((1 - coverage_prob)/2, df = n - 1)  # lower t quantile
  upper_tscore <- qt((1 + coverage_prob)/2, df = n - 1)  # upper t quantile
  avg <- mean(data)
  stan_d <- sd(data)
  lower_bound <- avg + lower_tscore * stan_d * sqrt(1 + (1/n))
  upper_bound <- avg + upper_tscore * stan_d * sqrt(1 + (1/n))
  return(data.frame(PI_percentage = coverage_prob, lower = lower_bound, upper = upper_bound))
}

PI is a function that takes a vector of numeric data and produces a high-density prediction interval intended to cover the proportion of that data’s parent distribution given by coverage_prob. This prediction interval is constructed under the assumption that the data are normally distributed (although we know they are not). Let \(s\) be the sample standard deviation of the data, \(n\) the sample size, \(\bar{x}\) the sample mean, and \(t_{p,n-1}\) the quantile associated with the highest-density coverage probability \(p\) in a \(t\) distribution with \(n-1\) degrees of freedom. If we assume the data are normal, then the highest-density prediction interval with coverage probability \(p\) is estimated as

\[\bar{x} \pm t_{p,n-1}\, s \sqrt{1+\frac{1}{n}}\] PI uses this formula to generate the bounds of its prediction interval.
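As a quick sanity check (a minimal sketch; the seed and the Beta(50, 50) sample below are arbitrary choices), PI can be applied to a small simulated sample:

set.seed(1)                 # arbitrary seed so this check is reproducible
PI(rbeta(30, 50, 50), .95)  # bounds should land roughly symmetric around .5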

one_beta_simulation <- function(n, alpha, beta, pi_prop){
  # Assesses prediction accuracy and actual coverage probability of a normal
  # prediction interval computed from n points drawn from a beta distribution
  # with parameters alpha and beta.
  
  cover_df <- PI(rbeta(n, alpha, beta), pi_prop)
  
  # proportion of the data's parent distribution actually covered by the
  # normal prediction interval generated for the data
  cover_prop <- pbeta(cover_df[1, "upper"], alpha, beta) - pbeta(cover_df[1, "lower"], alpha, beta)
  
  mean_in_interval <- .5 >= cover_df[1, "lower"] & .5 <= cover_df[1, "upper"]
  param_df <- data.frame(cover = cover_prop, alpha = alpha, beta = beta, mean_in_interval = mean_in_interval)
  df <- cbind(cover_df, param_df)
  return(df)
}

one_beta_simulation randomly samples \(n\) data points from a beta distribution with parameters alpha and beta and computes a normal prediction interval for the sample using PI. It then determines what proportion of the parent beta distribution is covered by the prediction interval and checks whether the mean of that distribution falls inside the interval.
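For example (a single illustrative run; the argument values are arbitrary), one row of simulation output can be produced directly:

one_beta_simulation(30, 20, 20, .95)  # one sample of size 30 from Beta(20, 20)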

beta_sims_n <- function(n){
  # Iterates over the global vector `parameters` of possible alpha = beta values
  # and applies one_beta_simulation to each value, using the global `pi` (defined
  # below) as the intended coverage probability. All simulations use samples of size n.
  df1 <- map(parameters, \(param) one_beta_simulation(n, param, param, pi)) %>%
    list_rbind()
  df2 <- data.frame(n = rep(n, nrow(df1)))
  df <- cbind(df2, df1)
  return(df)
}

beta_sims_n maps over a vector of candidate parameter values for beta distributions with mean \(.5\), calling one_beta_simulation once per value, with every iteration using the same sample size. In effect, it repeatedly draws \(n\) points from different beta distributions with mean .5, generates a normal prediction interval for each sample, and records the proportion of each sample’s parent distribution that the interval actually covers. Note that it depends on two globals defined below: parameters, the vector of alpha = beta values, and pi, the intended coverage probability (which masks R’s built-in constant pi within this script).
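If this function were to be reused outside this script, a variant that takes the parameter vector and coverage probability as explicit arguments would avoid the reliance on globals. A minimal sketch (beta_sims_n_explicit is a hypothetical name, and this version is not the one run below):

beta_sims_n_explicit <- function(n, params, coverage_prob){
  # Same logic as beta_sims_n, but with the parameter vector and intended
  # coverage probability passed in explicitly instead of read from globals
  map(params, \(param) one_beta_simulation(n, param, param, coverage_prob)) %>%
    list_rbind() %>%
    mutate(n = .env$n, .before = 1)  # .env$n: the function argument, not a column
}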

Time to simulate over different parameter values and sample sizes!

parameters <- seq(5, 200, by = 2)  # candidate alpha = beta values
n <- 2:500                         # sample sizes to simulate
pi <- .95                          # intended coverage probability (masks base R's pi constant)

beta_df <- map(n, \(n) beta_sims_n(n)) %>%
  list_rbind()
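Because every run draws fresh random samples, the exact numbers below will differ from run to run. Setting a seed immediately before the map() call would freeze them; the value here is arbitrary:

set.seed(2024)  # arbitrary seed; would need to run before the map() call above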

This is a glimpse at the results of the simulations.

slice_sample(beta_df, n = 10)  # ten random rows from the simulation results
     n PI_percentage     lower     upper     cover alpha beta mean_in_interval
1  210          0.95 0.4071398 0.5942892 0.9683227    65   65             TRUE
2  346          0.95 0.4326949 0.5659221 0.9639460   123  123             TRUE
3  285          0.95 0.4423621 0.5587756 0.9231621   115  115             TRUE
4  174          0.95 0.4276656 0.5688712 0.9633999   109  109             TRUE
5  462          0.95 0.4453036 0.5547893 0.9537816   165  165             TRUE
6  304          0.95 0.3918273 0.6023088 0.9556359    45   45             TRUE
7  207          0.95 0.3991949 0.5987504 0.9576941    51   51             TRUE
8  345          0.95 0.3302761 0.6630862 0.9524835    17   17             TRUE
9  300          0.95 0.3409325 0.6655893 0.9464347    17   17             TRUE
10 364          0.95 0.4378304 0.5609452 0.9451670   121  121             TRUE

This is a random sample of rows from the full dataset, which has 48902 rows and 8 columns. Each row corresponds to one simulated random sample of size n from a beta distribution with parameters alpha and beta. For each random sample, a normal prediction interval was generated with bounds “lower” and “upper”. “PI_percentage” is the intended coverage probability of that prediction interval, and “cover” is its actual coverage probability over the beta distribution the data were generated from. “mean_in_interval” is a logical variable indicating whether the prediction interval covered the mean of the distribution.
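As a quick consistency check on that row count (a small sketch using the objects defined above), the dataset should contain one row per combination of parameter value and sample size:

length(parameters) * length(n)  # 98 parameter values x 499 sample sizes = 48902
nrow(beta_df)                   # should match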

Insights

n_means_df <- beta_df %>%
  mutate(diff = cover - PI_percentage) %>%
  group_by(n) %>%
  summarize(mean = mean(diff), mu_in_interval = sum(mean_in_interval)/n())
avg <- mean(n_means_df$mean)
ggplot(n_means_df, aes(x = n, y = mean)) + 
  geom_point() + 
  geom_hline(yintercept = 0, col = "black")  +
  geom_hline(yintercept = avg, col = "red")  +
  labs(
    x = "sample size",
    y = "mean difference between actual and intended coverage",
    title = "Figure 1",
    subtitle = "difference between actual and intended coverage probability based on sample size",
  )

A scatter plot titled difference between actual and intended coverage probability based on sample size. The x-axis is sample size and the y-axis is the mean difference between actual and intended coverage probability. Each point corresponds to the mean difference between actual and intended coverage probability for simulations of a specific sample size. There are two horizontal lines: one at zero and one at the mean difference between actual and intended coverage probability over all simulations, with the latter above the former. Points have more vertical spread the closer their x-value is to zero and less vertical spread as their x-value increases.

Figure 1 graphs the mean difference between the actual and intended coverage probabilities (calculated as actual minus intended) for each sample size from 2 to 500. A negative mean difference indicates the actual coverage probability tended to fall below the intended coverage probability, which is undesirable, while a positive mean difference indicates the actual coverage probability tended to exceed the intended coverage probability. The mean difference appears to converge to a value slightly above 0 as the sample size increases, indicating that the actual coverage probability tends to slightly exceed the intended coverage, and that this slight over-coverage becomes more consistent as the sample size grows.

For small sample sizes the mean difference between actual and intended coverage probability varies around 0 much more than it does for larger sample sizes, which can be attributed to sampling variability. Overall, the absolute mean differences don’t appear to extend beyond .02, indicating that normal prediction intervals tend to have good coverage probability for symmetric beta-distributed data.
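That .02 figure can be checked directly from n_means_df rather than read off the plot (whether it holds exactly will vary from run to run, since the simulations are random):

max(abs(n_means_df$mean))  # largest absolute mean difference across sample sizes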

n_means_df <- beta_df %>%
  mutate(diff = cover - PI_percentage) %>%
  group_by(n) %>%
  summarize(mean = mean(diff), mu_in_interval = sum(mean_in_interval)/n()) %>%
  filter(n %in% 1:30)
ggplot(n_means_df, aes(x = n, y = mu_in_interval)) + 
  geom_point()  +
  labs(
    x = "sample size",
    y = "proportion of mu-inclusive prediction intervals",
    title = "Figure 2",
    subtitle = "proportion of mu-inclusive prediction intervals based on sample size"
  )

A scatter plot titled proportion of mu-inclusive prediction intervals based on sample size. The x-axis is sample size and the y-axis is the proportion of mu-inclusive prediction intervals. Each point corresponds to the proportion of simulations of a specific sample size whose prediction intervals include the mean of the data's parent distribution, which is .5. The x-axis ranges from two to thirty. The points all have a y-value of one when the x-value is four or greater.

Figure 2 plots the proportion of prediction intervals that cover mu = .5 for each sample size from 2 to 30. These points reach 1 at a sample size of around 4, meaning that for sample sizes of 4 or greater, essentially every random sample produced a normal prediction interval that covered the mean of the beta distribution the data originated from.
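The same information can be pulled from n_means_df directly; this filters for the sample sizes where at least one interval missed the mean (based on Figure 2, only sample sizes below 4):

filter(n_means_df, mu_in_interval < 1)  # sample sizes where some intervals missed mu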

ns_of_interest <- c(5, 10, 30, 50, 100)
beta_df_2 <- filter(beta_df, n %in% ns_of_interest) %>%
  mutate(sample_size = as.factor(n))
ggplot(beta_df_2, aes(x = alpha, y = cover, color = sample_size)) +
  geom_point() +
  geom_smooth(se = FALSE) +  # one smooth per sample size, grouped by color
  labs(
    x = "value of alpha and beta",
    y = "actual coverage probability",
    title = "Figure 3",
    subtitle = "actual coverage probability based on parameter values"
  )

A scatter plot titled actual coverage probability based on parameter values. Each point represents one simulation and has a specific color. There are five colors representing sample sizes five, ten, thirty, fifty, and one hundred. There are also five lines in the plot. Each line is the line of best fit for points corresponding to a specific color. The x-axis is value of alpha and beta, and the y-axis is actual coverage probability. All the lines tend to overlap. The points corresponding to sample sizes five and ten are more spread out vertically than the points corresponding to other sample sizes.

Figure 3 plots the actual coverage probability against each value of alpha and beta (recall that alpha and beta are equal in every simulation). Points and lines are colored by sample size, as described in the legend. All the lines of best fit appear close to horizontal, meaning there is most likely no relationship between the value of alpha/beta and the actual coverage probability of the data’s prediction interval. Based on the large overlap between lines, normal prediction intervals based on small samples don’t seem more likely to deviate on average from the intended coverage probability of the data’s parent symmetric beta distribution, although they do show more variability around their mean coverage probability.
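One rough way to back up the “close to horizontal” reading is to fit a straight line of cover against alpha within each sample size and inspect the slopes (a sketch: it uses simple linear fits rather than the smooths drawn above):

beta_df_2 %>%
  group_by(sample_size) %>%
  summarize(slope = coef(lm(cover ~ alpha))[2])  # slopes should all be near zero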

Conclusion

By generating data from symmetric beta distributions with mu = .5 and constructing normal prediction intervals from those data, I was able to assess the prediction accuracy and actual coverage probability of normal prediction intervals when applied to symmetric beta data. Even for small sample sizes, normal prediction intervals appear to have both good prediction accuracy and good coverage probability.