SE Side-Adventure - Traditional vs Bayesian Bootstrap SEs

Notes
Bootstrap
Standard Errors
Robust
Clustered
Bayesian
Random Weights
Boots!
Author

C. Luke Watson

Published

September 13, 2024

This is a side-adventure to the last two posts. I wanted to code up a bootstrap using NLS, but honestly I needed to practice doing it with just OLS. We all learn about the bootstrap, but the Bayesian bootstrap from Rubin (1981) was probably unknown to most prior to Peter Hull's 2022 Twitter thread.

I primarily learned this material in classes or saw it on Twitter, so I have no specific citations. However, I did end up using Claude.ai more than in the last two posts to debug errors and find more efficient solutions.

Traditional vs Bayesian Bootstrap Intuition

I don’t know as much about bootstrapping as I would like, so this will be heavy on intuition.

The traditional bootstrap uses the idea that if there is sampling variation we want to account for, we can replicate the sampling process using our own data by resampling it according to an assumed sampling process. To implement: (1) sample the data \(B\) times with replacement, (2) for each sample \(b\), calculate the main statistic of interest (e.g., \(\hat{\beta}_{b}\)), and (3) calculate the bootstrap SE via the SD of the bootstrapped statistics (e.g., \(\mathsf{sd}_{b}(\hat{\beta}_b)\)) or calculate CIs. For example, for robust SEs we resample the observations with replacement, while for cluster-robust SEs we sample the clusters with replacement. This works well enough with most models; however, it can get tricky when the model is non-linear or has fixed effects. Sometimes the estimation can fail, which is very annoying… especially if one did not code defensively.
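To fix ideas, here is a minimal sketch of those three steps for the simplest possible statistic, the mean of a toy vector. The data x, the number of draws B, and the percentile CI are illustrative choices here, not part of the simulation used later.

# Minimal traditional bootstrap sketch: SE and percentile CI for a sample mean
set.seed(42)
x <- rnorm(100, mean = 2)              # toy data (illustrative only)
B <- 1000                              # number of bootstrap samples
theta.boot <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
sd(theta.boot)                         # bootstrap SE of the sample mean
quantile(theta.boot, c(0.025, 0.975))  # percentile 95% CI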

The Bayesian bootstrap essentially replicates the sampling process but with sample weights that mimic the resampling. To implement: (1) create \(B\) sets of random weights \(\omega_{b}\) that follow the Dirichlet distribution (more later), (2) estimate a weighted version of the main statistic of interest, and (3) calculate the bootstrap SE via the SD of the bootstrapped statistics or calculate CIs. This makes intuitive sense, but it is not actually as straightforward as one may initially think. I won't get into any of that. The main benefit is that weighting the data is typically a much simpler operation and much less likely to cause estimation failure.
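For comparison, here is the same toy example done with random weights instead of resampling, assuming the flat Dirichlet(1, …, 1) case: the weights are Gamma(1,1) (i.e., Exponential(1)) draws normalized to sum to one, and a weighted mean replaces the resampled mean.

# Minimal Bayesian bootstrap sketch: weighted mean with Dirichlet(1,...,1) weights
set.seed(42)
x <- rnorm(100, mean = 2)  # same toy data as above (illustrative only)
theta.bb <- replicate(1000, {
  w <- rgamma(length(x), shape = 1, rate = 1)  # Gamma(1,1) = Exponential(1)
  w <- w / sum(w)                              # normalize -> Dirichlet(1,...,1)
  sum(w * x)                                   # weighted mean, no resampling
})
sd(theta.bb)  # Bayesian bootstrap SE of the mean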

Set up (same as Pt2)

I will simulate the data the same as in Pt2.

# clear workspace
rm(list=ls())
gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  615947 32.9    1253154   67         NA  1023793 54.7
Vcells 1114525  8.6    8388608   64     102400  1839961 14.1
# libraries
library(tidyverse)  # general data manipulation
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)      # model results to data.frame for compact viewing
library(optimx)     # optim + extras
library(sandwich)   # Adjust standard errors
library(lmtest)     # Recalculate model errors with sandwich functions with coeftest()
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
library(rlang)      # helps with functions

Attaching package: 'rlang'

The following objects are masked from 'package:purrr':

    %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
    flatten_raw, invoke, splice
# Set random seed for reproducibility
set.seed(101)

# obs - high so that coefficients are well estimated
n <- 10000

# clusters
num.clusters <- 50
grps <- sort(floor(runif(n)*num.clusters)+1)

# cluster effects
cluster.y.effect <- rnorm(num.clusters,0,2)
cluster.u.effect <- rnorm(num.clusters,0,2)
cluster.x1.effect <- rnorm(num.clusters,0,0.2)
cluster.x2.effect <- rnorm(num.clusters,0,0.2)
cluster.x1.x2.effect <- rnorm(num.clusters,0,0.2)

# two x vars (will add a constant)
x1=rnorm(n,1,1 + cluster.x1.effect/3)
x2=rnorm(n,1,2)

# clustered errors
sd <- runif(n, min=0.5, max=4)
s.true <- sd*(x1/5 + 1) # need to ensure that all s's are positive
error <- rnorm(n,0,s.true) +
  cluster.u.effect[grps] +
  x1*cluster.x1.effect[grps] + x2*cluster.x2.effect[grps] +
  x1*x2*cluster.x1.x2.effect[grps]
error <- error - mean(error)

# true coefficients
coef.true <- c(1,-4,2)

# Generate the dependent variable
y1 <- coef.true[1] + (coef.true[2]*x1) + (coef.true[3]*x2) + error

Canned SEs

Here, I’ll quickly show how to get non-robust, robust, and clustered SEs. Note that this approach uses the lmtest and sandwich packages. If I had a lot of FEs, then I would likely use the feols package from fixest.

# Model, non robust
m1.nonrobust <- lm(y1 ~ x1 + x2)

# Robust standard errors
m1.robust <- coeftest(m1.nonrobust,vcov = vcovHC)
# clustered
m1.clu <- coeftest(m1.nonrobust, vcov = vcovCL, cluster = ~grps)

# Function to extract standard errors
get_se <- function(model) {
  if (inherits(model, "lm")) {
    return(sqrt(diag(vcov(model))))
  } else {
    return(model[, "Std. Error"])
  }
}

# Create a data frame with all standard errors
se_comparison <- data.frame( 
  Coef = m1.nonrobust$coefficients,
  Non_Robust = get_se(m1.nonrobust),
  Robust = get_se(m1.robust),
  Clustered = get_se(m1.clu)
)

# Round all numeric columns to 4 decimal places
se_comparison <- se_comparison %>%
  mutate(across(where(is.numeric), ~round(., 4)))

# Display the table
se_comparison
               Coef Non_Robust Robust Clustered
(Intercept)  0.9445     0.0541 0.0482    0.2640
x1          -3.9659     0.0361 0.0371    0.0524
x2           2.0216     0.0181 0.0189    0.0456

Traditional Bootstrap

First, I am going to show a traditional bootstrap. I am going to create a bootstrap function that can do either full-sample sampling or cluster-sampling. Cluster-sampling is best if we believe the errors are dependent within clusters; if observations are independent, then full-sample sampling is best.

df <- data.frame(y1=y1,
                 x1=x1,
                 x2=x2,
                 grps = grps) %>%
  mutate(id = row_number())

rm(y1,x1,x2,grps)

##### Traditional Bootstrap function
f.traditional.bootstrap <- function(arg.data,arg.model.formula, 
                                    arg.boots = 10,arg.cluster = NULL) {
  ############################################################
  # get cluster variable title
  arg.cluster_quo <- enquo(arg.cluster)
  
  # do this only once but only if clustered
  if (!is.null(substitute(arg.cluster))) {
    unique_clusters <- arg.data %>% distinct(!!arg.cluster_quo) 
    n_clusters <- nrow(unique_clusters)
  }
  
  # Initialize results matrix
  results <- matrix(NA,
                    nrow = length(coef(lm(arg.model.formula, data = arg.data))), 
                    ncol = arg.boots)
  
  # Perform bootstrap
  for (i in 1:arg.boots) {
    if (is.null(substitute(arg.cluster))) {
      # Full-Sample (Independent) Sampling
      # --> sample obs with replacement
      df.boot <- arg.data %>%
        ungroup() %>%
        slice_sample(n=nrow(.), replace = T)
    } 
    else {
      # Cluster-sampling (Dependent) 
      # --> Sample clusters with replacement
      sampled_clusters <- unique_clusters %>%
        slice_sample(n = n_clusters, replace = TRUE)
      
      # Join the sampled clusters back to the original data;
      # a cluster drawn k times appears k times
      df.boot <- arg.data %>%
        right_join(sampled_clusters, 
                   by = as.character(quo_name(arg.cluster_quo)),
                   relationship = "many-to-many")
    }
    # store OLS coefs for each iteration
    results[,i] <- as.numeric(lm(formula = arg.model.formula,data=df.boot)$coefficients)
  }
  
  # Calculate bootstrap standard errors via sd
  boot_se <- apply(results, 1, sd)
  # 
  return(boot_se)
  ############################################################
}

# 
se_comparison$trad.ind <- f.traditional.bootstrap(df, as.formula("y1 ~ x1 + x2"), arg.boots = 500)
se_comparison$trad.clu <- f.traditional.bootstrap(df, as.formula("y1 ~ x1 + x2"), arg.boots = 500, arg.cluster=grps)

# Round all numeric columns to 4 decimal places
se_comparison <- se_comparison %>%
  mutate(across(where(is.numeric), ~round(., 4)))

# Display the table
se_comparison
               Coef Non_Robust Robust Clustered trad.ind trad.clu
(Intercept)  0.9445     0.0541 0.0482    0.2640   0.0497   0.2598
x1          -3.9659     0.0361 0.0371    0.0524   0.0364   0.0509
x2           2.0216     0.0181 0.0189    0.0456   0.0190   0.0447

Bayesian Bootstrap

Random Weights via Dirichlet distribution

Next, I want to show how to do the Bayesian bootstrap. As discussed, the weights need to be drawn from a Dirichlet distribution. We can draw \(\vec{x}\sim \text{Dirichlet}(\vec{\alpha})\) by drawing independent \(y_{i} \sim \text{Gamma}(\text{shape}=\alpha_{i}, \text{rate}=1)\) and then calculating \(x_{i} = y_{i} / \sum_{j\in N} y_{j}\). Wikipedia says that for \(\vec{\alpha}=\vec{1}\) it can also be done using the uniform distribution, since the gaps between sorted \(\text{Uniform}(0,1)\) draws are jointly \(\text{Dirichlet}(1,\dots,1)\); it calls this ‘stick breaking.’ Both are implemented below.

# Function to sample from Dirichlet distribution using base R
rdirichlet_base <- function(num.vars, alpha) {
  K <- length(alpha)
  x <- matrix(rgamma(num.vars * K, shape = alpha, rate = 1),
              nrow = num.vars, byrow = TRUE)
  x_sum <- rowSums(x)
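  # each row of x / x_sum sums to 1; scaling by K makes the weights average 1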
  return(as.numeric((x / x_sum)*K))
}

# Function to sample from Dirichlet(1,...,1) using the stick-breaking method
rdirichlet_stick_breaking <- function(num.vars, num.draws) {
  #
  samples <- matrix(nrow = num.vars, ncol = num.draws)
  #
  for (i in 1:num.vars) {
    # Generate num.draws - 1 uniform random numbers
    u <- runif(num.draws-1)

    # Add 0 and 1, then sort
    breaks <- sort(c(0, u, 1))

    # Calculate differences
    samples[i,] <- diff(breaks)
  }
  # scale by num.draws so the weights average to 1
  return(as.numeric(samples*num.draws))
}

# Draw one sample of 100 weights from each approach
sample.1 <- as.numeric(rdirichlet_base(num.vars = 1, alpha = rep(1,100)))
sample.2 <- as.numeric(rdirichlet_stick_breaking(num.vars = 1, num.draws=100))
sample.3 <- as.numeric(rdirichlet_base(num.vars = 1, alpha = sample(1:10,100,replace=T)))

plot(sort(sample.1))
lines(sort(sample.2), col="red")
lines(sort(sample.3), col="blue")

hist(sample.1)

hist(sample.3)

Bayesian Bootstrap Function

##### Bayesian Bootstrap
f.bayesian.bootstrap <- function(arg.data,arg.model.formula,arg.boots = 10,arg.cluster = NULL) {
  ############################################################
  # 
  arg.cluster_quo <- enquo(arg.cluster)
  
  # Initialize results matrix
  results <- matrix(NA,
                    nrow = length(coef(lm(arg.model.formula, data = arg.data))), 
                    ncol = arg.boots)
  
  # Perform bootstrap
  for (i in 1:arg.boots) {
    if (is.null(substitute(arg.cluster))) {
      # For robust standard errors, each observation gets its own weight
      arg.data <- arg.data %>%
        ungroup() %>%
        mutate(wts = rgamma(n(), shape = 1, rate = 1),
               wts = wts / sum(wts))
    } 
    else {
      # For cluster-robust, each cluster gets a weight
      arg.data <- arg.data %>%
        ungroup() %>%
        distinct(!!arg.cluster_quo) %>%
        mutate(uu = rgamma(n(), shape = 1, rate = 1),
               wts = uu / sum(uu)) %>%
        select(!!arg.cluster_quo, wts) %>%
        full_join(arg.data, by = as.character(quo_name(arg.cluster_quo)))
    }
    results[,i] <- as.numeric(lm(formula = arg.model.formula,data=arg.data,weights = wts)$coefficients)
    # drop wts so the next iteration starts fresh (matters for the cluster join)
    arg.data <- arg.data %>%
      select(-wts)
  }
  
  # Calculate bootstrap standard errors
  boot_se <- apply(results, 1, sd)
  return(boot_se)
}

# 
se_comparison$bayes.ind <- f.bayesian.bootstrap(df, as.formula("y1 ~ x1 + x2"), arg.boots = 500)
se_comparison$bayes.clu <- f.bayesian.bootstrap(df, as.formula("y1 ~ x1 + x2"), arg.boots = 500, arg.cluster=grps)

# Round all numeric columns to 4 decimal places
se_comparison <- se_comparison %>%
  mutate(across(where(is.numeric), ~round(., 4)))

# Display the table
se_comparison
               Coef Non_Robust Robust Clustered trad.ind trad.clu bayes.ind bayes.clu
(Intercept)  0.9445     0.0541 0.0482    0.2640   0.0497   0.2598    0.0485    0.2627
x1          -3.9659     0.0361 0.0371    0.0524   0.0364   0.0509    0.0375    0.0498
x2           2.0216     0.0181 0.0189    0.0456   0.0190   0.0447    0.0195    0.0401