Manual Calculation of SEs from Optimization

Notes
Optimx
Standard Errors
Robust
Gradient
Hessian
Sandwich
SEs after Optimx
Author

C. Luke Watson

Published

September 4, 2024

For a project, I plan on calculating standard errors for a Non-linear Least Squares estimation approach, but before I can do that I need a refresher. After some googling and checking my reference books, I was surprised that I could not find a simple yet comprehensive bit of example code for a non-ML, NLS approach. For below, I consulted two sets of slides from A. Colin Cameron (slides1, slides2), as well as F.Hayashi’s book. I also used Claude.ai to debug some errors.

Set up

# clear workspace
rm(list=ls())
gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  615510 32.9    1253197   67         NA   953310 51.0
Vcells 1109196  8.5    8388608   64     102400  1839961 14.1
# libraries
library(tidyverse)  # general data manipulation
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)      # model results to data.frame for compact viewing
library(optimx)     # optim + extras
library(sandwich)   # Adjust standard errors
library(lmtest)     # Recalculate model errors with sandwich functions with coeftest()
Loading required package: zoo

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
# Set random seed for reproducibility
set.seed(1701)

Create Data

I am adding heteroskedasticity so that I can see the difference between non-robust and robust SEs.

# obs - high so that coeficients are well estimated
n <- 10000

# two x vars (will add a constant)
x1=rnorm(n,1,1)
x2=rnorm(n,1,2)

# heteroskedasticity
sd <- runif(n, min=0.5, max=4)
s.true <- sd*(x1/5 + 1) # need to ensure that all s's are positive
error <- rnorm(n,0,s.true)

# true coefficents
coef.true <- c(1,-4,2)

#Generate the dependent variable
y1 <- coef.true[1] + (coef.true[2]*x1) + (coef.true[3]*x2) + error

Canned OLS

# OLS with canned procedure 
m1 <- lm(y1 ~ x1 + x2)
tidy(m1)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     1.01    0.0450      22.4 2.02e-108
2 x1             -3.98    0.0299    -133.  0        
3 x2              1.99    0.0152     131.  0        

Estimating via Optimization

This is really where I need the refresher. I found that most treatments wrote up the theory using general functions, but gave the examples with application specific functional forms. To me, this obscured more than helped. I find that the best thing to do is to directly code the theory to really get what is going on and get the most generalizable (application agnostic) code… plus I think it helps one understand the theory more.

My goal is to replicate the standard errors using as general coding as possible. I think the key to this is by specifying a residual function directly and calling that within the SSR function. Even though I am doing OLS, I never actually use any functional form related to OLS outside of the residual function.

# We are going to code of our sum of squared residuals function
# first, write the residual function 
f.resid <- function(coef.arg){
  b0 <- coef.arg[1]
  b1 <- coef.arg[2]
  b2 <- coef.arg[3]
  u.i <- (y1 - (b0 + b1*x1 + b2*x2))
  return(u.i)
}
# second, write the SSR function
# note: that we divide by 2! 
#       if we do not, then coefs are true but SEs are wrong!
f.ssr <- function(coef.arg){
  # coef.arg <- coef.init
  u2 <- (f.resid(coef.arg))^2
  ssr <- sum(u2)/2
  if(is.na(ssr) | is.nan(ssr)) return(Inf)
  return(ssr)
}

# initial coef values
coef.init <- c(1,1,1)

# run optimx
# note: (1) use BFGS since no constraints, (2) request hessian 
out.reg <- optimx(coef.init,
                  f.ssr,method = c("BFGS"),
                  control = list(maxit=500000,trace=1),
                  hessian=T
)
fn is  fn1 
Looking for method =  BFGS 
Function has  3  arguments
Analytic gradient not made available.
Analytic Hessian not made available.
Scale check -- log parameter ratio= 0   log bounds ratio= NA 
Method:  BFGS 
initial  value 268911.921901 
final  value 45464.216631 
converged
Post processing for method  BFGS 
Successful convergence! 
Compute Hessian approximation at finish of  BFGS 
Compute gradient approximation at finish of  BFGS 
Save results from method  BFGS 
$par
[1]  1.007687 -3.977331  1.991223

$value
[1] 45464.22

$message
NULL

$convcode
[1] 0

$fevals
function 
      31 

$gevals
gradient 
       6 

$nitns
[1] NA

$kkt1
[1] TRUE

$kkt2
[1] TRUE

$xtimes
user.self 
     0.02 

Assemble the answers
# final coefs
coef.fin <- as.numeric(out.reg[1,1:3])
coef.fin
[1]  1.007687 -3.977331  1.991223

Non-robust SEs

Non-robust SEs are the degrees-of-freedom adjusted residual SD scaled squared root of the inverse hessian.

# calculate residuals at final estimate
resid <- f.resid(coef.fin)

# get DoF adjusted SD of resid(b.star)
s2.hat <- sd(resid)*sqrt(((n-1)/(n-3)))
# get hessian from details
hessian <- attributes(out.reg)$details["BFGS", "nhatend"][[1]]
# calculate the non-robust SEs 
se.hessian <- sqrt(diag(base::solve(hessian)))*s2.hat

# compare SEs from canned and manual
tidy(m1)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     1.01    0.0450      22.4 2.02e-108
2 x1             -3.98    0.0299    -133.  0        
3 x2              1.99    0.0152     131.  0        
print(round(se.hessian,4))
[1] 0.0450 0.0299 0.0152

Canned Robust SEs

# Robust standard errors with lm()
m1.robust <- coeftest(m1,vcov = vcovHC)
tidy(m1)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     1.01    0.0450      22.4 2.02e-108
2 x1             -3.98    0.0299    -133.  0        
3 x2              1.99    0.0152     131.  0        
tidy(m1.robust)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     1.01    0.0382      26.3 5.89e-148
2 x1             -3.98    0.0312    -127.  0        
3 x2              1.99    0.0155     129.  0        

Manual Robust SEs

Note, for the robust SEs we need a more intensive estimator. First, we needed the jacobian (vector valued gradient) of the residual function, which is in n-by-k. Second, we need to form the meat using that jacobian and the covariance matrix of the residuals. Third, we need to get the inverse of the covariance of the jacobian. Finally, we make a sandwich.

And that’s it.

# Robust SEs manual need the jacobian (multivalued function gradient) wrt the parameters
# we will do this numerically (although it is analytic for OLS)
resid.grad <- numDeriv::jacobian(func = f.resid, x = coef.fin)

# Robust SEs use the 'sandwich' estimator:
# BREAD * MEAT * BREAD
# MEAT = gT * diag(resid^2) * g
# BREAD = [ gT*g ]^{-1}
# this is a 'plug-in' estimator where we use our estimates to get 
#     sample analogues of the things we want (similar to avg as sample version of mean)

# bc data are independent, the diag version is correct
# if not independent, then we could use the fully dense (resid %*% t(resid))
meat <- t(resid.grad) %*% diag(resid^2) %*% resid.grad

# solve(A) = A^{-1}
# E[gTg] is the covariance of g
bread <- solve(t(resid.grad) %*% resid.grad)

# Calculate the sandwich
sandwich <- bread %*% meat %*% bread

# Calculate robust standard errors
robust_se <- sqrt(diag(sandwich))
# 
tidy(m1.robust)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     1.01    0.0382      26.3 5.89e-148
2 x1             -3.98    0.0312    -127.  0        
3 x2              1.99    0.0155     129.  0        
print(round(robust_se,4))
[1] 0.0382 0.0312 0.0154

Future work

I may come back and update this to include clustering.