Imbens and Kolesar have a new NBER working paper, “Robust standard errors in small samples: Some practical advice”: gated link. They propose a new “default” choice for standard errors and confidence intervals to accompany OLS estimates of treatment effects, based on recommendations of Bell and McCaffrey (2002): ungated link. Their ideas are consistent with a previous blog post here, and actually discuss explicitly some of the points that were raised in the comments: link (see especially the comments section). In fact, many of the points that Imbens and Kolesar make are anticipated in a recent paper by Lin, who discusses the performance of OLS-based methods in estimating treatment effects in randomized experiments: ungated link. (See also the discussion of the paper at the Development Impact Blog: link)

The “robust inference for treatment effects” problem can be broken down into two components: (i) getting good standard errors for your point estimates of the treatment effect and (ii) relating these to a reasonable approximation of the sampling (or randomization) distribution of the point estimate to construct hypothesis tests or confidence intervals. Standard practice today for robust inference is captured by Stata’s default under the “, robust” option (or, for cluster-robust, the “, cluster(id)” option). For the non-clustered inference scenario, these defaults use (i) White’s heteroskedasticity consistent covariance estimator to obtain the standard errors, and then (ii) approximate the sampling distribution of the point estimates using t(n-k), where n is the number of data points and k the number of regressors. While White’s estimator is consistent, in finite samples the bias may be pronounced, a point that White himself developed in MacKinnon and White (1985): ungated link. In that 1985 paper, MacKinnon and White introduced the leverage-adjusted HC2 estimator that removes the finite sample bias. It is precisely this HC2 estimator that Imbens and Kolesar recommend as the default that people should use. Incidentally, Peter and I have a paper from last year that also demonstrates that HC2 provides a robust and conservative approximation to the exact randomization variance for unit-randomized experiments, and we also recommend its use: ungated link.

What about component (ii), an approximation of the sampling (or randomization) distribution? The t(n-k) approximation is from an analogy to the case of homoskedastic normal errors, where t(n-k) is in fact the exact sampling distribution. As discussed in the previous blog post (referenced above), the hope is that this gives an adequate amount of “tail fattening” to the reference distribution for the non-homoskedastic case. However, as Imbens and Kolesar demonstrate in a simple example, this approximation is obviously bad when the treatment variable is highly skewed. Consider a case where we have n1 treated, with n1 very large, and n0 control, with n0 very small. We estimate the treatment effect by regressing outcomes on a constant and treatment dummy, and so the treatment effect estimate is equivalent to taking the difference in treated and control means. Then, the treated mean will be highly precise, with a sampling/randomization variance of about 0. The estimate of the control mean will be very imprecise. The sampling/randomization variance of the treatment effect estimate will be driven almost exclusively by the variability in the control mean. The appropriate degrees of freedom for the approximate sampling/randomization distribution is probably closer to n0-1 rather than n-2 = n1 + n0 – 2. This could make a big difference. This is an old problem, and a classical way to deal with it is to use Welch’s approximation to the degrees of freedom for the sampling distribution (link). Welch’s approximation addresses the problems that arise due to skew. (Lin studies Welch-based approxomations in the simulations in the paper linked above, and he also explained it in the comments to the blog post linked above.) The problem with Welch’s degrees-of-freedom approximation is that it relies on estimates of the conditional error variances, which can be quite imprecise in finite samples. This is where Bell and McCaffrey come in. They propose to use a Welch approximation to the degrees of freedom that assumes homoskedasticity, allowing one to avoid having to plug in estimates of the conditional error variances. Of course the homoskedasticity assumption is typically false, but using it for the degrees of freedom approximation at least partially handles the skew problem without introducing new volatility problems. It stands as a reasonable compromise. Simulation studies in the Bell and McCaffrey paper as well as in the Imbens and Kolsar paper shows that it performs well (even outperforming bootstrap methods, such as the wild-t bootstrap, at least as the latter is conventionally applied).

The two papers develop these ideas for more general regression scenarios, including the clustered inference scenario. As for practice, estimating HC2 is easy (it is already and option with the “, robust” command in Stata, and in R it should be simple to program). I don’t know if the Bell-McCaffrey degrees of freedom approximations are pre-canned, but the expressions are not so complicated.

Winston LinHi Cyrus,

I read the paper while half-listening to the Obama-Romney debate. I think it’s a very valuable and helpfully written contribution, following up on a comment Guido made in footnote 17 of “Better LATE Than Nothing”. (Thanks for your kind mention of my paper, but I only scratched the surface!) It appears to be a preliminary draft (e.g., 4.1 describes the cluster simulation setup but doesn’t discuss the results), so I’m sure the next version will be even more helpful and informative. My comments below are meant not as a critique, but just to share some half-baked thoughts on what to make of their results.

For the goal of one default confidence interval method for OLS, their advice seems wise (and almost certainly an improvement over usual practice).

All their simulations use normally distributed error terms, a best-case scenario for the Welch-related methods. From what little I know of the older literature, my impression is that Welch is somewhat robust to nonnormality but may fail with extremely skewed distributions. One example of skewness is when Y is binary and P(Y=1) is far from 0.5. Some people wouldn’t have a strong interest in that example, since they wouldn’t be using OLS with a binary outcome (e.g., see Guido’s comment on Angrist 2001). But it’s relevant to people who work on randomized experiments or who (like Angrist, you, and me) are occasional or habitual OLS sympathizers even for observational studies with binary outcomes (i.e., we think the form of the left-hand side of the model has been overemphasized and the form of the right-hand side has been underemphasized).

For binary outcomes, the methods recommended by Brown & Li might be a better place to start than Welch. One of the simpler methods is Agresti and Caffo’s, which adds four pseudo-observations (one for each combination of Y = 1 or 0 and T = 1 or 0). But I don’t know this literature well at all and don’t have a suggestion for extending it to include covariates.

In Table 1, the undercoverage of the bootstrap CIs may not be surprising given that there are only 3 observations in the treatment group. It would be interesting to see versions of this simulation where the sample sizes (or at least the treatment group sample size) are a little bigger and the outcome is binary or very skewed. Then the bootstrap might outperform the Welch and BM CIs. (Bootstrapping with HC1 might outperform bootstrapping with HC2 and HC3, as MacKinnon finds. Also, for randomized experiments with covariate adjustment, I’d be interested in trying a two-sample nonparametric bootstrap, i.e. resampling with fixed N0 and N1, as an alternative to the wild bootstrap. But I’m going off on a tangent.)

Again, I think it’s an excellent paper. I’m pretty sure Imbens & Kolesar are familiar with the issues I’ve mentioned, and I wouldn’t expect these to be addressed in the first version of a working paper.

Finally, a couple computing notes:

I had the impression that Stata’s default for “, robust” is to use the HC1 SE with the normal distribution. But my info could be out of date.

There are a few R packages that implement HC2, HC3, etc. I’ve used the vcovHC() method in “sandwich”.