Over at the Development Impact Blog, Berk warns researchers to “beware of studies with a small number of clusters” (link), and raises the worry that we really don’t have good tools for assessing power for cluster randomized studies when the number of clusters is rather small. Berk’s general message is absolutely right. I agree that the available tools are imperfect. Nonetheless, there are better and worse ways to go about thinking through power for a cluster randomized study. Below are some thoughts on this based on my understanding (please correct or comment below!):

First, let’s lay down some fundamental concepts. Getting correct confidence intervals is hard even in the simplest scenarios—e.g., without even considering clustering. Let’s look under the hood for a second to see what is going on in a simple case. The “correct” confidence interval for an experimental treatment effect depends on a few things: the estimand, the population that is the target for your inferences, and the type of variability you are trying to measure. For example, maybe the estimand is the difference in means over treatment and control, the target of your inferences is the specific set of subjects over which treatment was randomized, and the type of variability is that which is induced by the random assignment. That is, you are just trying to make inferences about treatment effects for the subjects that were observed as part of the treatment and control groups. This is perhaps the most basic inferential scenario for an experiment, covering experiments with convenience samples or cases where there was no possibility of random sampling from a larger population. As it happens, the variance for the difference in means in this case is easy to express. Unfortunately, the expression is, generally, unidentified, as it requires knowledge of potential outcome covariances. So right off the bat, we’re in the land of approximation and we haven’t even yet gotten to the question of a reference distribution to use to do hypothesis tests or construct a confidence interval. The good thing is that we can get a conservative approximation of the variance using the usual (heteroskedastic) sampling distribution for difference in means.
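To make the "unidentified but conservatively approximated" point concrete, here is a minimal sketch in Python. It assumes a toy setting where both potential outcomes are known for every subject, which never holds in practice; the function names are hypothetical and chosen only for illustration. The exact randomization variance includes a term that depends on the variance of unit-level effects, while the usual unequal-variance formula simply drops that term.

```python
import numpy as np

def neyman_variance_estimate(y_treat, y_ctrl):
    """Conservative variance estimate for the difference in means,
    computed from observed outcomes only (unequal-variance formula)."""
    y_treat = np.asarray(y_treat, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    return y_treat.var(ddof=1) / y_treat.size + y_ctrl.var(ddof=1) / y_ctrl.size

def true_randomization_variance(y1, y0, n_treat):
    """Exact randomization variance of the difference in means, assuming
    (hypothetically) that both potential outcomes are known for everyone."""
    y1 = np.asarray(y1, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    n = y1.size
    n_ctrl = n - n_treat
    # This term depends on the covariance between y1 and y0, which is never
    # observed; that is why the exact expression is not identified.
    var_effects = (y1 - y0).var(ddof=1)
    return y1.var(ddof=1) / n_treat + y0.var(ddof=1) / n_ctrl - var_effects / n

# Toy illustration with heterogeneous effects (simulated, not real data).
rng = np.random.default_rng(0)
y0 = rng.normal(size=50)
y1 = y0 + rng.normal(loc=1.0, size=50)
exact = true_randomization_variance(y1, y0, n_treat=25)
# Expected value of the observed-data estimator over random assignments.
bound = y1.var(ddof=1) / 25 + y0.var(ddof=1) / 25
print(exact <= bound)
```

When effects are constant across subjects, the dropped term is zero and the two quantities coincide; otherwise the observed-data formula overstates the variance on average.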

Moving ahead, we know the asymptotic distribution for the difference in means is normal. (Well, this is actually something that was only proved recently by David Freedman in a 2008 paper, but I digress. See here: link). But we’re not in asymptopia. So, why not try to fatten the tails of our reference distribution, say, by applying the result that the finite sample distribution for the difference in means with normal data is t? Our underlying data aren’t normal, but a little tail fattening can’t hurt, right? (Though it may not “help” enough…)

Or wait, maybe some resampling method is available, e.g., a permutation test? Well, this works for testing the “sharp null hypothesis,” but standard permutation test procedures are not valid, generally, for generating confidence intervals otherwise. In fact they are more anti-conservative than the approximate method using the t distribution. So we stick with our kludgy variance approximation and use it against a t distribution that hopefully does enough to correct for the fact that we aren’t in asymptopia. Et voila. It’s ugly, but probably good enough for government work. (Some colleagues and I are actually looking into new permutation methods to see if we can do better than what current methods allow. I’ll update once we figure out how well it works.)
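For reference, a standard permutation test of the sharp null (no effect for any unit) can be sketched as follows. This is a plain Monte Carlo version with the difference in means as the test statistic; the function name and defaults are hypothetical, and it is not the new method alluded to above.

```python
import numpy as np

def permutation_p_value(y, z, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the sharp null of no
    effect on any unit, using the difference in means as the statistic."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=bool)
    observed = y[z].mean() - y[~z].mean()
    count = 0
    for _ in range(n_perm):
        # Under the sharp null, outcomes are fixed and only labels move.
        z_perm = rng.permutation(z)
        diff = y[z_perm].mean() - y[~z_perm].mean()
        if abs(diff) >= abs(observed) - 1e-12:
            count += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (count + 1) / (n_perm + 1)

# Toy example: six treated units with clearly higher outcomes.
y = np.array([10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5], dtype=float)
z = np.array([True] * 6 + [False] * 6)
print(permutation_p_value(y, z))
```

The p-value is exact (up to Monte Carlo error) for the sharp null only; it says nothing direct about a confidence interval for an average effect when effects vary across units.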

Suppose now that instead of a convenience sample, your sample was drawn as a random sample from the target population. Then, a happy coincidence is that the variance approximation described above is unbiased for the true randomization-plus-sampling variance. We still need a reference distribution; for lack of a better alternative, we could use the same t distribution as a tail-fattened finite sample approximation to the asymptotic distribution. An alternative would be to use a bootstrap. But the validity of the bootstrap procedure depends on how well the sample represents the population. While the bootstrap is unbiased over repeated samples, for any given sample, the expected divergence from representing the target population depends on sample size, alas. For that reason, bootstrap confidence intervals can be larger or smaller than their analytically derived counterparts; the two will tend to agree as the sample size grows larger.

So even in these simple cases, without clustering even entering the picture, we’ve got approximations going on all over the place. These approximations are what we feed into our power analysis calculators, hoping for the best.

Clustering doesn’t fundamentally change the situation. We just need to take a few extra details into account. First, the relevant type of variability now has to take into account randomization plus whether we have sampling of clusters, sampling within clusters, or both. In a manner that is similar to what we saw above, conventional (heteroskedastic, cluster-robust) variance expressions coincide with the type of variability associated with randomization plus sampling of and within clusters, making it a conservative approximation to the variance when one or the other kind of sampling is not relevant.

Second, we need to appreciate that the effective sample size from a cluster randomized study is somewhere between the number of units of observation and the number of clusters. The location of the effective sample size between these bookends is a function of intra-class correlation of the outcomes. Working with the exact expression is cumbersome, and so most power calculation packages use a kludge that assumes constant intraclass correlation over treatment and control outcomes (a conservative way to do this is to assume it is equal to the larger of the two). Under random sampling of and within clusters, this gives rise to the classic design effect expression, 1+(m-1)rho, where m is the average cluster size and rho is the intra-class correlation. (This is what Stata’s sampclus applies.) However, while sampclus accounts for the higher variance with the design effect, it does not change the reference distribution. For few clusters, this is probably anti-conservative: the fattening of the tails of our reference distribution ought to take into account these issues of effective sample size. Stata’s kludge is to fatten tails using t(C-1), where C is the number of clusters. (Cameron, Gelbach, and Miller (2008; see here or my Quant II files linked at top right) consider t(C-k) with k being the number of regressors; if you look at their Table 4, this kludge actually performs as well as any of the bootstrap procedures, even with only 10 clusters.) So ideally, you’d want to adjust sampclus results on that basis too (a point that Berk makes in his blog post). It should be an easy thing to program for anyone enterprising and with some time on his or her hands, or you can fudge it by tinkering with the alpha and power levels. I think Optimal Design accounts for degrees of freedom in a better way while using the same design effect adjustment, but as far as I know, the results are only defined for balanced designs. 
With balanced designs, many of these problems actually disappear completely (see my 2012 paper with Peter Aronow in SPL: PDF), but results for balanced designs are often anti-conservative for non-balanced designs.
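The design effect and the degrees-of-freedom adjustment can be sketched in a few lines. This is a rough illustration, not a reimplementation of sampclus or Optimal Design: the function names are hypothetical, equal cluster sizes are assumed, and the "multiplier" is one commonly used approximation in which the minimum detectable effect is the standard error times the sum of two quantiles.

```python
from scipy import stats

def design_effect(m, rho):
    """Classic design effect 1 + (m - 1) * rho for average cluster size m."""
    return 1.0 + (m - 1.0) * rho

def effective_sample_size(n_clusters, m, rho):
    """Total observations deflated by the design effect."""
    return n_clusters * m / design_effect(m, rho)

def mde_multiplier(alpha=0.05, power=0.8, df=None):
    """Approximate multiple of the standard error that is detectable with
    the given power; normal quantiles if df is None, else t quantiles."""
    if df is None:
        return stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    return stats.t.ppf(1 - alpha / 2, df) + stats.t.ppf(power, df)

# 10 clusters of 20 units: the effective sample size runs from 200 (rho = 0)
# down to the number of clusters (rho = 1).
for rho in (0.0, 0.1, 1.0):
    print(rho, effective_sample_size(10, 20, rho))

# Tail fattening with t(C - 1) versus the normal reference distribution.
print(mde_multiplier(), mde_multiplier(df=10 - 1))
```

With few clusters the t-based multiplier is noticeably larger than the normal one, which is the sense in which ignoring the reference distribution is anti-conservative.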

The same logic as above would also apply when considering the bootstrap with clustered data: its accuracy depends on how well the sample of clusters represents the target population of clusters. So, as with the above, bootstrap confidence intervals may be larger or smaller than the analytically derived ones. Berk mentioned the wild bootstrap procedure discussed in Cameron, Gelbach and Miller (2008), implemented in a Stata .ado file from Judson Caskey’s website (link). If you play with the example data from the helpfile of that .ado, you will find that wild bootstrap confidence intervals are indeed narrower in some cases and for some variables than what you get from “reg …, cluster(…).”

One could test all of these ideas via simulation, and I think this is an underused complement to analytical power analyses implemented by, e.g., sampsi and sampclus. That is, if you have data from a study or census that resembles what you expect to encounter in your planned field experiment, you should simulate the sampling and treatment assignment and assess the true variability of effect estimates over repeated samples and assignments, adjusting the sample sizes. You can also assess the validity of analytical or bootstrap alternatives against the simulated benchmark.
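A bare-bones version of such a simulation might look like the sketch below. Since no real census or pilot data are at hand here, it assumes a synthetic normal random-effects model with unit total variance and equal cluster sizes; in practice you would resample from data resembling your field setting instead. The analysis step compares cluster means and uses t(C-2) as the reference distribution, which is one choice among several. The function name and defaults are hypothetical.

```python
import numpy as np
from scipy import stats

def simulate_cluster_power(n_clusters, cluster_size, rho, effect,
                           n_sims=2000, alpha=0.05, seed=0):
    """Share of simulated cluster-randomized experiments in which a test
    on cluster means rejects the null of no effect."""
    rng = np.random.default_rng(seed)
    n_treat = n_clusters // 2
    crit = stats.t.ppf(1 - alpha / 2, df=n_clusters - 2)
    rejections = 0
    for _ in range(n_sims):
        # Assumed data-generating process: cluster plus unit-level noise,
        # with variances rho and 1 - rho so that the ICC equals rho.
        cluster_part = rng.normal(0.0, np.sqrt(rho), size=n_clusters)
        unit_part = rng.normal(0.0, np.sqrt(1.0 - rho),
                               size=(n_clusters, cluster_size))
        treated = np.zeros(n_clusters, dtype=bool)
        treated[rng.choice(n_clusters, size=n_treat, replace=False)] = True
        y = cluster_part[:, None] + unit_part + effect * treated[:, None]
        means = y.mean(axis=1)
        diff = means[treated].mean() - means[~treated].mean()
        se = np.sqrt(means[treated].var(ddof=1) / n_treat
                     + means[~treated].var(ddof=1) / (n_clusters - n_treat))
        if abs(diff) > crit * se:
            rejections += 1
    return rejections / n_sims

# Rejection rate under no effect (should sit near alpha) and under an
# effect of one standard deviation, with 10 clusters of 20 units.
print(simulate_cluster_power(10, 20, rho=0.1, effect=0.0))
print(simulate_cluster_power(10, 20, rho=0.1, effect=1.0))
```

Running the function over a grid of cluster counts and sizes gives a simulated power curve that can be set against whatever an analytical calculator reports.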

(For those interested in more details on variance estimation and inference, see the syllabus, slides, and readings for my Quant II and Quant Field Methods courses linked at top right.)

Cyrus, your post’s very interesting & thoughtful. I like your closing suggestions, e.g. “You can also assess the validity of analytical or bootstrap alternatives against the simulated benchmark.” Coincidentally, this week I’ve been writing an example using simulation as a first check on the coverage of sandwich-based confidence intervals (not in a clustered or power context, but for the empirical example in my resubmit), and I’d be interested in suggestions for improvement when I have it drafted.

I’m only a dilettante re the cluster and bootstrap literatures, but I have a few comments.

You make a good point about limitations of bootstrapping. As an extreme scenario, suppose the outcome is Medicaid expenditures, we randomly assign 100 individuals to treatment and 100 to control, and there’s one outlier at $1,000,000 in the control group. Bootstrapping will never give us a sample where the outlier’s in the treatment group, but permutation methods will. That may be an advantage of the new permutation methods you’re looking into.

But under certain assumptions (e.g. in Horowitz, “The bootstrap”, Handbook of Econometrics vol. 5), bootstrap-t confidence intervals achieve an asymptotic refinement over CIs derived from the sandwich SE and the standard normal distribution. I don’t know if anyone has shown this in a finite-population randomization setting, but my guess is that some analog of the i.i.d. results would hold. This suggests that when the potential outcomes don’t have outliers or heavy tails, there’ll often be some middle range of sample sizes where bootstrap-t is more robust than just multiplying the sandwich SE by a critical value.

If one wants to use a sandwich SE with a t distribution instead of the normal, something like Welch’s unequal-variances two-sample t-test might be better than t(C-k). But with a very skewed outcome, bootstrap-t may outperform Welch.

There are other adjustments proposed in a Kauermann-Carroll paper (cited in Mostly Harmless Econometrics), but I don’t know the details.

On your finding that “wild bootstrap confidence intervals are indeed narrower in some cases and for some variables”, is there any evidence that they undercover?

The literature’s focus on the wild bootstrap seems to come from results that say it’s superior to the nonparametric bootstrap when the regression model correctly specifies E(Y|X). Kline and Santos have a simulation where the nonparametric bootstrap performs best under misspecification (but the wild bootstrap still does better than the normal approximation). This may be relevant to covariate adjustment in experiments.

Thanks, Winston. Great feedback as always. Some reactions:

Cyrus, I agree with your bullets and I’ll just add some more background here.

On t(C-k) vs. a Welch-like correction:

t(C-k) mimics t(n-k), which gives the right amount of tail fattening for the conventional SE when there’s no clustering and the normal, homoskedastic linear regression model is true. But sandwich SEs have more variability than conventional SEs, so they need more tail fattening. Welch gives approximately the right amount of tail fattening for the HC2 sandwich (Neyman) SE with a normally distributed outcome. A lower bound for the Welch df is min(n1-1, n2-1), which is sometimes used as a conservative approximation. An upper bound is n1 + n2 – 2. Thus, for a difference-in-means analysis without clustering, Welch is more conservative than using t(n-k) with the HC2 SE.
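The Welch degrees of freedom and the two bounds mentioned above are easy to check numerically. The sketch below uses the usual Welch-Satterthwaite formula; the function name is hypothetical.

```python
import numpy as np

def welch_df(y1, y2):
    """Welch-Satterthwaite approximate degrees of freedom for the
    unequal-variances two-sample t-test."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n1, n2 = y1.size, y2.size
    v1 = y1.var(ddof=1) / n1  # squared standard error of the first mean
    v2 = y2.var(ddof=1) / n2  # squared standard error of the second mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Equal sizes and equal sample variances hit the upper bound n1 + n2 - 2.
print(welch_df([1, 2, 3, 4], [5, 6, 7, 8]))

# When one group contributes all of the variance, the df falls to that
# group's n - 1, here the lower bound min(n1 - 1, n2 - 1).
print(welch_df([1, 2, 3], [5, 5, 5, 5, 5]))
```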

Stonehouse & Forrester wrote a very nice, little-known simulation study of Welch and other tests (drawing independent samples from two populations). (Stonehouse was an entomologist.)

On bootstrapping and convenience samples:

I may have been going out on a limb on studentized bootstraps and asymptotic refinement, but I’ll sketch an argument to justify the (non-studentized) nonparametric bootstrap SE under random assignment of a large convenience sample. (Of course our discussion is about not-so-large samples, but I want to argue that our justifications for sandwich SEs apply equally well to nonparametric bootstrap SEs, so superpopulation inference isn’t needed to justify the bootstrap.)

The sandwich SE is consistent or asymptotically conservative in this setting, from Freedman (2008, Advances in Applied Math) on the Neyman SE in the difference-in-means case, and from my paper in the OLS-adjusted case (with or without interactions). These papers also show consistency and asymptotic normality of the point estimates of average treatment effect. So as n goes to infinity, the coverage probability of a nominal 95% confidence interval based on the sandwich SE converges to a limit greater than or equal to 95%.
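The coverage claim can be checked by simulation in a toy finite population. The sketch below assumes simulated potential outcomes with heterogeneous effects (not real data), re-randomizes treatment many times, and counts how often the nominal 95% interval based on the Neyman SE and the normal critical value covers the true average effect. The function name is hypothetical.

```python
import numpy as np

def neyman_ci_coverage(y1, y0, n_treat, n_sims=2000, z_crit=1.96, seed=0):
    """Coverage of the nominal 95% CI (Neyman SE, normal critical value)
    over repeated random assignments of a fixed finite population."""
    rng = np.random.default_rng(seed)
    y1 = np.asarray(y1, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    n = y1.size
    true_ate = (y1 - y0).mean()
    covered = 0
    for _ in range(n_sims):
        treated = np.zeros(n, dtype=bool)
        treated[rng.choice(n, size=n_treat, replace=False)] = True
        # Only one potential outcome per unit is observed.
        y_t, y_c = y1[treated], y0[~treated]
        est = y_t.mean() - y_c.mean()
        se = np.sqrt(y_t.var(ddof=1) / y_t.size + y_c.var(ddof=1) / y_c.size)
        if abs(est - true_ate) <= z_crit * se:
            covered += 1
    return covered / n_sims

# Assumed population: 200 units with effects that vary across units.
rng = np.random.default_rng(1)
y0 = rng.normal(size=200)
y1 = y0 + rng.normal(loc=1.0, scale=1.0, size=200)
print(neyman_ci_coverage(y1, y0, n_treat=100))
```

With heterogeneous effects the simulated coverage tends to sit at or above the nominal level, in line with the "consistent or asymptotically conservative" result.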

The same result should hold for the nonparametric bootstrap SE, because it’s first-order asymptotically equivalent to the sandwich SE.

Relationships between bootstrap, jackknife, and sandwich SEs are noted in the books by Efron, Efron & Tibshirani, and Davison & Hinkley and in Lancaster’s paper. But here’s a simple argument:

Mostly Harmless Econometrics derives the asymptotic covariance matrix of OLS (formula 3.1.7, p. 45) assuming the data vector’s i.i.d. but without assuming the regression model’s true. They don’t state the regularity conditions, but it’s sufficient to assume the population has finite 4th moments and a design matrix of full rank, as stated in Freedman, “Bootstrapping regression models”, p. 1219 (Annals of Stats, 1981) and Chamberlain, “Multivariate regression models for panel data”, pp. 17-19 (J. of Econometrics, 1982).

In nonparametric bootstrapping, the resampled data vector (y*, x*) is i.i.d., the population has finite 4th moments (since y* and x* are bounded by the original sample extrema), and the design matrix has full rank in the population (if it did in the original sample). Therefore, if both the sample size and the no. of bootstrap replications are large enough, the following are approximately equal:

(1) the nonparametric bootstrap estimate of the covariance matrix of OLS

(2) the covariance matrix of the bootstrapped OLS estimate, conditional on the original data [i.e., (1) with an infinite no. of bootstrap replications]

(3) MHE’s formula 3.1.7 applied to the data-generating process for (y*, x*)

But (3) is just the HC0 sandwich estimator.
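The approximate equality can be seen numerically. The sketch below, on simulated heteroskedastic data (an assumption for illustration), computes the HC0 sandwich covariance of OLS and the covariance of OLS estimates over nonparametric (pairs) bootstrap resamples. The function names are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def hc0_cov(X, y):
    """HC0 sandwich covariance: bread @ meat @ bread."""
    resid = y - X @ ols(X, y)
    bread = np.linalg.inv(X.T @ X)
    meat = (X * resid[:, None] ** 2).T @ X
    return bread @ meat @ bread

def bootstrap_cov(X, y, n_boot=2000, seed=0):
    """Covariance of OLS estimates over resamples of whole rows (y, x)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        draws[b] = ols(X[idx], y[idx])
    return np.cov(draws, rowvar=False)

# Simulated experiment: binary treatment, larger noise in the treated arm.
rng = np.random.default_rng(2)
n = 500
z = (np.arange(n) % 2 == 0).astype(float)
X = np.column_stack([np.ones(n), z])
y = 1.0 + 0.5 * z + rng.normal(size=n) * (1.0 + z)

se_hc0 = np.sqrt(hc0_cov(X, y)[1, 1])
se_boot = np.sqrt(bootstrap_cov(X, y)[1, 1])
print(se_hc0, se_boot)
```

The two standard errors for the treatment coefficient should agree closely at this sample size, with the remaining gap reflecting finite n and a finite number of bootstrap replications.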

(Lancaster thinks results like this are useful for intuition about bootstraps. I think they’re also useful for intuition about conventional and sandwich SEs. E.g., in a blocked experiment, nonparametric bootstrap resampling doesn’t preserve the balance that blocking achieves, so it doesn’t capture the precision gains from blocking [although if we regress on block dummies and treatment x block interactions, those gains can be ignored in large samples, as Miratrix, Sekhon, and Yu show]. The same should be true of the sandwich SE.)