Over at the Development Impact Blog, Berk warns researchers to “beware of studies with a small number of clusters” (link), and raises the worry that we really don’t have good tools for assessing power for cluster randomized studies when the number of clusters is rather small. Berk’s general message is absolutely right. I agree that the available tools are imperfect. Nonetheless, there are better and worse ways to go about thinking through power for a cluster randomized study. Below are some thoughts on this based on my understanding (please correct or comment below!):
First, let’s lay down some fundamental concepts. Getting correct confidence intervals is hard even in the simplest scenarios—e.g., without even considering clustering. Let’s look under the hood for a second to see what is going on in a simple case. The “correct” confidence interval for an experimental treatment effect depends on a few things: the estimand, the population that is the target for your inferences, and the type of variability you are trying to measure. For example, maybe the estimand is the difference in means over treatment and control, the target of your inferences is the specific set of subjects over which treatment was randomized, and the type of variability is that which is induced by the random assignment. That is, you are just trying to make inferences about treatment effects for the subjects that were observed as part of the treatment and control groups. This is perhaps the most basic inferential scenario for an experiment, covering experiments with convenience samples or cases where there was no possibility of random sampling from a larger population. As it happens, the variance for the difference in means in this case is easy to express. Unfortunately, the expression is, generally, unidentified, as it requires knowledge of potential outcome covariances. So right off the bat, we’re in the land of approximation, and we haven’t even gotten to the question of which reference distribution to use for hypothesis tests or confidence intervals. The good thing is that we can get a conservative approximation of the variance using the usual (heteroskedasticity-robust) sampling variance for the difference in means.
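To make this concrete, here is the standard Neyman-style expression (written in my notation, not anything quoted from Berk’s post). For the difference-in-means estimator $\hat{\tau} = \bar{Y}_1 - \bar{Y}_0$ with $n_1$ treated and $n_0$ control units out of $n$ total, the randomization variance is

$$\mathrm{Var}(\hat{\tau}) \;=\; \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_{\tau}^2}{n},$$

where $S_1^2$ and $S_0^2$ are the finite-population variances of the treated and control potential outcomes and $S_{\tau}^2$ is the variance of the unit-level effects $Y_i(1)-Y_i(0)$. The last term is the unidentified piece (we never observe both potential outcomes for the same unit), and dropping it yields the familiar conservative estimator $s_1^2/n_1 + s_0^2/n_0$.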
Moving ahead, we know the asymptotic distribution for the difference in means is normal. (Well, this is actually something that was only proved recently by David Freedman in a 2008 paper, but I digress. See here: link). But we’re not in asymptopia. So, why not try to fatten the tails of our reference distribution, say, by applying the result that, with normal data, the studentized difference in means has a t distribution in finite samples? Our underlying data aren’t normal, but a little tail fattening can’t hurt, right? (Though it may not “help” enough…)
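Concretely, the tail-fattened interval just swaps the normal critical value for a t critical value:

$$\hat{\tau} \;\pm\; t_{\nu,\,1-\alpha/2}\,\sqrt{\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}},$$

with $\nu$ set to something like $n_1 + n_0 - 2$ or a Welch–Satterthwaite approximation (the choice of $\nu$ here is my shorthand for “some degrees-of-freedom kludge,” not a claim about what any particular package does).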
Or wait, maybe some resampling method is available—e.g., a permutation test? Well, this works for testing the “sharp null hypothesis,” but standard permutation test procedures are not valid, generally, for generating confidence intervals otherwise. In fact they are more anti-conservative than the approximate method using the t distribution. So we stick with our kludgy variance approximation and use it against a t distribution that hopefully does enough to correct for the fact that we aren’t in asymptopia. Et voilà. It’s ugly, but probably good enough for government work. (Some colleagues and I are actually looking into new permutation methods to see if we can do better than what current methods allow. I’ll update once we figure out how well it works.)
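For readers who want to see the mechanics of the sharp-null test, here is a minimal sketch in Python (the function, variable names, and toy data are mine and purely illustrative):

```python
import numpy as np

def permutation_test_sharp_null(y, d, n_perm=10_000, seed=1):
    """Randomization test of the sharp null H0: Y_i(1) = Y_i(0) for all i.

    Under the sharp null the observed outcomes are fixed regardless of
    assignment, so we can re-randomize the treatment labels and recompute
    the difference in means to build the reference distribution.
    """
    rng = np.random.default_rng(seed)
    y, d = np.asarray(y, dtype=float), np.asarray(d)
    obs = y[d == 1].mean() - y[d == 0].mean()
    exceed = 0
    for _ in range(n_perm):
        d_star = rng.permutation(d)  # preserves the number treated
        stat = y[d_star == 1].mean() - y[d_star == 0].mean()
        exceed += abs(stat) >= abs(obs)
    return obs, exceed / n_perm      # estimate and two-sided p-value

# toy example with made-up data
rng = np.random.default_rng(0)
d = rng.permutation(np.repeat([0, 1], 25))
y = 0.3 * d + rng.normal(size=50)
print(permutation_test_sharp_null(y, d))
```

The p-value is exact for the sharp null; the trouble described above only starts when you try to invert such tests into confidence intervals for average effects.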
Suppose now that instead of a convenience sample, your sample was drawn as a random sample from the target population. Then, a happy coincidence is that the variance approximation described above is unbiased for the true randomization-plus-sampling variance. We still need a reference distribution; for lack of a better alternative, we could use the same t distribution as a tail-fattened finite sample approximation to the asymptotic distribution. An alternative would be to use a bootstrap. But the validity of the bootstrap procedure depends on how well the sample represents the population. While the bootstrap is unbiased over repeated samples, for any given sample the expected divergence between that sample and the target population depends on the sample size, alas. For that reason, bootstrap confidence intervals can be larger or smaller than their analytically derived counterparts; the two will tend to agree as the sample size grows larger.
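As an illustration of the bootstrap alternative, here is a bare-bones percentile bootstrap for the difference in means, resampling within each arm (a sketch under the assumption of random sampling; the function and its defaults are mine):

```python
import numpy as np

def bootstrap_ci_diff_means(y1, y0, n_boot=5_000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the difference in means, resampling the
    treated and control samples separately with replacement."""
    rng = np.random.default_rng(seed)
    y1, y0 = np.asarray(y1, dtype=float), np.asarray(y0, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        diffs[b] = (rng.choice(y1, size=y1.size, replace=True).mean()
                    - rng.choice(y0, size=y0.size, replace=True).mean())
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return y1.mean() - y0.mean(), (lower, upper)
```

Whether this beats the analytic interval in any given sample is exactly the issue raised above: it depends on how well the observed sample stands in for the population.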
So even in these simple cases, without clustering even entering the picture, we’ve got approximations going on all over the place. These approximations are what we feed into our power analysis calculators, hoping for the best.
Clustering doesn’t fundamentally change the situation. We just need to take a few extra details into account. First, the relevant type of variability now has to take into account randomization plus whether we have sampling of clusters, sampling within clusters, or both. In a manner that is similar to what we saw above, conventional (heteroskedastic, cluster-robust) variance expressions coincide with the type of variability associated with randomization plus sampling of and within clusters, making them conservative approximations to the variance when one or the other kind of sampling is not relevant.
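For reference, the conventional cluster-robust (“sandwich”) variance estimator I have in mind for OLS is

$$\widehat{V}_{\mathrm{CR}}(\hat{\beta}) \;=\; (X'X)^{-1}\!\left(\sum_{g=1}^{C} X_g'\hat{u}_g\hat{u}_g'X_g\right)\!(X'X)^{-1},$$

where $g$ indexes the $C$ clusters and $X_g$ and $\hat{u}_g$ stack the regressors and residuals for cluster $g$ (this is the textbook expression, written in my notation).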
Second, we need to appreciate that the effective sample size from a cluster randomized study is somewhere between the number of units of observation and the number of clusters. The location of the effective sample size between these bookends is a function of the intra-class correlation of the outcomes. Working with the exact expression is cumbersome, and so most power calculation packages use a kludge that assumes constant intra-class correlation over treatment and control outcomes (a conservative way to do this is to assume it is equal to the larger of the two). Under random sampling of and within clusters, this gives rise to the classic design effect expression, 1+(m-1)rho, where m is the average cluster size and rho is the intra-class correlation. (This is what Stata’s sampclus applies.) However, while sampclus accounts for the higher variance with the design effect, it does not change the reference distribution. For few clusters, this is probably anti-conservative: the fattening of the tails of our reference distribution ought to take into account these issues of effective sample size. Stata’s kludge is to fatten tails using t(C-1), where C is the number of clusters. (Cameron, Gelbach, and Miller (2008; see here or my Quant II files linked at top right) consider t(C-k), with k being the number of regressors; if you look at their Table 4, this kludge actually performs as well as any of the bootstrap procedures, even with only 10 clusters.) So ideally, you’d want to adjust sampclus results on that basis too (a point that Berk makes in his blog post). It should be an easy thing to program for anyone enterprising and with some time on his or her hands, or you can fudge it by tinkering with the alpha and power levels. I think Optimal Design accounts for degrees of freedom in a better way while using the same design effect adjustment, but as far as I know, the results are only defined for balanced designs. With balanced designs, many of these problems actually disappear completely (see my 2012 paper with Peter Aronow in SPL: PDF), but results for balanced designs are often anti-conservative for non-balanced designs.
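To see how the design effect and the t(C-k) degrees-of-freedom adjustment fit together, here is a back-of-the-envelope power calculation in Python. It is only a sketch of the logic described above (equal cluster sizes, a common ICC, a standardized effect size); the function and its defaults are mine and are not meant to reproduce sampclus or Optimal Design:

```python
import numpy as np
from scipy import stats

def cluster_power(n_clusters, m, rho, effect_size, alpha=0.05):
    """Approximate power for a two-arm cluster-randomized comparison of means.

    Inflates the variance by the design effect 1 + (m - 1) * rho and uses a
    t reference distribution with C - 2 degrees of freedom instead of the
    normal, in the spirit of the t(C - k) adjustment. effect_size is the
    standardized difference in means.
    """
    deff = 1 + (m - 1) * rho            # classic design effect
    n_total = n_clusters * m            # total units across both arms
    se = np.sqrt(deff * 4.0 / n_total)  # SE of standardized diff, equal arms
    df = n_clusters - 2                 # intercept plus treatment indicator
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ncp = effect_size / se              # noncentrality parameter
    # two-sided power under the noncentral t
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# e.g., 10 clusters of 30 units, ICC of 0.05, standardized effect of 0.5
print(round(cluster_power(10, 30, 0.05, 0.5), 3))
```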
The same logic as above would also apply when considering the bootstrap with clustered data: its accuracy depends on how well the sample of clusters represents the target population of clusters. So, as with the above, bootstrap confidence intervals may be larger or smaller than the analytically derived ones. Berk mentioned the wild bootstrap procedure discussed in Cameron, Gelbach, and Miller (2008), implemented in a Stata .ado file from Judson Caskey’s website (link). If you play with the example data from the helpfile of that .ado, you will find that wild bootstrap confidence intervals are indeed narrower in some cases and for some variables than what you get from “reg …, cluster(…).”
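For those curious about what the wild cluster bootstrap actually does, here is a stripped-down bootstrap-t sketch with the null imposed and cluster-level Rademacher weights, following the general recipe in Cameron, Gelbach, and Miller (2008). The code is mine, is only illustrative, and is not a reimplementation of the .ado file:

```python
import numpy as np
import statsmodels.api as sm

def wild_cluster_boot_p(y, d, cluster, n_boot=999, seed=1):
    """Wild cluster bootstrap-t p-value for H0: the coefficient on d is zero.

    Imposes the null when generating bootstrap outcomes and perturbs the
    restricted residuals with cluster-level Rademacher (+1/-1) weights.
    """
    rng = np.random.default_rng(seed)
    y, d, cluster = map(np.asarray, (y, d, cluster))
    X = sm.add_constant(d)

    # observed cluster-robust t statistic for the coefficient on d
    fit = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": cluster})
    t_obs = fit.tvalues[1]

    # restricted model imposing H0 (intercept only), and its residuals
    y_bar = y.mean()
    u_restricted = y - y_bar
    cluster_ids = np.unique(cluster)

    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=cluster_ids.size)  # Rademacher draws
        w_obs = w[np.searchsorted(cluster_ids, cluster)]    # map to rows
        y_star = y_bar + w_obs * u_restricted
        fit_b = sm.OLS(y_star, X).fit(cov_type="cluster",
                                      cov_kwds={"groups": cluster})
        t_boot[b] = fit_b.tvalues[1]

    return float((np.abs(t_boot) >= abs(t_obs)).mean())
```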
One could test all of these ideas via simulation, and I think this is an underused complement to analytical power analyses implemented by, e.g., sampsi and sampclus. That is, if you have data from a study or census that resembles what you expect to encounter in your planned field experiment, you should simulate the sampling and treatment assignment and assess the true variability of effect estimates over repeated samples and assignments, varying the sample sizes to see how power changes. You can also assess the validity of analytical or bootstrap alternatives against the simulated benchmark.
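Here is the kind of simulation I have in mind, in skeletal form. The parametric data-generating process is only a stand-in for draws from your own pilot or census data, and every parameter value below is a placeholder:

```python
import numpy as np
from scipy import stats

def simulate_power(n_clusters, m, icc, effect, n_sims=2_000, alpha=0.05, seed=1):
    """Monte Carlo power for a cluster-randomized difference in means.

    Each iteration draws cluster effects and unit-level noise (total variance 1,
    intra-class correlation icc), randomizes half the clusters to treatment,
    collapses to cluster means (equivalent to the difference in means with equal
    cluster sizes), and tests against a t(C - 2) reference distribution.
    """
    rng = np.random.default_rng(seed)
    t_crit = stats.t.ppf(1 - alpha / 2, n_clusters - 2)
    rejections = 0
    for _ in range(n_sims):
        cluster_effects = rng.normal(0.0, np.sqrt(icc), n_clusters)
        noise = rng.normal(0.0, np.sqrt(1 - icc), (n_clusters, m))
        treat = np.zeros(n_clusters)
        treat[rng.permutation(n_clusters)[: n_clusters // 2]] = 1
        y = cluster_effects[:, None] + noise + effect * treat[:, None]

        ybar = y.mean(axis=1)  # cluster means
        diff = ybar[treat == 1].mean() - ybar[treat == 0].mean()
        se = np.sqrt(ybar[treat == 1].var(ddof=1) / (treat == 1).sum()
                     + ybar[treat == 0].var(ddof=1) / (treat == 0).sum())
        rejections += abs(diff / se) > t_crit
    return rejections / n_sims

# e.g., 10 clusters of 30, ICC of 0.05, standardized effect of 0.5
print(simulate_power(10, 30, 0.05, 0.5))
```

Swapping in your own data would mean replacing the two rng.normal() draws with resampling from the pilot or census data; the assignment and testing steps stay the same.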
(For those interested in more details on variance estimation and inference, see the syllabus, slides, and readings for my Quant II and Quant Field Methods courses linked at top right.)