At the International Initiative for Impact Evaluation (3ie) conference in Cuernavaca this past week (link), Guido Imbens from Harvard University gave two lectures on standards for the design and analysis of randomized controlled trials (RCTs). I thought it was worth a rather lengthy post to describe what he covered. These are very important insights, and only a portion of what he discussed is posted to the web. The first lecture drew mostly from his book manuscript with Donald Rubin, while a paper associated with the second lecture is here (updated link). The summary below is based on my understanding of his lectures.
Imbens’s first lecture focused on analyzing data from RCTs using randomization-based methods. The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis (what Imbens referred to as “testing”), along with (2) a Neyman-type point estimate of the sample average treatment effect and confidence interval (what Imbens referred to as “estimation”). (See these past posts for some discussion of related points: Fisher style testing and Neyman style estimation.) Interestingly, and in a way that runs contrary to Rosenbaum’s proposed method of analysis (link), Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I took it as a warning against proposals that use “inverted” tests to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them. (This is a point that Peter Aronow and I have developed formally in a paper that hopefully will be published soon.) Thus, Imbens’s suggestion was that a rank-based test is a good choice for null hypothesis testing, owing to its insensitivity to outliers and relative power (i.e., its Pitman efficiency), but that estimation should be based on sample-theoretic (Neyman-type) principles. In most practical cases, ordinary least squares (OLS) regression with robust standard errors produces estimates that are, in fact, justified on sample-theoretic grounds (even if “randomization does not justify regression assumptions,” as Freedman famously noted).
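To make the testing/estimation distinction concrete, here is a minimal sketch in Python (simulated data; the sample size, effect size, and the simple difference-in-means test statistic are all illustrative assumptions; a rank statistic could be substituted in the permutation test, per Imbens's suggestion):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experiment: n units, half treated, constant effect of 1.0.
n = 100
z = rng.permutation(np.repeat([0, 1], n // 2))
y = 1.0 * z + rng.normal(size=n)

# (1) Fisher-style permutation test of the sharp null (zero effect for every
# unit). Under the sharp null the outcomes are fixed, so we can re-randomize
# the assignment and recompute the statistic.
obs = y[z == 1].mean() - y[z == 0].mean()
perm_stats = np.array([
    y[zp == 1].mean() - y[zp == 0].mean()
    for zp in (rng.permutation(z) for _ in range(5000))
])
p_value = np.mean(np.abs(perm_stats) >= np.abs(obs))

# (2) Neyman-style estimation: difference in means with the conservative
# sample-theoretic variance estimate (sum of within-arm variances over arm sizes).
n1, n0 = (z == 1).sum(), (z == 0).sum()
var_hat = y[z == 1].var(ddof=1) / n1 + y[z == 0].var(ddof=1) / n0
ci = (obs - 1.96 * np.sqrt(var_hat), obs + 1.96 * np.sqrt(var_hat))
```

The two outputs answer different questions: the p-value speaks to whether the treatment had any effect at all, while the estimate and interval describe the sample average effect.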
Imbens devoted a good amount of time in the first lecture to special challenges in analyzing cluster randomized experiments. Cluster randomized trials present analysts with a choice between estimating a treatment effect defined at the level of individual units versus a treatment effect defined at the cluster level. Typically it is the former (unit-level) treatment effect that interests us. However, unbiased estimation is complicated by the fact that cluster-level assignment, combined with variation in cluster sizes, implies variation in unit-level treatment assignment propensities. (NB: What matters is the size of the cluster in the population, not the size of the sample from the cluster. And since cluster sizes would be equal only under extremely artificial circumstances, these problems are relevant in pretty much all cluster randomized trials.) One may use weights to account for these differences in assignment propensities, but this introduces other issues: simply weighting by relative cluster size introduces a scale invariance problem. Normalizing the weights removes this problem but introduces a ratio estimation problem that creates finite sample bias. (The ratio estimation problem arises because the normalization depends on the average cluster sizes in the treated and control groups, which is random.) However, the bias is typically small in moderate or large samples. The cluster-level analysis has none of these problems, since aggregating to the cluster level results in a simple randomized experiment. Given these issues, Imbens’s recommendation was to do both a cluster-level analysis and a unit-level analysis, with testing and power analysis focused on the cluster-level analysis, and estimation carried out at both levels, using normalized weights for the unit-level analysis. (The issue has also received a nice, thorough treatment by Aronow and Middleton in a paper presented at MPSA last spring (link). They come up with other randomization-based strategies for analyzing cluster randomized trials.)
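A sketch of the two analyses (simulated data; the cluster sizes, effect size, and the normalized Hájek-style weighting are illustrative assumptions, not Imbens's exact formulas):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cluster-randomized trial: 10 clusters of unequal size,
# 5 assigned to treatment at the cluster level.
sizes = np.array([10, 20, 30, 40, 50, 15, 25, 35, 45, 55])
treat = rng.permutation(np.repeat([0, 1], 5))
cluster_means = rng.normal(size=10) + 0.5 * treat  # cluster mean outcomes

# Cluster-level analysis: a simple difference in cluster means, which treats
# the aggregated data as a simple randomized experiment.
cluster_est = (cluster_means[treat == 1].mean()
               - cluster_means[treat == 0].mean())

# Unit-level analysis with normalized weights: weight each cluster by its
# size, normalizing within arms. The normalization depends on which clusters
# land in each arm (a random quantity), which is the source of the
# finite-sample ratio-estimation bias noted above.
w1, w0 = sizes[treat == 1], sizes[treat == 0]
unit_est = (np.sum(w1 * cluster_means[treat == 1]) / w1.sum()
            - np.sum(w0 * cluster_means[treat == 0]) / w0.sum())
```

The two estimates target different estimands (cluster-level versus unit-level average effects) and will generally differ whenever cluster size is related to the outcome.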
The second lecture was on design considerations, focusing primarily on (1) optimal design for cluster randomized trials and (2) approaches to “re-randomization” and generalized pre-treatment covariate balancing. On the design of cluster randomized trials, Imbens reviewed an ongoing debate in the statistical literature, in which (1) Snedecor/Cochran and Box/Hunter/Hunter have proposed that in cluster RCTs with small or moderate numbers of clusters, we might have good reason to avoid stratification or pair matching, while (2) Donner/Klar have proposed that some stratification is almost always useful but that pair matching creates too many analytic problems to be worth it, while most recently, (3) Imai/King/Nall have proposed that pair matching should always be done when possible. Imbens ultimately comes out on the side of Donner/Klar. His take is that the very existence of the debate is based largely on the confusion between testing and estimation: stratification (with pair matching as the limit) always makes estimation more efficient, but under some situations may result in reduced power for testing. This may seem paradoxical, but remember that power is a function of how a design affects the variability in point estimates (efficiency, or Var(τ̂)) relative to the effects of the design on the variability of the estimates of the variance of point estimates (that is, Var(V̂), where V̂ is the estimate of Var(τ̂)). This is apparent when you think of power in terms of the probability of rejecting the null using a t-statistic, for which the point estimate is in the numerator and the square root of the variance estimate is in the denominator. Noise in both the numerator and the denominator affects power. Even though stratification will always reduce the variability of the point estimate, in some cases the estimate of the variance of this point estimate can become unstable, undermining power.
That being the case, clear and tractable instances in which stratification leads to a loss in power arise only in very rigidly defined scenarios. Imbens demonstrated a case with constant treatment effects, homoskedasticity, all-normal data, and an uninformative covariate in which a t-test loses power from stratification. But loosening any of these conditions led to ambiguous results, as did replacing the t-test with permutation tests. These pathological cases are not a reliable basis for selecting a design. Rather, a more reliable basis is the fact that stratification always improves efficiency. The upshot is that some pre-treatment stratification (i.e., blocking) is always recommended, whether the trial is cluster randomized or unit randomized.
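The pathological case can be illustrated by simulation. In the sketch below (an assumed setup with illustrative parameter values, not Imbens's exact demonstration), units are paired on pure noise: the pairing buys no variance reduction, but the paired t-test has fewer degrees of freedom, so it rejects less often than the unpaired test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Assumed setup: constant treatment effect, homoskedastic normal outcomes,
# and pairing on a completely uninformative covariate (pairs are arbitrary).
reps, pairs, tau = 4000, 4, 1.5
reject_complete = reject_paired = 0
for _ in range(reps):
    y0 = rng.normal(size=2 * pairs)      # potential outcomes under control
    # Complete randomization: half the units treated.
    z = rng.permutation(np.repeat([0, 1], pairs))
    y = y0 + tau * z
    reject_complete += stats.ttest_ind(y[z == 1], y[z == 0]).pvalue < 0.05
    # Pair-matched randomization on noise: one treated unit per pair.
    zp = np.array([rng.permutation([0, 1]) for _ in range(pairs)]).ravel()
    yp = y0 + tau * zp
    d = yp[zp == 1] - yp[zp == 0]        # within-pair differences, in pair order
    reject_paired += stats.ttest_1samp(d, 0.0).pvalue < 0.05

power_complete = reject_complete / reps
power_paired = reject_paired / reps
```

With an uninformative pairing variable the two estimators have the same standard error, so the power gap comes entirely from the smaller reference distribution degrees of freedom of the paired test.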
The question then becomes, how much stratification? Here, Imbens disagreed with Imai/King/Nall by proposing that pair matching introduces more analytical problems than it is worth relative to lower-order stratification. The problem with pair matching is that the within-pair variance of the estimated treatment effect is unidentified (you need at least two treated and two control units per stratum to identify it). Thus, one must use the least upper bound identified by the data, which is driven by the between-pair variance. This leads to an overconservative estimator that tosses out the efficiency gains from pairing. It also prevents examination of heterogeneity that may be of interest. Thus, Imbens’s recommendation was stratification up to at least two treated and two control units or clusters per stratum.
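With at least two treated and two control units per stratum, the variance is identified within each stratum. A minimal sketch (simulated data; the stratum count, stratum sizes, and effect are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stratified design: S strata, each with 2 treated and 2 control
# units, so each stratum's Neyman variance estimate is computable directly.
S = 20
tau_s, var_s = np.empty(S), np.empty(S)
for s in range(S):
    z = rng.permutation([0, 0, 1, 1])               # 2 treated, 2 control
    y = rng.normal(size=4) + 1.0 * z + 0.5 * s      # stratum-specific level
    y1, y0 = y[z == 1], y[z == 0]
    tau_s[s] = y1.mean() - y0.mean()
    # Conservative within-stratum variance: possible because each arm has 2 units.
    var_s[s] = y1.var(ddof=1) / 2 + y0.var(ddof=1) / 2

w = np.full(S, 1.0 / S)                 # equal-sized strata get equal weight
tau_hat = np.sum(w * tau_s)             # stratified ATE estimate
var_hat = np.sum(w**2 * var_s)          # aggregated variance estimate
```

Under pair matching, the per-stratum variance line above would be impossible (one unit per arm leaves the within-arm variance undefined), which is exactly the identification problem Imbens describes.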
(I asked Imbens how he would propose to carry out the stratification, especially when data are of mixed types and sparseness is a potential problem. His recommendation was dimension reduction. Thus, one could use predicted baseline values from a rich covariate model, clustering algorithms constrained by baseline values, or slightly more flexible clustering. The goal is to reduce the stratification problem to one or a few manageable dimensions.)
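One hypothetical way to implement the predicted-baseline approach: collapse mixed covariates into a single score and form strata of four by sorting on it (the linear index below stands in for a richer covariate model; all coefficients and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical mixed covariates for 40 units.
n = 40
x_cont = rng.normal(size=(n, 3))           # continuous covariates
x_bin = rng.integers(0, 2, size=(n, 2))    # binary covariates

# A one-dimensional "predicted baseline" score (illustrative linear index;
# in practice this would come from a model fit to baseline outcome data).
score = x_cont @ np.array([0.5, -0.3, 0.2]) + x_bin @ np.array([1.0, -0.5])

# Strata of four units each, by sorted score: room for 2 treated + 2 control
# per stratum, matching the recommendation above.
strata = np.empty(n, dtype=int)
strata[np.argsort(score)] = np.arange(n) // 4
```

Units with similar predicted baselines end up in the same stratum, so the single score does the work that matching on all five covariates would otherwise require.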
The lecture on design also covered some new (and rather inchoate) ideas on “re-randomization” and generalized covariate balancing. The problem that he was addressing was the situation where you perform a randomization, but then note problematic imbalances in the resulting experiment. Should you rerandomize? If so, what should be the principles to guide you? Imbens’s take was that, yes, you should rerandomize, but that it should be done in a systematic manner. The process should (1) define, a priori, rejection rules for assignments, (2) construct the set of all possible randomizations (or an arbitrarily large sample of all possible randomizations), (3) apply the rejection rule, and then (4) perform the actual treatment assignment by randomizing within this set of acceptable randomizations. This type of “restricted randomization” typically uses covariate balance criteria to form the rejection rules; as such, it generalizes the covariate balancing objective that typically motivates stratification.
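The four steps can be sketched as follows (simulated covariate; the threshold value and the mean-difference balance criterion are assumed for illustration; a Mahalanobis distance over several covariates would be a natural multivariate alternative):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 30
x = rng.normal(size=n)   # a baseline covariate measured before assignment

# (1) A priori rejection rule: reject any assignment whose absolute
# difference in covariate means across arms exceeds a threshold (assumed).
threshold = 0.1

# (2) Draw a large sample of candidate randomizations; (3) apply the rule.
accepted = []
for _ in range(10000):
    z = rng.permutation(np.repeat([0, 1], n // 2))
    if abs(x[z == 1].mean() - x[z == 0].mean()) < threshold:
        accepted.append(z)

# (4) The actual assignment is a uniform random draw from the accepted set.
z_final = accepted[rng.integers(len(accepted))]
```

Randomization-based inference then proceeds over the accepted set rather than over all possible assignments, which is what makes defining the rejection rule a priori essential.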
Some complications arise, however. There is no clear basis on which to define a minimum balance criterion. One could seek to maximize the balance that would obtain in an experiment. Doing so, one would effectively partition the sample into two groups that are maximally balanced and randomly select one of the two partitions to be the treatment group. But in this case, the experiment would reduce to a cluster randomized experiment on only two clusters! There would be a strong external validity cost, and testing based on permutation would be degenerate. I asked Imbens about this, and his recommendation was that the rejection rules used to perform restricted randomization should maintain a “reasonable” degree of randomness in the assignment process. I think the right solution here is still an open question.
Imbens’s lectures are part of 3ie’s efforts toward standardization in randomized impact evaluations—standardization that greatly facilitates systematic reviews and meta-analyses of interventions, which is another of 3ie’s core activities (link). For those involved in RCTs and field experiments in development, I highly recommend that you engage 3ie’s work.