At the International Initiative for Impact Evaluation (3ie) conference in Cuernavaca this past week (link), Guido Imbens from Harvard University gave two lectures on standards for the design and analysis of randomized control trials (RCTs). I thought it was worth a rather lengthy post to describe what he covered. These are very important insights, and only a portion of what he discussed is posted to the web. The first lecture drew mostly from his book manuscript with Donald Rubin, while a paper associated with the second lecture is here (~~link~~ updated link). The summary below is based on my understanding of his lectures.

Imbens’s first lecture focused on analyzing data from RCTs using randomization-based methods. The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis (what Imbens referred to as “testing”) along with (2) a Neyman-type point estimate of the sample average treatment effect and confidence interval (what Imbens referred to as “estimation”). (See these past posts for some discussion of related points: Fisher style testing and Neyman style estimation.) Interestingly, and in a way that runs contrary to Rosenbaum’s proposed method of analysis (link), Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them. (This is a point that Peter Aronow and I have developed formally in a paper that hopefully will be published soon.) Thus, Imbens’s suggestion was that a rank-based test was a good choice for null hypothesis *testing*, owing to its insensitivity to outliers and relative power (i.e., its Pitman efficiency), but that *estimation* should be based on sample theoretic (Neyman-type) principles. In most practical cases, ordinary least squares (OLS) regression with robust standard errors produces estimates that are, in fact, justified on sample theoretic grounds (even if “randomization does not justify regression assumptions”, as Freedman famously noted).
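As a rough illustration of this two-part analysis, here is a minimal Python sketch. It is not code from the lectures: the function names, the Monte Carlo approximation to the permutation distribution, the normal-approximation 95% interval, and the toy data are all assumptions made for the example, and it assumes a completely randomized design with a fixed number of treated units.

```python
import math
import random
import statistics


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def rank_permutation_test(y, z, draws=5000, seed=0):
    """Monte Carlo two-sided p-value for the sharp null of no effect for any unit.

    Test statistic: difference in average ranks, treated minus control.
    """
    rng = random.Random(seed)
    ranks = average_ranks(y)

    def stat(assign):
        t = [r for r, a in zip(ranks, assign) if a == 1]
        c = [r for r, a in zip(ranks, assign) if a == 0]
        return statistics.fmean(t) - statistics.fmean(c)

    observed = abs(stat(z))
    shuffled = list(z)
    extreme = 0
    for _ in range(draws):
        rng.shuffle(shuffled)  # one re-draw from complete randomization
        if abs(stat(shuffled)) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (draws + 1)


def neyman_estimate(y, z):
    """Difference in means, conservative Neyman standard error, and a 95% interval."""
    t = [v for v, a in zip(y, z) if a == 1]
    c = [v for v, a in zip(y, z) if a == 0]
    est = statistics.fmean(t) - statistics.fmean(c)
    se = math.sqrt(statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
    return est, se, (est - 1.96 * se, est + 1.96 * se)


# Toy data, made up for illustration only.
y = [1, 2, 3, 4, 5, 6, 7, 8]
z = [0, 0, 0, 0, 1, 1, 1, 1]
print("testing:", rank_permutation_test(y, z))
print("estimation:", neyman_estimate(y, z))
```

The two functions are deliberately kept separate: the p-value comes from ranks and the sharp null, while the point estimate and interval come from the difference in means and its estimated sampling variance.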

Imbens devoted a good amount of time in the first lecture to special challenges in analyzing cluster randomized experiments. Cluster randomized trials present analysts with a choice between estimating a treatment effect defined at the level of individual units versus a treatment effect defined at the cluster level. Typically it is the former (unit-level) treatment effect that interests us. However, unbiased estimation is complicated by the fact that cluster-level assignment, combined with variation in cluster sizes, implies variation in unit-level treatment assignment propensities. (NB: What matters is the size of the cluster in the *population*, not the size of the sample from the cluster. Only under extremely artificial circumstances would cluster sizes ever be equal, so these problems are pretty much *always* relevant in cluster randomized trials.) One may use weights to account for these differences in assignment propensities, but this introduces other issues: simply weighting by relative cluster size introduces a scale invariance problem. Normalizing the weights removes this problem but introduces a ratio estimation problem that creates finite sample bias. (The ratio estimation problem arises because the normalization depends on the average cluster sizes in the treated and control groups, which are random.) However, the bias is typically small in moderate or large samples. The cluster-level analysis has none of these problems, since aggregating to the cluster level results in a simple randomized experiment. Given these issues, Imbens’s recommendation was to do both a cluster-level analysis and a unit-level analysis, with testing and power analysis focused on the cluster-level analysis, and estimation carried out at both levels, using normalized weights for the unit-level analysis. (The issue has also received a nice, thorough treatment by Aronow and Middleton in a paper presented at MPSA last spring (link). They come up with other randomization-based strategies for analyzing cluster randomized trials.)
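To make the three estimators in this paragraph concrete, here is a small Python sketch. The function names and numbers are made up for illustration, and the demonstration of the invariance problem (adding a constant to every outcome) is one reading of the issue rather than a reconstruction of the lecture’s own example.

```python
import statistics


def cluster_level_estimate(cluster_means, z):
    """Difference in simple averages of cluster means (cluster-level effect)."""
    t = [m for m, a in zip(cluster_means, z) if a == 1]
    c = [m for m, a in zip(cluster_means, z) if a == 0]
    return statistics.fmean(t) - statistics.fmean(c)


def unit_level_unnormalized(cluster_means, sizes, z):
    """Weight each cluster by its size relative to the overall mean cluster size."""
    mean_size = statistics.fmean(sizes)
    weighted = [m * n / mean_size for m, n in zip(cluster_means, sizes)]
    t = [v for v, a in zip(weighted, z) if a == 1]
    c = [v for v, a in zip(weighted, z) if a == 0]
    return statistics.fmean(t) - statistics.fmean(c)


def unit_level_normalized(cluster_means, sizes, z):
    """Size-weighted means with weights normalized within each arm (a ratio estimator)."""
    def arm_mean(arm):
        num = sum(m * n for m, n, a in zip(cluster_means, sizes, z) if a == arm)
        den = sum(n for n, a in zip(sizes, z) if a == arm)  # random: depends on the draw
        return num / den
    return arm_mean(1) - arm_mean(0)


# Hypothetical data: four clusters of unequal size, two assigned to treatment.
means = [2.0, 4.0, 1.0, 3.0]
sizes = [10, 30, 20, 40]
z = [1, 1, 0, 0]
shifted = [m + 10 for m in means]  # add the same constant to every outcome

print("cluster level:", cluster_level_estimate(means, z))
print("unnormalized:", unit_level_unnormalized(means, sizes, z),
      "after shift:", unit_level_unnormalized(shifted, sizes, z))
print("normalized:", unit_level_normalized(means, sizes, z),
      "after shift:", unit_level_normalized(shifted, sizes, z))
```

In this toy example the unnormalized estimate moves when every outcome is shifted by the same constant, because the treated and control arms happen to contain clusters of different average size. The normalized estimate does not move, at the cost of dividing by a random quantity.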

The second lecture was on design considerations, focusing primarily on (1) optimal design for cluster randomized trials and (2) approaches to “re-randomization” and generalized pre-treatment covariate balancing. On the design of cluster randomized trials, Imbens reviewed an ongoing debate in the statistical literature, in which (1) Snedecor/Cochran and Box/Hunter/Hunter have proposed that in cluster RCTs with small or moderate numbers of clusters, we might have good reason to avoid stratification or pair matching; (2) Donner/Klar have proposed that some stratification is almost always useful but that pair matching creates too many analytic problems to be worth it; and, most recently, (3) Imai/King/Nall have proposed that pair matching should always be done when possible. Imbens ultimately comes out on the side of Donner/Klar. His take is that the very existence of the debate is based largely on the confusion between testing and estimation: stratification (with pair matching as the limit) *always* makes estimation more efficient, but under some situations may result in reduced power for testing. This may seem paradoxical, but remember that power is a function of how a design affects the variability in point estimates (efficiency, or $latex V(\hat \beta)$) *relative* to the effects of the design on the variability of the *estimates of the variance of point estimates* (that is, $latex V[\hat V(\hat \beta)]$). This is apparent when you think of power in terms of the probability of rejecting the null using a t-statistic, for which the point estimate is in the numerator and the square root of the variance estimate is in the denominator. Noise in both the numerator and the denominator affects power. Even though stratification will always reduce the variability of the point estimate, in some cases the *estimate* of the variance of this point estimate can become unstable, undermining power.

That being the case, clear and tractable instances under which stratification leads to a loss in power arise only in very rigidly defined scenarios. Imbens demonstrated a case with constant treatment effects, homoskedasticity, all normal data, and an uninformative covariate in which a t-test loses power from stratification. But loosening any of these conditions led to ambiguous results, as did the replacement of the t-test with permutation tests. These pathological cases are not a reliable basis for selecting a design. Rather, a more reliable basis is the fact that stratification always improves efficiency. The upshot is that some pre-treatment stratification (i.e., blocking) is always recommended, and this is true whether or not the trial is cluster randomized or unit randomized.

The question then becomes, how much stratification? Here, Imbens disagreed with Imai/King/Nall by proposing that pair matching introduces more analytical problems than it’s worth relative to lower-order stratification. The problem with pair matching is that the within-pair variance for the estimated treatment effect is unidentified (you need at least two treated and two control units to identify the estimated treatment effect variance). Thus, one must use the least upper bound identified by the data, which is driven by the between-pair variance. This leads to an overconservative estimator that tosses out efficiency gains from pairing. It also prevents examination of heterogeneity that may be of interest. Thus, Imbens’s recommendation was stratification up to at least two treated and two control units or clusters per stratum.
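The identification point can be seen in a short Python sketch of a stratified (blocked) difference-in-means estimator. The function name, the weighting of strata by their share of the sample, and the data are assumptions made for the example. The within-stratum variance terms are only computable when each stratum has at least two treated and two control units, which is exactly what a matched-pair design lacks.

```python
import math
import statistics


def blocked_estimate(strata):
    """strata: list of (treated_outcomes, control_outcomes), one pair per stratum.

    Returns the size-weighted difference in means and its Neyman standard error.
    Raises ValueError when a stratum has fewer than two units in either arm,
    because the within-stratum sample variance is then undefined.
    """
    total = sum(len(t) + len(c) for t, c in strata)
    est = 0.0
    var = 0.0
    for t, c in strata:
        if len(t) < 2 or len(c) < 2:
            raise ValueError("need at least two treated and two control units per stratum")
        share = (len(t) + len(c)) / total
        est += share * (statistics.fmean(t) - statistics.fmean(c))
        var += share ** 2 * (statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
    return est, math.sqrt(var)


# Hypothetical outcomes: two strata, each with two treated and two control units.
print(blocked_estimate([([5, 7], [2, 4]), ([10, 12], [9, 9])]))
```

Passing matched pairs, for example `[([5], [2]), ([10], [9])]`, raises the `ValueError`: with one treated and one control unit per stratum there is no within-stratum variation in either arm to estimate from.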

(I asked Imbens how he would propose to carry out the stratification, especially when data are of mixed types and sparseness was a potential problem. His recommendation was dimension reduction. Thus, one could use predicted baseline values from a rich covariate model, clustering algorithms constrained by baseline values, or slightly more flexible clustering. The goal is to reduce the stratification problem to one or a few manageable dimensions.)

The lecture on design also covered some new (and rather inchoate) ideas on “re-randomization” and generalized covariate balancing. The problem that he was addressing was the situation where you perform a randomization, but then note problematic imbalances in the resulting experiment. Should you rerandomize? If so, what should be the principles to guide you? Imbens’s take was that, yes, you should rerandomize, but that it should be done in a systematic manner. The process should (1) define, a priori, rejection rules for assignments, (2) construct the set of all possible randomizations (or an arbitrarily large sample of all possible randomizations), (3) apply the rejection rule, and then (4) perform the actual treatment assignment by randomizing within this set of acceptable randomizations. This type of “restricted randomization” typically uses covariate balance criteria to form the rejection rules; as such, it generalizes the covariate balancing objective that typically motivates stratification.
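The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Imbens’s procedure: it uses a single covariate, an absolute standardized difference in means as the balance measure, and a threshold and candidate count chosen arbitrarily. All names are hypothetical.

```python
import random
import statistics


def imbalance(x, z):
    """Absolute standardized difference in covariate means between arms."""
    t = [v for v, a in zip(x, z) if a == 1]
    c = [v for v, a in zip(x, z) if a == 0]
    return abs(statistics.fmean(t) - statistics.fmean(c)) / statistics.stdev(x)


def restricted_randomization(x, n_treated, threshold, candidates=2000, seed=0):
    """Draw one assignment uniformly from the acceptable candidate assignments."""
    rng = random.Random(seed)
    base = [1] * n_treated + [0] * (len(x) - n_treated)

    # Step 2: build a large sample of possible randomizations.
    pool = []
    for _ in range(candidates):
        z = base[:]
        rng.shuffle(z)
        pool.append(tuple(z))

    # Step 3: apply the rejection rule, which was fixed before looking at any draw (step 1).
    acceptable = sorted({z for z in pool if imbalance(x, z) <= threshold})
    if not acceptable:
        raise ValueError("no candidate assignment satisfies the rejection rule")

    # Step 4: randomize within the acceptable set.
    return list(rng.choice(acceptable)), acceptable


# Hypothetical baseline covariate for ten units, five to be treated.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assignment, acceptable = restricted_randomization(x, n_treated=5, threshold=0.2)
print(assignment, len(acceptable))
```

The function returns the acceptable set along with the chosen assignment so that one can see how many assignments survive the rule. A very strict threshold shrinks that set, and a permutation test run over a tiny set has very few possible p-values.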

Some complications arise however. There is no clear basis on which one should define a minimum balance criterion. One could seek to maximize the balance that would obtain in an experiment. Doing so, one would effectively partition the sample into two groups that are maximally balanced, and randomly select one of the two partitions to be the treatment group. But in this case, the experiment would reduce to a cluster randomized experiment on only two clusters! There would be a strong external validity cost, and testing based on permutation would be degenerate. I asked Imbens about this, and his recommendation was that the rejection rules used to perform restricted randomization should maintain a “reasonable” degree of randomness in the assignment process. I think the right solution here is still an open question.

Imbens’s lectures are part of 3ie’s efforts toward standardization in randomized impact evaluations—standardization that greatly facilitates systematic reviews and meta-analyses of interventions, which is another of 3ie’s core activities (link). For those involved in RCTs and field experiments in development, I highly recommend that you engage 3ie’s work.

Thanks very much for this post; it’s quite helpful. You said that confidence intervals derived from inverted tests need strong assumptions to have accurate coverage. Would you mind explaining this a bit or providing a link to an explanation? Is it a finite sample property? Thanks again.

Thanks for the comment, Peter. It’s not a finite sample property. The idea is this: inverted hypothesis test “confidence regions” are typically based on a hypothesis of constant effects. But effects are usually heterogeneous. Therefore, the constant effects assumption can produce an inverted test confidence region that does not correspond to the sampling distribution of the effect estimator, in which case it will not have coverage properties that allow one to interpret it as a “confidence interval” based on our usual definition. An example is when one uses the difference in means as the test statistic. Then, in large samples, the resulting confidence region converges to the confidence interval that would result from assuming homoskedasticity, and in finite samples would be smaller than the homoskedasticity confidence interval. We know that generally, the homoskedasticity assumption does not produce appropriate confidence intervals.
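For readers who have not seen test inversion, here is a minimal Python sketch of how such a region is built with the difference in means as the test statistic. The function names, the grid, and the data are made up for illustration. The key line is the adjustment `v - tau0 * a`, which is where the constant-effect hypothesis enters: every treated unit is assumed to have exactly the same effect `tau0`.

```python
import itertools


def exact_p_value(y, z):
    """Exact two-sided permutation p-value for the sharp null, difference in means."""
    n, m = len(y), sum(z)
    total = sum(y)

    def diff(treated_sum):
        return treated_sum / m - (total - treated_sum) / (n - m)

    observed = abs(diff(sum(v for v, a in zip(y, z) if a == 1)))
    count = hits = 0
    for idx in itertools.combinations(range(n), m):  # every possible treated set
        count += 1
        if abs(diff(sum(y[i] for i in idx))) >= observed - 1e-12:
            hits += 1
    return hits / count


def inverted_test_region(y, z, grid, alpha=0.05):
    """Values tau0 of a constant additive effect that the test does not reject."""
    kept = []
    for tau0 in grid:
        # Control outcomes implied by the hypothesis that every effect equals tau0.
        adjusted = [v - tau0 * a for v, a in zip(y, z)]
        if exact_p_value(adjusted, z) > alpha:
            kept.append(tau0)
    return kept


# Hypothetical data: five treated units followed by five control units.
y = [3.1, 4.0, 5.2, 6.1, 7.3, 1.0, 2.2, 2.9, 4.1, 5.0]
z = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
grid = [k / 10 for k in range(-20, 61)]
region = inverted_test_region(y, z, grid)
print(min(region), max(region))
```

Nothing in this construction refers to the sampling variance of the difference in means under heterogeneous effects, which is why the resulting region need not behave like a confidence interval for the average effect.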

Thanks so much for the post! I must say that after the two lectures I felt far less confident about using cluster randomized trials.

Thanks for the post and the thoughtful comments. I have a couple of clarifications regarding my lectures. First, about the distinction between estimation and testing. What I was arguing against was the common practice of testing by estimating the quantity of interest (e.g., the average effect) and then constructing a confidence interval for that. I made the claim that in many cases that approach may have less power than randomization-based tests, and may rely on additional assumptions beyond those they require (with the additional power coming from using ranks). It puzzles me, and I think ultimately it is just tradition, that in economics people don’t use the rank-based randomization tests more often. Of course this is not an original thought from me; I learned about the randomization tests largely from Paul Rosenbaum’s work. Paul goes beyond that and likes to use randomization inference also for estimation. I think that is very interesting (I wrote a small paper on that with Paul at some point), but as a general matter I do think that approach has some disadvantages. The main one is that it relies on sharp null hypotheses, and so in practice you get confidence intervals that assume constant treatment effects. With data where that is demonstrably wrong (e.g., the variances in the control and treatment groups differ), the confidence intervals may have very poor properties when viewed as confidence intervals for the average effect.

Re Gareth’s comment: I think in many cases you don’t have a choice; you can only randomize at the cluster level. There are interesting challenges in inference, and more work is needed, but the problems are not insurmountable.

Many thanks, Guido (if I may!), for your comment.

Regarding the issue of using randomization tests (whether rank based or not): I typically see randomization tests presented as tests of the “sharp null” hypothesis of no effect. As I understand though, traditionally, our t-tests (and z-tests) are tests of the hypothesis that the parameter, beta, is 0. If beta is the estimated difference in means, then the test that beta=0 is consistent with the sharp null, but it is also consistent with hypotheses that admit unit-specific effects so long as these effects cancel out on average. If the choice were between randomization-based tests of the sharp null and traditional t-tests of the hypothesis that the difference in means is zero, then I am okay with people using the traditional t-test when the latter hypothesis of “zero average effect” is of interest. Often, this is indeed the hypothesis of policy relevance.

We could in principle construct a randomization-based test of the hypothesis that beta=0. The problem is that the set of hypothetical potential outcome profiles consistent with this possibility is vast (e.g., sharp null for all units but one treated and one control, and then arbitrary offsetting treatment effects for those two, etc.). I guess the test would require making some assumptions about which profiles are more or less likely. I think Rosenbaum mentions this at some point in *Observational Studies*, but I don’t recall if or how he resolves it. So, I don’t see this as more appealing than the traditional t-test for beta=0.

A separate matter is whether it makes sense to use the difference in means or some other statistic to test the hypothesis that effects are zero on average. For example, one could look at the difference in average ranks, using the sampling distribution of ranks (again, I think Rosenbaum explores this) to construct the test. But the difference in average ranks strikes me as a quantity that rarely makes intuitive sense for policy makers. So, again, I am left thinking that the t-test is quite appealing.

Cyrus, let me expand on my comments a bit. First of all I completely agree with you that the null hypothesis of an average effect of zero is substantively more interesting than the sharp null hypothesis of no effect whatsoever. Imagine informing a policy maker that a program has some effect on the outcome of interest, but not being able to tell the policy maker whether the effect is positive or negative, only that there is some effect for at least some individuals. Typically a policy maker would not be interested in such a finding. For that reason I would not advocate doing a randomization-based test using the Kolmogorov-Smirnov statistic (the maximum of the absolute value of the difference in empirical distribution functions between treated and controls). Finding that the KS statistic leads to a significant p-value would not be interesting in general.

Why then do I think it is useful to do the randomization test using average ranks as the statistic instead of doing a t-test? I think rather than being interested very specifically in the question whether the average effect differs from zero, one is typically interested in the question whether there is evidence that there is a positive (or negative) effect of the treatment. That is a little vague, but more general than simply a non-zero average effect. If we can’t tell whether the average effect differs from zero, but we can be confident that the lower tail of the distribution moves up, that would be informative. I think this vaguer null is well captured by looking at the difference in average ranks: do the treated have higher ranks on average than the controls? I would interpret that as implying that the treated have typically higher outcomes than the controls (not necessarily on average, but typically).

I think this discussion confuses models with test statistics. While they may often appear the same (e.g. we use a difference of means test statistic to summarize our information while testing hypotheses about the population mean), there is no reason that they need be the same. The obvious example is the Wilcoxon-Mann-Whitney test. The model is a location shift, but the statistic is a difference of ranks.

Reporting a difference of ranks I agree is probably not a useful quantity, but reporting a confidence interval for the parameter of location shift is not the same as reporting the rank statistic.

As Rosenbaum (2010) points out, our choice of test statistic is largely about power against specific forms of alternatives. Differences of means have high power on normal data, but can perform poorly compared to ranks on heavy-tailed data. A KS test may perform well on other data or against other alternative hypotheses. Of course, there is also the computational aspect of selecting a test statistic. Some admit nice closed-form solutions or Normal approximations.

Jake Bowers and I are working on tools and recommendations for even more flexible models (e.g. interference over fixed networks). You can see some demonstrations of pairing different models of effect with different test statistics in the test code:

https://github.com/markmfredrickson/RItools/tree/randomization-distribution/tests

Thanks for the comment, Mark. I am surprised to hear it though, because the discussion and comments above try to be very clear about the important difference between models and test statistics (so there should hopefully be no confusion). For example, the suggestion to use “a Fisher-type permutation test of the sharp null hypothesis” is clearly distinguished from “a Neyman-type point estimate of the sample average treatment effect and confidence interval.” In this analysis plan, the model used to estimate the treatment effect is sample theoretic, difference in means based, and due to Neyman. The test statistic used to test the sharp null hypothesis is permutation based, uses average ranks, and is due to Fisher. So, one can read into this that different types of statistics might be useful for different purposes. Difference in means is not necessarily a good starting place for testing the sharp null; average ranks is not necessarily a good starting place for estimating confidence intervals. Or were you referring to something else?

Cyrus, thanks for your helpful post and stimulating this discussion. I agree with the gist of your comments and just want to add a thought.

The alternative tests that Guido and Mark suggest do sound useful, but let’s step back and ask why we should be interested in hypothesis tests in the first place. The point I’ll make is probably familiar to everyone; David Aldous puts it pithily in the first 2 paragraphs of his mini-review of Ziliak and McCloskey.

As Aldous says, the main value of a test is “to prevent you from jumping to conclusions based on too little data.” In other words, lack of statistical significance can be a good reason to say “don’t jump up and down,” but significance isn’t a reason to say “jump up and down.” No matter what the p-value is, we need some idea of how big the estimated difference is and how much uncertainty we have about it.

Frequentist confidence intervals for average treatment effects are one attempt to answer that question. (Of course, they only measure one kind of uncertainty.) I agree with Guido and Mark that they miss important questions that the alternative tests get at, and this whole discussion was helpful for me. But a p-value from a rank sum test doesn’t seem useful to me in isolation. If the difference in average ranks isn’t an intuitive metric, then I think we need some other summary (graphical or numerical) and a way to show our uncertainty about that summary (on a policy-relevant scale, not as a p-value). Are there good examples of empirical studies that do this?

Hi Y’all,

Sorry to come in on this discussion so late. I just stumbled across it in one of my rather few forays into the blogosphere. I thought I’d offer one perspective that has, for me, clarified the distinction between the idea of test statistics or data summaries (like differences of means, or ranks etc.) and hypothesis tests and confidence intervals about causal effects.

Consider this model of a multiplicative, non-constant causal relation:

`\begin{equation} r_{i,c}=r_{i,t}/\tau_i \label{eq:causalmodel} \end{equation}`

Say also that we have only two treatments and SUTVA holds (and this is a simple study, a “bear with me while I try to just make up a simple situation for illustration” kind of study). So,

`\begin{equation} R_i=Z_i r_{i,t}+ (1-Z_i)r_{i,c} \label{eq:obsidentity} \end{equation}`

If we could make the distribution of `$R_i$` in the treated group look like the distribution in the control group we would know that we have something to say about the causal effect (i.e. we would have learned something about the counterfactual of interest). This model says that we can do this (turn treated into controls) by adjusting observed outcomes like so:

`\begin{equation} r^*_{i,c}=R_i/(1-Z_i + \tau_i Z_i ) \label{eq:adjoutcomes} \end{equation}`

Now, we would like to ask which values of `$\tau_i$` are conclusions where our data would say “No jumping!”. [I really liked Winston’s discussion about this.] We want an interval, a collection of such values. (Here assume a quantum computer testing the vectors of `$\tau_i$`. 🙂)

Anyway, we could choose a t() that involves summarizing the relation with ranks, or one using mean differences, or something else. But what we are doing inference about is `$\tau$`; and `$\tau$` is meaningful because of our structural potential outcomes model.

For a while I was very confused and tried to think about the actual values of the rank-based test statistic. But, the fact is that, in this workflow, the test statistic itself need not be directly interpretable. What is important is that it reflect on the underlying causal model via the observational identity linking potential outcomes to observed outcomes and design. For example, consider Rosenbaum’s 2002 style covariance adjustment using ranks (or Ben and my slightly different style covariance adjustment using mean differences from our JASA article). The actual values there are really not that useful (??sums of ranks of residuals??). But the p-values arising from those values are useful. The inferences that they provide about the treatment effect and about the model of potential outcomes are useful.
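A rough Python sketch of this workflow follows, under one loud simplification that is not in the comment above: it restricts attention to hypotheses in which `$\tau_i$` is the same value `tau` for every unit, so that a one-dimensional grid can stand in for the quantum computer. The function names and data are hypothetical. The test statistic is a rank sum, but the region that comes out is a set of values for the multiplicative effect.

```python
import itertools


def midranks(values):
    """Ranks with ties sharing their average rank (quadratic time, fine for small n)."""
    return [sum(v > w for w in values) + (sum(v == w for w in values) + 1) / 2
            for v in values]


def rank_sum_p_value(outcomes, z):
    """Exact two-sided p-value; statistic is the treated rank sum, centered at its null mean."""
    ranks = midranks(outcomes)
    n, m = len(z), sum(z)
    center = m * (n + 1) / 2
    observed = abs(sum(r for r, a in zip(ranks, z) if a == 1) - center)
    sums = [sum(ranks[i] for i in idx) for idx in itertools.combinations(range(n), m)]
    return sum(abs(s - center) >= observed - 1e-9 for s in sums) / len(sums)


def multiplicative_region(R, z, grid, alpha=0.05):
    """Hypothesized common multipliers tau that the rank test does not reject."""
    kept = []
    for tau in grid:
        # Adjusted outcomes r*_{i,c} = R_i / (1 - Z_i + tau * Z_i), with tau_i = tau for all i.
        adjusted = [r / (1 - a + tau * a) for r, a in zip(R, z)]
        if rank_sum_p_value(adjusted, z) > alpha:
            kept.append(tau)
    return kept


# Hypothetical data: treated outcomes are roughly double the control outcomes.
R = [2.2, 3.2, 3.8, 5.2, 6.4, 1.0, 1.5, 2.0, 2.5, 3.0]
z = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
grid = [0.5 + 0.25 * k for k in range(19)]
print(multiplicative_region(R, z, grid))
```

The rank sums themselves are never reported. Only the hypothesized values of `tau` that survive the test are, which is the sense in which the statistic need not be interpretable.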

In our JASA article our t() was a sum, and our causal model of binary outcomes (`$\tau_i=r_{i,t}-r_{i,c}$`) very naturally led to a sum as a hypothesis. But this was just because we had binary outcomes and sums of binary outcomes carry as much information as the binary outcomes themselves.

We did check the operating characteristics of our procedure (we used a Normal approximation), but this is pretty easy to do these days. Whether or not a confidence interval contains the hypotheses that you think it does is something that is more or less easy to assess for a given dataset, design, and model (whether or not you are doing statistical inference for causal effects defined by potential outcomes). We learned how to do such assessment from work like Imbens and Rosenbaum 2005. So, while it is possible that a given structural model of effects is a bad one, it ought to be evident from such simulations. And the simulations tell you about the power of your test statistic/covariance adjustment strategy anyway. So, one hopes that sophisticated people will not have to rely on arguments from authority about this kind of thing but can ask such questions about their own particular analyses.

So, to summarize what is implicit and explicit in my post:

(1) The test statistic need not be easily substantively interpretable in order to do statistical inference about meaningful causal quantities (and thus can be chosen with an eye toward statistical power, for example). This doesn’t argue against the use of the average treatment effect. It just ought to clarify confusions about rank-based tests, for example.

(2) Constant additive effects are convenient: methodologists use them to make life easy while making other points. Fisher’s style of analysis does not require them. Rosenbaum’s work has examples of confidence intervals for non-constant, non-additive effects (see for example, Chapter 5 of his 2002 book). In political science Ben and my JASA paper is one example where we have additive but non-constant effects because of binary outcomes. We (and also Imbens and Rosenbaum in 2005) also do inference using random assignment as an IV (which is a non-constant kind of constraint on the kind of hypotheses that one may consider). Fisher’s randomization inference is not identical to assessing constant additive effects models of causal effects.

Finally, let me thank all of y’all for your great comments here! Guido’s idea that we should be thinking about how to combine approaches is really a good one, for example, that could be useful even more broadly. (I’ve played around a bit with Bayesian covariance adjustment for randomization-based inference on causal effects, for example). I like the idea of graphical summaries and I agree that rank-based test statistics are hard to interpret on their own (although the average difference in ranks makes the rank test much easier to interpret as Guido points out). Perhaps the main idea ought to be to focus on the causal parameter of interest (perhaps with graphical displays if the model making it meaningful is complicated). Perhaps an analogy would be to the posterior for some quantity of substantive interest on the scale of interest and the posterior actually sampled from (often logged, or scaled, or otherwise made more tractable)? Or even more simply to coefficients from models with non-linear link functions?