A post from last week by Jenny Aker at the Savings Revolution blog (link) proposes strategies for rigorous impact assessment when full randomization is not possible. Her third suggestion is a friendly way of saying that regression discontinuity (RD) designs should be used more, not only in analyzing existing interventions but in designing new ones. If quantifiable indices determine who qualifies for program benefits, and if those indices are applied faithfully in actually allocating benefits, then we can use them to carry out an RD analysis. This should appeal to practitioners: it provides a transparent and relatively incorruptible method for beneficiary selection, and it respects the concern that those most in need be most eligible for assistance, while only minimally compromising our ability to estimate program impacts. As methodologists, I think we need to do more to sell this approach in cases where full randomization is not feasible.
A relatively new paper by Papay et al. in the Journal of Econometrics (gated link) demonstrates how to generalize RD analysis to multiple assignment variables with cutoffs in multiple dimensions. The killer graph from the paper is shown above. Here, treatment assignment is based on cutoffs on two variables, labeled X1 and X2 on the graph (the vertical axis is the outcome variable). Cutoffs in two dimensions create four treatment regions, A, B, C, and D. The analysis proceeds by using a regression to model the response surface in each region. You can then obtain predicted values along each of the discontinuity edges; these predictions can be subtracted from each other and aggregated to produce various types of average treatment effects. All of this can happen more or less automatically with a single regression specification, although one should take care to understand how such a regression “averages” the various available treatment effects (I believe it produces a covariance-weighted average rather than a sample-weighted average, along the lines of what Angrist and Pischke discuss in Mostly Harmless…).
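To make the mechanics concrete, here is a minimal simulation sketch (not the authors' code; the data-generating process, cutoffs, and variable names are all hypothetical). It fits a linear response surface on each side of the X1 frontier among units above the X2 cutoff, and averages the predicted jump along that frontier:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
c1, c2 = 5.0, 5.0                        # hypothetical cutoffs
treated = (x1 >= c1) & (x2 >= c2)        # one of the four regions gets treatment
y = 1 + 0.3 * x1 + 0.2 * x2 + 2.0 * treated + rng.normal(0, 1, n)

# Fit a linear response surface on each side of the x1 frontier,
# restricting to units above the x2 cutoff:
def fit_surface(mask):
    X = np.column_stack([np.ones(mask.sum()), x1[mask] - c1, x2[mask] - c2])
    beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return beta

above = x2 >= c2
b_t = fit_surface(above & (x1 >= c1))
b_c = fit_surface(above & (x1 < c1))

# Predicted values along the x1 = c1 frontier, averaged over observed x2,
# give one frontier-specific average treatment effect:
d2 = x2[above] - c2
tau = np.mean((b_t[0] + b_t[2] * d2) - (b_c[0] + b_c[2] * d2))
print(f"effect along the x1 frontier: {tau:.2f}")
```

In practice one would use local fits near the frontier rather than global linear surfaces, but the logic of predicting along the discontinuity edge and aggregating is the same.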
A colleague and I were discussing tests of the identifying assumptions for RD. It seems that there have been some calls to test for “balance” in covariates around cutpoints to assess whether the identifying assumptions are met. The idea of these tests is that in the neighborhood of the cutpoint, covariate distributions should be equal; balance is then tested using the permutation distribution under this null hypothesis. To me, this sounds like imposing more assumptions than an RD design requires. RD requires smoothness in covariates, not balance. The “R” in RD is there for a reason. If balance were a necessity, we should just call it “D”! Covariate means might differ on either side of the cutpoint within arbitrarily small windows without any violation of the smoothness condition. In this case, a balance test would lead one to conclude that identifying conditions are not met when in fact they are (that is, the test would be trigger-happy, rejecting valid designs too often). The direct test for smoothness is a “placebo” regression of the covariate, in which you estimate the existence of a discontinuity (refer to Imbens and Lemieux, gated link). I suppose one could construct a permutation test that also looks for smoothness/discontinuities, but balance tests on adjusted covariates strike me as erroneous.
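Here is a minimal sketch of the placebo regression on simulated data (variable names are mine): the covariate is treated as the outcome and regressed on the assignment variable, a cutoff indicator, and their interaction, so the coefficient on the indicator estimates the (null) discontinuity:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, cutoff = 2000, 0.0
running = rng.normal(0, 1, n)                    # assignment variable
covariate = 0.5 * running + rng.normal(0, 1, n)  # smooth across the cutoff

# Placebo regression: under smoothness, the coefficient on the cutoff
# indicator should be indistinguishable from zero.
above = (running >= cutoff).astype(float)
X = sm.add_constant(np.column_stack([running, above, running * above]))
fit = sm.OLS(covariate, X).fit(cov_type="HC2")
print(f"placebo jump: {fit.params[2]:.3f} (t = {fit.tvalues[2]:.2f})")
```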
Combining information from surveys that use multiple modes of contact
Over at the Statistical Modeling, Causal Inference, and Social Science blog (link), Andrew Gelman writes,
I’m involved (with Irv Garfinkel and others) in a planned survey of New York City residents. It’s hard to reach people in the city–not everyone will answer their mail or phone, and you can’t send an interviewer door-to-door in a locked apartment building. (I think it violates IRB to have a plan of pushing all the buzzers by the entrance and hoping someone will let you in.) So the plan is to use multiple modes, including phone, in person household, random street intercepts and mail.
The question then is how to combine these samples. My suggested approach is to divide the population into poststrata based on various factors (age, ethnicity, family type, housing type, etc.), then to pool responses within each poststratum, then to run some regressions including poststrata and also indicators for mode, to understand how respondents from different modes differ, after controlling for the demographic/geographic adjustments.
Maybe this has already been done and written up somewhere?
It’s interesting to consider this problem by combining a “finite population” perspective with some ideas about “principal strata” from the causal inference literature. Suppose we have a finite population U from which we draw a sample of N units, and two modes of contact, A and B. Suppose for the moment that each unit can be characterized by one of the following response types (these are the “principal strata”):
| Type | Mode A response | Mode B response |
|------|-----------------|-----------------|
| I    | 1               | 1               |
| II   | 1               | 0               |
| III  | 0               | 1               |
| IV   | 0               | 0               |
Then, there are two cases to consider, depending on whether mode of contact affects response:
Mode of contact does not affect response
This might be a valid assumption if the questions of interest are not subject to social desirability biases, interviewer effects, etc. In this case, it is easy to define a target parameter as the average response in the population. You could proceed efficiently by first applying mode A to the sample, and then applying mode B to those who did not respond to mode A. At the end, you would have outcomes for type I, II, and III units, and you’d have an estimate of the rate of type IV units in the population. You could content yourself with an estimate of the average response in the type I, II, and III subpopulation. If you wanted to recover an estimate of the average response for the full population (including type IVs), you would effectively have to impute values for type IV respondents. This could be done by using auxiliary information either to genuinely impute or (in a manner that is pretty much equivalent) to determine which type I, II, or III units resemble the missing type IV units, and up-weight. In any case, if the response of interest has finite support, one could also compute “worst case” (Manski-type) bounds on the average response by imputing maximum and minimum values to type IV units.
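Here is a toy simulation of the sequential design and the worst-case bounds, assuming a binary outcome and hypothetical type shares (all numbers are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
# Hypothetical shares of types I-IV (respond to both modes, A only,
# B only, neither):
types = rng.choice(["I", "II", "III", "IV"], n, p=[0.5, 0.2, 0.2, 0.1])
y = rng.binomial(1, 0.6, n)              # binary outcome, unaffected by mode

# Mode A first, then mode B for mode A's nonrespondents: outcomes are
# observed for types I, II, and III; type IV stays missing.
observed = np.isin(types, ["I", "II", "III"])
p_obs = observed.mean()
mean_obs = y[observed].mean()

# Worst-case (Manski) bounds on the population mean for a [0, 1] outcome:
lower = mean_obs * p_obs                 # impute 0 for all type IV units
upper = mean_obs * p_obs + (1 - p_obs)   # impute 1 for all type IV units
print(f"bounds on the population mean: [{lower:.3f}, {upper:.3f}]")
```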
Mode of contact affects response
This might be relevant if, for example, the modes of contact are a phone call versus a face-to-face interview, and the outcomes being measured vary depending on whether the respondent feels more or less exposed in the interview situation. This possibility makes things a lot trickier. In this case, each unit is characterized by a response under mode A and another under mode B (that is, two potential outcomes). One immediately faces a quandary in defining the target parameter. Is it the average of responses under the two modes of contact? Maybe it is some “latent” response that is imperfectly revealed under the two modes of contact? If so, how can we characterize this “imperfection”? Furthermore, only for type I individuals will you be able to obtain information on both potential responses. Does it make sense to restrict ourselves to this subpopulation? If not, then we would again face the need for imputation. A design that applied both mode A and mode B to the complete sample would mechanically reveal the proportion of type I units in the population, and by implication would identify the proportions of type II, III, and IV units. For type II units we could use mode A responses to improve imputations of their mode B responses, and vice versa for type III units. Type IV units’ contributions to our estimate of the “average response” would be based purely on auxiliary information. Again, one could construct worst-case bounds by imputing maximum and minimum response values for each of the missing response types.
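A small sketch of how the apply-both-modes design reveals the principal strata shares (the true shares below are hypothetical and would of course be unknown in practice):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
# Unknown true shares of types I-IV:
types = rng.choice(["I", "II", "III", "IV"], n, p=[0.5, 0.2, 0.2, 0.1])
responds_A = np.isin(types, ["I", "II"])   # types I and II answer mode A
responds_B = np.isin(types, ["I", "III"])  # types I and III answer mode B

# Applying both modes to every sampled unit reveals each unit's type,
# and hence the population shares of all four strata:
for label, mask in [("I", responds_A & responds_B),
                    ("II", responds_A & ~responds_B),
                    ("III", ~responds_A & responds_B),
                    ("IV", ~responds_A & ~responds_B)]:
    print(f"type {label}: {mask.mean():.3f}")
```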
One wrinkle that I ignored above is that the order of the modes of contact may affect either response behavior or the outcomes reported. This multiplies the number of potential response behaviors and the number of potential outcome responses given that the unit is interviewed. You could get some way past these issues by randomizing the order of the modes of contact—e.g., A-then-B for one half, and B-then-A for the other half. But you would have to impose some more assumptions to make use of this random assignment. E.g., you’d have to assume that A-then-B always-responders are exchangeable with B-then-A always-responders in order to combine the information from the always-responders in each half-sample. Or, you could “shift the goal posts” by saying that all you are interested in is the average of responses from modes A and B under the A-then-B design.
Update:
The above analysis did not explore how other types of assumptions might help to identify the population average. Andy’s proposal to use post-stratification and regressions relies (according to my understanding) on the assumption that potential outcomes are independent of mode of contact conditional on covariates. Formally, if the mode of contact is $latex M$ taking on values $latex A$ or $latex B$, the potential outcome under mode of contact $latex m$ is $latex y(m)$, $latex T$ is principal stratum, and $latex X$ is a covariate, then $latex \left[y(A),y(B)\right] \perp M | T, X$ implies that,
$latex E(y(m)|T,X) = E(y(m)|M=m, T,X) = E(y(m)|M \ne m, T,X)$.
As discussed above, the design that applies modes A and B to all units in the sample can determine principal stratum membership, and so these covariate- and principal-stratum-specific imputations can be applied. Ordering effects will again complicate things, and so more assumptions would be needed. A worthwhile type of analysis would be to study evidence of mode-of-contact effects as well as ordering effects among the type I (always-responder) units.
Now, it may be that mode of contact affects response but each unit is contacted via only one mode, A or B. Then a unit’s principal stratum membership is not identifiable, nor are the proportions of types I through IV (we would end up with two mixtures of responding and non-responding types, with no way to parse out the relative proportions of the different types). If some kind of response “monotonicity” held, that would help a little. Response monotonicity would mean that either type II or type III responders didn’t exist. Otherwise, we would have to impose more stringent assumptions. The common one would be that principal stratum membership is independent of potential responses conditional on covariates. This is a classic “ignorable non-response” assumption, and it suffers from having no testable implications.
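To illustrate how monotonicity helps (this worked example is mine): suppose half the sample is contacted via mode A and half via mode B, and suppose monotonicity rules out type III units, so $latex \pi_{III} = 0$. The observed response rates are $latex r_A = \pi_I + \pi_{II}$ and $latex r_B = \pi_I + \pi_{III} = \pi_I$, which then identify $latex \pi_I = r_B$, $latex \pi_{II} = r_A - r_B$, and $latex \pi_{IV} = 1 - r_A$. Without monotonicity, the two rates cannot separate the four shares.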
Michela Wrong on corruption, ethnicity, and development in Kenya
Within minutes of the announcement of Kibaki’s victory, the multi-ethnic settlements of Nairobi, Mombasa, Kisumu, Eldoret and Kakamega erupted. Luo and Luhya ODM supporters armed with metal bars, machetes and clubs vented their frustration and fury on local Kikuyu and members of the smaller, pro-PNU Meru, Embu and Kisii tribes, setting fire to homes and shops. The approach was brutally simplistic. Many Kikuyu, especially the young, urban poor, had actually voted ODM, regarding Raila, ‘the People’s President’, as far more sympathetic to their needs than the aloof Kibaki. But mobs don’t do nuance. Fury needs a precise shape and target if it is to find expression, and ethnicity provided that fulcrum.
[U]nder a system which decreed that all advancement was determined by tribe, such hostility was entirely rational. Had all Kenyans believed they enjoyed equal access to state resources, there would have been no explosion.
Nowhere was this dawning of ethnic self-awareness more sudden than in the slums, Kenya’s melting pots, where new frontiers coagulated like DNA strands, forming as suddenly on the ground as they had in people’s minds. The notion that urban youth would serve as midwives to the birth of a cosmopolitan, united nation looked like idealistic nonsense–the worst violence took place in places like Kibera and Mathare, and it was committed by youngsters.
In the space of only two months, Kenya had changed beyond recognition. Rolling back the migration trends of half a century, a process of self-segregation was under way.
‘The generation that harboured that kind of ethnic hatred was dying away,’ says John Kiriamiti. A former bank robber, he renounced crime to become a respectable newspaper publisher in Muranga, and now quails at the violence he once took in his stride. ‘Our children didn’t know about it. But they have understood it now, and it will take a long, long time to vanish.’
Quotes on the 2007-8 electoral crisis in Kenya from Michela Wrong’s It’s Our Turn to Eat (link), which I just finished. Most of the book follows the saga of Kibaki’s former anti-corruption adviser and whistleblower, John Githongo. It makes for a gripping narrative through which Wrong provides some nice insights on how an ethnic winner-take-all mentality has undermined Kenyan democratic politics and created pressures that erupted in the 2007-8 electoral crisis. The question naturally arises: what institutional reforms might allow a society to transcend a perilous inter-ethnic dynamic such as this? Are quotas or integrative institutions useful or harmful? This is part of some new research that Elisabeth King and I are currently undertaking. Watch this space for updates.
Advice from G. Imbens on design & analysis of RCTs
At the International Initiative for Impact Evaluation (3ie) conference in Cuernavaca this past week (link), Guido Imbens from Harvard University gave two lectures on standards for the design and analysis of randomized controlled trials (RCTs). I thought it was worth a rather lengthy post to describe what he covered. These are very important insights, and only a portion of what he discussed is posted to the web. The first lecture drew mostly from his book manuscript with Donald Rubin, while a paper associated with the second lecture is here (updated link). The summary below is based on my understanding of his lectures.
Imbens’s first lecture focused on analyzing data from RCTs using randomization-based methods. The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis—what Imbens referred to as “testing”—along with (2) a Neyman-type point estimate of the sample average treatment effect and confidence interval—what Imbens referred to as “estimation.” (See these past posts for some discussion of related points: Fisher style testing and Neyman style estimation.) Interestingly, and in a way that runs contrary to Rosenbaum’s proposed method of analysis (link), Imbens claimed that testing and estimation are separate enterprises with separate goals, and that the two should not be confused. I took it as a warning against proposals that use “inverted” tests to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way we usually think of them. (This is a point that Peter Aronow and I have developed formally in a paper that hopefully will be published soon.) Thus, Imbens’s suggestion was that a rank-based test is a good choice for null hypothesis testing, owing to its insensitivity to outliers and its relative power (i.e., its Pitman efficiency), but that estimation should be based on sample theoretic (Neyman-type) principles. In most practical cases, ordinary least squares (OLS) regression with robust standard errors produces estimates that are, in fact, justified on sample theoretic grounds (even if “randomization does not justify regression assumptions,” as Freedman famously noted).
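Here’s a minimal sketch of the two-part analysis on simulated data (not Imbens’s code): a Neyman difference-in-means with a conservative variance for estimation, and a rank-sum permutation test of the sharp null for testing:

```python
import numpy as np

rng = np.random.default_rng(4)
n_t, n_c = 50, 50
z = np.repeat([1, 0], [n_t, n_c])
y = rng.normal(0, 1, n_t + n_c) + 0.4 * z    # simulated outcomes

# Estimation (Neyman): difference in means with a conservative variance.
tau_hat = y[z == 1].mean() - y[z == 0].mean()
se = np.sqrt(y[z == 1].var(ddof=1) / n_t + y[z == 0].var(ddof=1) / n_c)

# Testing (Fisher): permutation distribution of a rank-sum statistic
# under the sharp null of no effect for any unit.
ranks = y.argsort().argsort() + 1
obs = ranks[z == 1].sum()
perm = np.array([ranks[rng.permutation(z) == 1].sum() for _ in range(5000)])
p_value = np.mean(np.abs(perm - perm.mean()) >= np.abs(obs - perm.mean()))
print(f"tau_hat = {tau_hat:.2f} (SE {se:.2f}), rank-sum p = {p_value:.3f}")
```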
Imbens devoted a good amount of time in the first lecture to special challenges in analyzing cluster randomized experiments. Cluster randomized trials present analysts with a choice between estimating a treatment effect defined at the level of individual units versus a treatment effect defined at the cluster level. Typically it is the former (unit-level) treatment effect that interests us. However, unbiased estimation is complicated by the fact that cluster-level assignment, combined with variation in cluster sizes, implies variation in unit-level treatment assignment propensities. (NB: what matters is the size of the cluster in the population, not the size of the sample from the cluster. Only under extremely artificial circumstances would cluster sizes ever be equal, so these problems are pretty much always relevant in cluster randomized trials.) One may use weights to account for these differences in assignment propensities, but this introduces other issues: simply weighting by relative cluster size introduces a scale invariance problem. Normalizing the weights removes this problem but introduces a ratio estimation problem that creates finite sample bias. (The ratio estimation problem arises because the normalization depends on the average cluster sizes in the treated and control groups, which is random.) However, the bias is typically small in moderate or large samples. The cluster-level analysis has none of these problems, since aggregating to the cluster level results in a simple randomized experiment. Given these issues, Imbens’s recommendation was to do both a cluster-level analysis and a unit-level analysis, with testing and power analysis focused on the cluster-level analysis, and estimation carried out at both levels, using normalized weights for the unit-level analysis. (The issue has also received a nice, thorough treatment by Aronow and Middleton in a paper presented at MPSA last spring—link. They come up with other randomization-based strategies for analyzing cluster randomized trials.)
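A toy version of the recommendation, with hypothetical cluster-mean data (my own simulation, not from the lecture): a normalized, size-weighted estimator for the unit-level effect alongside the simple cluster-level difference in means:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 20                                   # number of clusters
sizes = rng.integers(10, 100, m)         # unequal cluster sizes
treat = rng.permutation(np.repeat([1, 0], m // 2))
obs = rng.normal(0, 1, m) + 0.5 * treat  # observed cluster-mean outcomes

def wmean(v, w):
    return np.sum(w * v) / np.sum(w)

# Unit-level analysis: weight cluster means by cluster size, normalizing
# within each arm. Normalizing fixes the scale problem but makes this a
# ratio estimator, with some finite-sample bias.
tau_unit = (wmean(obs[treat == 1], sizes[treat == 1])
            - wmean(obs[treat == 0], sizes[treat == 0]))

# Cluster-level analysis: an unweighted difference of cluster means,
# i.e., a simple randomized experiment at the cluster level.
tau_cluster = obs[treat == 1].mean() - obs[treat == 0].mean()
print(f"unit-level: {tau_unit:.2f}, cluster-level: {tau_cluster:.2f}")
```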
The second lecture was on design considerations, focusing primarily on (1) optimal design for cluster randomized trials and (2) approaches to “re-randomization” and generalized pre-treatment covariate balancing. On the design of cluster randomized trials, Imbens reviewed an ongoing debate in the statistical literature, in which (1) Snedecor/Cochran and Box/Hunter/Hunter have proposed that in cluster RCTs with small or moderate numbers of clusters, we might have good reason to avoid stratification or pair matching; (2) Donner/Klar have proposed that some stratification is almost always useful but that pair matching creates too many analytic problems to be worth it; and most recently, (3) Imai/King/Nall have proposed that pair matching should always be done when possible. Imbens ultimately comes out on the side of Donner/Klar. His take is that the very existence of the debate rests largely on the confusion between testing and estimation: stratification (with pair matching as the limit) always makes estimation more efficient, but in some situations it may reduce power for testing. This may seem paradoxical, but remember that power is a function of how a design affects the variability of point estimates (efficiency, or $latex V(\hat \beta)$) relative to how it affects the variability of the estimates of the variance of point estimates (that is, $latex V[\hat V(\hat \beta)]$). This is apparent when you think of power in terms of the probability of rejecting the null using a t-statistic, for which the point estimate is in the numerator and the square root of the variance estimate is in the denominator. Noise in both the numerator and the denominator affects power. Even though stratification will always reduce the variability of the point estimate, in some cases the estimate of the variance of this point estimate can become unstable, undermining power.
Even so, clear and tractable instances in which stratification leads to a loss of power arise only in very rigidly defined scenarios. Imbens demonstrated a case with constant treatment effects, homoskedasticity, all-normal data, and an uninformative covariate in which a t-test loses power from stratification. But loosening any of these conditions led to ambiguous results, as did replacing the t-test with permutation tests. These pathological cases are not a reliable basis for selecting a design. Rather, a more reliable basis is the fact that stratification always improves efficiency. The upshot is that some pre-treatment stratification (i.e., blocking) is always recommended, whether the trial is cluster randomized or unit randomized.
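A rough simulation sketch of the two moments at play (the data-generating process is mine, not Imbens’s): compare the variability of the point estimate and the variability of its estimated variance under complete versus paired randomization. The exact numbers are not the point; the sketch just shows how both quantities can be examined by simulation, using the naive Neyman variance estimator in both designs:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 16, 2000
x = np.sort(rng.normal(0, 1, n))         # covariate used to form pairs

def one_draw(paired):
    y0 = x + rng.normal(0, 1, n)         # potential outcomes, constant effect
    if paired:                           # adjacent units (by x) form pairs
        z = np.concatenate([rng.permutation([1, 0]) for _ in range(n // 2)])
    else:                                # complete randomization
        z = rng.permutation(np.repeat([1, 0], n // 2))
    y = y0 + 0.5 * z
    tau = y[z == 1].mean() - y[z == 0].mean()
    v = y[z == 1].var(ddof=1) / (n // 2) + y[z == 0].var(ddof=1) / (n // 2)
    return tau, v

for paired in (False, True):
    draws = np.array([one_draw(paired) for _ in range(reps)])
    label = "paired" if paired else "complete"
    print(f"{label}: V(tau_hat) = {draws[:, 0].var():.4f}, "
          f"V(V_hat) = {draws[:, 1].var():.6f}")
```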
The question then becomes, how much stratification? Here, Imbens disagreed with Imai/King/Nall by proposing that pair matching introduces more analytical problems than it’s worth relative to lower-order stratification. The problem with pair matching is that the within-pair variance of the estimated treatment effect is unidentified (you need at least two treated and two control units to identify the variance of the estimated treatment effect). Thus, one must use the least upper bound identified by the data, which is driven by the between-pair variance. This leads to an overconservative estimator that tosses out the efficiency gains from pairing. It also prevents examination of heterogeneity that may be of interest. Thus, Imbens’s recommendation was stratification with at least two treated and two control units or clusters per stratum.
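In symbols (my rendering of the standard result, not a quote from the lecture): with $latex K$ pairs and pair-level effect estimates $latex \hat{\tau}_k$, one uses $latex \hat{\tau} = \frac{1}{K}\sum_k \hat{\tau}_k$ with variance estimate $latex \hat{V}(\hat{\tau}) = \frac{1}{K(K-1)}\sum_k (\hat{\tau}_k - \hat{\tau})^2$. This estimate is built entirely from between-pair variation and is conservative, which is what tosses out part of the efficiency gain from pairing.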
(I asked Imbens how he would propose to carry out the stratification, especially when data are of mixed types and sparseness was a potential problem. His recommendation was dimension reduction. Thus, one could use predicted baseline values from a rich covariate model, clustering algorithms constrained by baseline values, or slightly more flexible clustering. The goal is to reduce the stratification problem to one or a few manageable dimensions.)
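One way the dimension-reduction idea might look in code (a sketch under my own assumptions, with a linear index as a stand-in for predicted baseline values from a richer model fit to baseline data): order units on the one-dimensional score and form strata of four, with two treated and two control in each, consistent with the recommendation above:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
covariates = rng.normal(0, 1, (n, 8))    # mixed baseline covariates
weights = rng.normal(0, 1, 8)
baseline = covariates @ weights          # stand-in for a predicted baseline value

# Reduce stratification to one dimension: order units by the predicted
# baseline value and form strata of four (two treated, two control each).
strata = np.empty(n, dtype=int)
strata[np.argsort(baseline)] = np.arange(n) // 4

z = np.empty(n, dtype=int)
for s in np.unique(strata):
    idx = np.flatnonzero(strata == s)
    z[idx] = rng.permutation([1, 1, 0, 0])
print(f"{len(np.unique(strata))} strata, {z.sum()} treated of {n}")
```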
The lecture on design also covered some new (and rather inchoate) ideas on “re-randomization” and generalized covariate balancing. The problem that he was addressing was the situation where you perform a randomization, but then note problematic imbalances in the resulting experiment. Should you rerandomize? If so, what should be the principles to guide you? Imbens’s take was that, yes, you should rerandomize, but that it should be done in a systematic manner. The process should (1) define, a priori, rejection rules for assignments, (2) construct the set of all possible randomizations (or an arbitrarily large sample of all possible randomizations), (3) apply the rejection rule, and then (4) perform the actual treatment assignment by randomizing within this set of acceptable randomizations. This type of “restricted randomization” typically uses covariate balance criteria to form the rejection rules; as such, it generalizes the covariate balancing objective that typically motivates stratification.
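A minimal sketch of steps (1) through (4), with a hypothetical rejection rule based on a single covariate’s mean difference (the threshold and sample size are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 40
x = rng.normal(0, 1, n)                  # baseline covariate to balance on

def imbalance(z):
    return abs(x[z == 1].mean() - x[z == 0].mean())

# (1) define the rejection rule a priori; (2) draw a large sample of
# candidate assignments; (3) apply the rule; (4) randomize within the
# set of acceptable assignments.
threshold = 0.1                          # hypothetical balance criterion
candidates = [rng.permutation(np.repeat([1, 0], n // 2)) for _ in range(10_000)]
acceptable = [z for z in candidates if imbalance(z) < threshold]
z_final = acceptable[rng.integers(len(acceptable))]
print(f"{len(acceptable)} of {len(candidates)} assignments accepted")
```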
Some complications arise, however. There is no clear basis on which to define a minimum balance criterion. One could seek to maximize the balance that would obtain in an experiment. Doing so, one would effectively partition the sample into two groups that are maximally balanced and randomly select one of the two partitions to be the treatment group. But in this case, the experiment would reduce to a cluster randomized experiment with only two clusters! There would be a strong external validity cost, and testing based on permutation would be degenerate. I asked Imbens about this, and his recommendation was that the rejection rules used to perform restricted randomization should maintain a “reasonable” degree of randomness in the assignment process. I think the right solution here is still an open question.
Imbens’s lectures are part of 3ie’s efforts toward standardization in randomized impact evaluations—standardization that greatly facilitates systematic reviews and meta-analyses of interventions, which is another of 3ie’s core activities (link). For those involved in RCTs and field experiments in development, I highly recommend engaging with 3ie’s work.
Why positive theorists should love preregistration & ex ante science
Over at the IQSS Social Science Statistics blog, Richard Nielsen had a great post on pre-registration (link). He writes,
In response to a comment by Chris Blattman, the Givewell blog has a nice post with “customer feedback” for the social sciences. Number one on the wish-list is pre-registration of studies to fight publication bias — something along the lines of the NIH registry for clinical trials. I couldn’t agree more. I especially like that Givewell’s recommendations go beyond the usual call for RCT registration to suggest that we should also be registering observational studies.
As Richard notes, much of the interest in pre-registration is to reduce the publication bias that most certainly afflicts us (evidence from Gerber and Malhotra here). As a result, as in medicine, most published research findings are probably false (link). Some more arguments in favor of pre-registration to control publication bias in both clinical trials and observational studies are given by Neuroskeptic (link).
What I want to propose is that pre-registration, and the “ex ante science” ideal on which it is based, is great for positive theorists, especially those who want to do mostly or even only positive theory. How so? For two reasons. First, theory and substantive insights are what editorial boards would need to judge the quality and relevance of research questions and hypotheses. Second, ex ante science provides a steadier stream of puzzles that theorists can delight in trying to work out.
A central role in publication decision processes
Obviously, the quality and relevance of hypotheses ought to be taken into account in judging whether a submission is publication worthy. For a causal or descriptive analysis, identification alone is insufficient to make a study worth one’s time. We want to know if the questions being asked are important and whether the hypotheses are coherent. Consider a journal publication decision mechanism that applies the logic of ex ante science. A proposed submission comes in with hypotheses and research design spelled out, but no analysis has actually been done. Based on the hypotheses and design, the editorial board makes an accept, reject, or revise-and-resubmit decision. This ultimate accept/reject decision is made publicly, pre-registering the accepted study. It is similar to what happens these days with grant proposals and IRB reviews, but it is explicitly tied to a public publication commitment. Once the data are gathered and results processed, the paper is checked against the accepted and publicly registered hypotheses and design. So long as it conforms, it is published irrespective of the results.
Theorists would play a crucial role in the initial decision. If the study examines the effect of a policy, on what basis should we expect any meaningful effects? Are the hypotheses compelling? Will the results do much to affect our understanding of an important problem? Strong deductive reasoning and substantive familiarity with the problem are what shape our priors, and such is exactly what we need to make this call.
A steadier stream of meaningful puzzles
Popperian epistemology proposes that knowledge advances through falsification, but the current pro-“significance” bias means that this happens too infrequently. We end up continuing to give credibility to theoretical propositions that empirical researchers are too afraid to falsify. We fail to offer theorists all the puzzles that they deserve. Ex ante science will provide a steadier stream of puzzles, meaning more work and more fun for those who want to focus mostly on theory, and more meaningful interaction between theory and empirical research.