Regression discontinuity designs and missing data: some non-intuitive results

As with all sorts of data analysis scenarios, when you are carrying out a regression discontinuity (RD) study, you might have some missing data. For example, suppose you are using the (now, classic) Lee (2008) research design to estimate the effects of being elected to political office (ungated link; but note critiques of this design as discussed here: link). In this design, we exploit the fact that in plurality winner-take-all elections, there is an “as if” random aspect to the outcomes of elections that have very close margins of victory. E.g., a two-candidate race in which the winner got 50.001% of the vote and the loser got 49.999% is one in which one could imagine the outcome having gone the other way. Thus, using margin of victory as a forcing variable sets you up for a nice RD design. But, the rub is that you need data on both the winners and the losers in order to track the effects of winning versus losing. Unfortunately, it may be the case that data on the losers is hard to come by. They may drop out of the public eye, and so you will have a selected set of losers that you are able to observe. That is, you have missing outcome data on losers.

In a working paper in which I review semi-parametric methods for handling missing data, I discuss how inverse probability weighting (IPW), imputation, and their combination via “augmented” inverse probability weighting (AIPW) might be well suited to the task of dealing with missing data for RD designs (link to working paper). The problem with these methods, though, is that they rely on an “ignorability” assumption: that is, for the cases whose outcomes are missing, you nonetheless have enough other information on them to be able either (i) to predict well whether they are missing or (ii) to predict well what their outcomes would have been had they been observed. Ignorability is a strong and untestable assumption. Thus, in the paper I discuss very briefly the idea of doing sensitivity analysis to examine violations of ignorability.

I am currently re-working that paper, and in doing so working out how exactly such a sensitivity analysis ought to be carried out. As a useful prompt, I received an email recently asking precisely what one might do to study sensitivity to missing data in the RD scenario. My proposal was that one could do a few things. First, one could compute bounds based on imputing extreme values for the missing outcomes, and seeing what that suggests. This would be along the lines of a Manski-type “partial identification” approach to studying sensitivity to missing data (see Ch. 2 in this book: link). Second, one could do an IPW adjusted analysis (keeping to the side imputation and AIPW for the moment), and see how your results change. Third, one could do a sensitivity analysis for the IPW analysis. A sensitivity analysis that I imagine is to take the complete cases and residualize their outcome variable values relative to the forcing variable and any other covariates used to predict the missingness weights. Then, scale these residuals relative to the strongest predictor of missingness (determined, for example, using a standardized regression analysis). And finally, examine the consequences of increasing the influence of these scaled residuals in predicting missingness. Then, you would have a way to examine sensitivity to ignorability in a manner that is scaled to the strongest predictor and its influence.

These suggestions were a bit off the cuff, although they made intuitive sense. Nonetheless, I wanted to check myself. So I did a toy simulation. The findings were surprising.

The R code for the toy simulation is here: R code.[1] The code contains two examples of trying to predict an intercept value using a local linear approximation to a non-linear relationship. This is one side of what one is trying to do in a standard RD analysis. In one example, ignorability holds, and in the other it does not. Ideally, what I would like you to do is to run that code line by line in R and look at the graphical output that shows the result of each approach to addressing the missing data problem. The steps include, first computing a benchmark prediction that would obtain were there no missingness (“all data” scenario). Then, we look at predictions resulting from,

(1) doing nothing — aka, complete case analysis (“compl. cases” scenario);
(2) imputing extrema (“imputation lower bound” and “imputation upper bound”, where lower and upper are referring to whether min or max values are imputed);
(3) IPW adjustment (“IPW”), and then
(4) an IPW sensitivity analysis along the lines discussed above (“IPW low” and “IPW high”).
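
To make the first and third of these scenarios concrete, here is a minimal sketch in Python. It is not the R code linked above. It assumes a made-up data-generating process (an exponential outcome curve on a window of the forcing variable from 0 to 1), a response probability that depends only on the forcing variable (so ignorability holds), and response probabilities that are known rather than estimated. The names `wls_line`, `simulate`, and `intercept_estimates` are hypothetical helpers invented for this illustration.

```python
import math
import random


def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def simulate(n=2000, seed=1):
    """Toy data: outcome is non-linear in x; response depends on x only."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    y = [math.exp(xi) + rng.gauss(0.0, 0.1) for xi in x]
    p = [0.9 - 0.7 * xi for xi in x]      # assumed known response probability
    r = [rng.random() < pi for pi in p]   # True if the outcome is observed
    return x, y, p, r


def intercept_estimates(x, y, p, r):
    """Intercept (prediction at x = 0) from a linear fit under each scenario."""
    obs = [(xi, yi, pi) for xi, yi, pi, ri in zip(x, y, p, r) if ri]
    xo = [o[0] for o in obs]
    yo = [o[1] for o in obs]
    po = [o[2] for o in obs]
    return {
        "all data": wls_line(x, y, [1.0] * len(x))[0],
        "compl. cases": wls_line(xo, yo, [1.0] * len(xo))[0],
        "IPW": wls_line(xo, yo, [1.0 / pi for pi in po])[0],
    }


estimates = intercept_estimates(*simulate())
for label, value in estimates.items():
    print(label, round(value, 3))
```

Because the straight line is only an approximation to the curve, dropping cases changes the distribution of the forcing variable among the observed units and therefore the fitted intercept. Weighting each complete case by the inverse of its response probability is meant to undo that shift. In practice the probabilities would have to be estimated, which this sketch sidesteps.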

For those who can’t run the code, here is a (very crowded) PDF that graphs color-coded results for the two examples: PDF. You will have to zoom in to make sense of it. It shows where all the predictions landed. We want to be in the neighborhood of the intercept prediction labeled “all data”.

Here are the basic conclusions from this toy simulation. First, and most importantly, the bounds from imputing extrema don’t necessarily cover what you would get if you had all the data! This occurred in the ignorable case. This was surprising to me and it’s worth considering more deeply. The problem is due to the fact that one is trying to predict an intercept here, and so imputing high or low values for the missing data does not necessarily imply that the resulting intercept estimate will cover what you would get with the full data. It seems that the linear approximation to the non-linear relationship is compounding the problem, but that is a conjecture that needs to be assessed analytically. I found this quite interesting, and it suggests that what we know about sensitivity analysis for simple difference in means types estimation does not necessarily travel to the RD world.
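
One mechanism that is consistent with this result, though not necessarily the whole story behind the simulation, is that a least-squares intercept is a weighted sum of the outcomes in which some weights are negative. For a simple linear fit, the weight on unit i is 1/n - xbar * (x_i - xbar) / Sxx, so units far from the intercept on the other side of the mean of x get negative weight. Imputing the maximum for such units pushes the intercept down, not up. The sketch below uses invented numbers and an unweighted fit (a kernel-weighted local linear fit behaves analogously); the function names are hypothetical.

```python
def ols_intercept_weights(x):
    """Weights w_i such that the OLS intercept equals sum(w_i * y_i)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n - xbar * (xi - xbar) / sxx for xi in x]


def intercept(x, y):
    return sum(w * yi for w, yi in zip(ols_intercept_weights(x), y))


# Forcing variable observed for everyone; outcomes missing for the last three.
x = [i / 10 for i in range(1, 11)]
y_observed = [0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55]   # invented values
n_missing = len(x) - len(y_observed)
y_min, y_max = 0.0, 1.0                                    # assumed support

weights = ols_intercept_weights(x)

# Naive "bounds": impute the same extreme value for every missing outcome.
naive_low = intercept(x, y_observed + [y_min] * n_missing)
naive_high = intercept(x, y_observed + [y_max] * n_missing)

# Sign-aware bounds: choose the extreme according to the sign of each weight.
missing_w = weights[len(y_observed):]
sharp_low = intercept(
    x, y_observed + [y_min if w > 0 else y_max for w in missing_w])
sharp_high = intercept(
    x, y_observed + [y_max if w > 0 else y_min for w in missing_w])

print(naive_low, naive_high)   # the "high" imputation gives the smaller value here
print(sharp_low, sharp_high)
```

In this constructed example all three missing units carry negative weight, so the two naive imputations come out in reversed order. Choosing the imputed extreme by the sign of each unit's weight restores a valid worst-case interval for this particular linear estimator.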

Second, the IPW sensitivity analysis is pretty straightforward, and works as expected. However, it requires that you choose—more or less, out of thin air—what defines an “extreme” violation of ignorability. Also, the IPW sensitivity analysis still leaves you with a fairly tight range of possible outcomes. This is not necessarily a good thing, because the tightness might imply that there are still some assumptions that aren’t being subjected to enough scrutiny. So I think this is a promising approach, but probably needs a lot of consideration and justification when used in practice.

Third, in these examples, IPW always removed at least some of the bias, although there are cases where this may not happen (see p. 21 of my paper linked above for an example).

So there are some interesting wrinkles here, the most important being that Manski type approaches may not travel well to the RD scenario. I need to check to be sure I’m not doing something wrong, but assuming I am not, that’s an important take-away.

[1] Okay, the code is pretty rough in that lots of things are repeated that should be routinized into functions, but you know what? I never claimed to be a computer programmer. Just someone who knows enough to get what I need!


Notes on regression discontinuity designs and analysis

  1. A post from last week by Jenny Aker at the Savings Revolution blog (link) proposes strategies for rigorous impact assessment when full randomization is not possible. Her third suggestion is a friendly way of saying that regression discontinuity (RD) designs should be used more, not only in analyzing existing interventions, but in designing new ones. If we can use quantifiable indices to determine who qualifies to receive program benefits, and if the indices are used faithfully in actually determining who gets benefits, then we can use the indices to carry out an RD analysis. This should be appealing to practitioners because it provides a transparent and relatively incorruptible method for beneficiary selection and it is sensitive to concerns that those most in need be most eligible for assistance, while minimally compromising our ability to estimate program impacts. As methodologists, I think we need to do more to sell this approach in cases where full randomization is not feasible.

  2. A relatively new paper by Papay et al. in Journal of Econometrics (gated link) demonstrates ways to generalize RD analysis to multiple assignment variables and cutoffs in multiple dimensions. The killer graph from the paper is shown above. In this case, you have treatment assignment based on cutoffs on two variables, labeled as X1 and X2 on the graph (the vertical axis is the outcome variable). Cutoffs in two dimensions create four treatment regions, A, B, C, and D. The analysis proceeds by using a regression to model the response surface in each region. Then, you can obtain predicted values along each of the discontinuity edges. These predictions can be subtracted from each other and aggregated to produce various types of average treatment effects. All of this can happen more or less automatically with a single regression specification, although one should take care to understand the manner in which such a regression “averages” the various available treatment effects (I believe that it produces a covariance-weighted average, rather than a sample weighted average, along the lines of what Angrist and Pischke discuss in Mostly Harmless…).

  3. A colleague and I were discussing tests for the identifying assumptions for RD. It seems that there have been some calls to test for “balance” in covariates around cutpoints to assess whether identifying assumptions are met for RD. The idea of these tests is that in the neighborhood of the cutpoint, covariate distributions should be equal. Balance is thus tested using the permutation distribution under this null hypothesis. To me, this sounds like one is imposing more assumptions than necessary for an RD design. RD requires smoothness in covariates, not balance. The “R” in RD is there for a reason. If balance were a necessity, we should just call it “D”! Covariate means might differ on either side of the cutpoint within arbitrarily small windows, without there being a violation of the smoothness condition. In this case, a balance test would lead one to conclude that identifying conditions are not met when in fact they are (that is, the test would be trigger happy, with an inflated type I error rate). The direct test for smoothness is a “placebo” regression of the covariate, where you estimate the existence of a discontinuity (refer to Imbens and Lemieux, gated link). I suppose one could construct a permutation test that also looks for smoothness/discontinuities, but the balance tests on adjusted covariates strike me as erroneous.
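
The distinction in the third note can be illustrated with a small sketch, assuming invented data in which a covariate trends linearly in the forcing variable with no jump at the cutpoint. The helper names (`ols_line`, `placebo_jump`, `mean_difference`) are hypothetical, and the sketch omits standard errors and kernel weighting, so it shows only the point estimates that a balance check and a placebo regression would look at.

```python
def ols_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def split_window(x, z, cutoff, bandwidth):
    """Covariate values on each side of the cutoff, with x centered at it."""
    left = [(xi - cutoff, zi) for xi, zi in zip(x, z)
            if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, zi) for xi, zi in zip(x, z)
             if cutoff <= xi <= cutoff + bandwidth]
    return left, right


def mean_difference(x, z, cutoff=0.0, bandwidth=1.0):
    """What a balance check compares: covariate means on the two sides."""
    left, right = split_window(x, z, cutoff, bandwidth)
    return (sum(zi for _, zi in right) / len(right)
            - sum(zi for _, zi in left) / len(left))


def placebo_jump(x, z, cutoff=0.0, bandwidth=1.0):
    """Placebo regression: gap between the two fitted lines at the cutoff."""
    left, right = split_window(x, z, cutoff, bandwidth)
    a_left = ols_line([p[0] for p in left], [p[1] for p in left])[0]
    a_right = ols_line([p[0] for p in right], [p[1] for p in right])[0]
    return a_right - a_left


# Invented covariate: smooth (linear) in the forcing variable, no jump at 0.
x = [-0.95 + 0.1 * i for i in range(20)]
z_smooth = [1.0 + 2.0 * xi for xi in x]
# Same covariate with an artificial jump of 0.5 at the cutoff.
z_jump = [zi + (0.5 if xi >= 0 else 0.0) for xi, zi in zip(x, z_smooth)]

print(mean_difference(x, z_smooth), placebo_jump(x, z_smooth))
print(mean_difference(x, z_jump), placebo_jump(x, z_jump))
```

For the smooth covariate the means differ across the cutpoint even though the fitted lines meet there, which is the situation in which a balance test and a smoothness test disagree.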


Combining information from surveys that use multiple modes of contact

Over at the Statistical Modeling, Causal Inference, and Social Science blog (link), Andrew Gelman writes,

I’m involved (with Irv Garfinkel and others) in a planned survey of New York City residents. It’s hard to reach people in the city–not everyone will answer their mail or phone, and you can’t send an interviewer door-to-door in a locked apartment building. (I think it violates IRB to have a plan of pushing all the buzzers by the entrance and hoping someone will let you in.) So the plan is to use multiple modes, including phone, in person household, random street intercepts and mail.

The question then is how to combine these samples. My suggested approach is to divide the population into poststrata based on various factors (age, ethnicity, family type, housing type, etc), then to pool responses within each poststratum, then to run some regressions including poststrata and also indicators for mode, to understand how respondents from different modes differ, after controlling for the demographic/geographic adjustments.

Maybe this has already been done and written up somewhere?


It’s interesting to consider this problem by combining a “finite population” perspective with some ideas about “principal strata” from the causal inference literature. Suppose a finite population U from which we draw a sample of N units. We have two modes of contact, A and B. Suppose for the moment that each unit can be characterized by one of the following response types (these are the “principal strata”):

Type   Mode A response   Mode B response
I      1                 1
II     1                 0
III    0                 1
IV     0                 0


Then, there are two cases to consider, depending on whether mode of contact affects response:

Mode of contact does not affect response

This might be a valid assumption if the questions of interest are not subject to social desirability biases, interviewer effects, etc. In this case, it is easy to define a target parameter as the average response in the population. You could proceed efficiently by first applying mode A to the sample, and then applying mode B to those who did not respond with mode A. At the end, you would have outcomes for types I, II, and III units, and you’d have an estimate of the rate of type IV units in the population. You could content yourself with an estimate for the average response on the type I, II, and III subpopulation. If you wanted to recover an estimate of the average response for the full population (including type IV’s), you would effectively have to impute values for type IV respondents. This could be done by using auxiliary information either to genuinely impute or (in a manner that is pretty much equivalent) to determine which type I, II, or III units resemble the missing type IV units, and up-weight. In any case, if the response of interest has finite support, one could also compute “worst case” (Manski-type) bounds on the average response by imputing maximum and minimum values to type IV units.
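
As a small illustration of the worst-case bounds just mentioned, here is a sketch for a binary outcome. The data are invented, the sample is treated as a simple random sample from the population, and `worst_case_bounds` is a hypothetical helper name.

```python
def worst_case_bounds(observed, n_missing, y_min, y_max):
    """Bounds on the mean when n_missing outcomes are unobserved.

    Every missing outcome is imputed at the minimum (lower bound) or the
    maximum (upper bound) of the outcome's support.
    """
    n = len(observed) + n_missing
    total = sum(observed)
    lower = (total + n_missing * y_min) / n
    upper = (total + n_missing * y_max) / n
    return lower, upper


# Invented example: six responders (types I-III), four type IV non-responders.
responses = [1, 0, 1, 1, 0, 1]
lower, upper = worst_case_bounds(responses, n_missing=4, y_min=0, y_max=1)
print(lower, upper)
```

The width of the interval equals the non-response rate times the range of the outcome, so here it is 0.4 * 1 = 0.4.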

Mode of contact affects response

This might be relevant if, for example, the modes of contact are phone call versus face-to-face interview, and outcomes being measured vary depending on whether the respondent feels more or less exposed in the interview situation. This possibility makes things a lot trickier. In this case, each unit is characterized by a response under mode A and another under mode B (that is, two potential outcomes). One immediately faces a quandary of defining the target parameter. Is it the average of responses under the two modes of contact? Maybe it is some “latent” response that is imperfectly revealed under the two modes of contact? If so, how can we characterize this “imperfection”? Furthermore, only for type I individuals will you be able to obtain information on both potential responses. Does it make sense to restrict ourselves to this subpopulation? If not, then we would again face the need for imputation. A design that applied both mode A and mode B to the complete sample would mechanically reveal the proportion of type I units in the population, and by implication would identify the proportion of type II, III, and IV units. For type II units we could use mode A responses to improve imputations for mode B responses, and vice versa for type III respondents. Type IV respondents’ contributions to our estimate of the “average response” would be based purely on auxiliary information. Again, one could construct worst case bounds by imputing maximum and minimum response values for each of the missing response types.

One wrinkle that I ignored above was that the order of modes of contact may affect either response behavior or outcomes reported. This multiplies the number of potential response behaviors and the number of potential outcome responses given that the unit is interviewed. You could get some way past these issues by randomizing the order of mode of contact, e.g., A then B for one half, and B then A for the other half. But you would have to impose some more assumptions to make use of this random assignment. E.g., you’d have to assume that A-then-B always-responders are exchangeable with B-then-A always-responders in order to combine the information from the always-responders in each half-sample. Or, you could “shift the goal posts” by saying that all you are interested in is the average of responses from modes A and B under the A-then-B design.

Update:

The above analysis did not explore how other types of assumptions might help to identify the population average. Andy’s proposal to use post-stratification and regressions relies (according to my understanding) on the assumption that potential outcomes are independent of mode of contact conditional on covariates. Formally, if the mode of contact is $latex M$ taking on values $latex A$ or $latex B$, the potential outcome under mode of contact $latex m$ is $latex y(m)$, $latex T$ is the principal stratum, and $latex X$ is a covariate, then $latex \left[y(A),y(B)\right] \perp M | T, X$ implies that,

$latex E(y(m)|T,X) = E(y(m)|M=m, T,X) = E(y(m)|M \ne m, T,X)$.

As discussed above, the design that applies modes A and B to all units in the sample can determine principal stratum membership, and so these covariate- and principal-stratum specific imputations can be applied. Ordering effects will again complicate things, and so more assumptions would be needed. A worthwhile type of analysis would be to study evidence of mode-of-contact as well as ordering effects among the type I (always-responder) units.

Now, it may be that mode of contact affects response but each unit is contacted via only one of modes A or B. Then, a unit’s principal stratum membership is not identifiable, nor is the proportion of types I through IV identifiable (we would end up with two mixtures of responding and non-responding types, with no way to parse out relative proportions of the different types). If some kind of response “monotonicity” held, then that would help a little. Response monotonicity would mean that either type II or type III responders didn’t exist. Otherwise, we would have to impose more stringent assumptions. The common one would be that principal stratum membership is independent of potential responses conditional on covariates. This is a classic “ignorable non-response” assumption, and it suffers from having no testable implications.


Michela Wrong on corruption, ethnicity, and development in Kenya

Within minutes of the announcement of Kibaki’s victory, the multi-ethnic settlements of Nairobi, Mombasa, Kisumu, Eldoret and Kakamega erupted. Luo and Luhya ODM supporters armed with metal bars, machetes and clubs vented their frustration and fury on local Kikuyu and members of the smaller, pro-PNU Meru, Embu and Kisii tribes, setting fire to homes and shops. The approach was brutally simplistic. Many Kikuyu, especially the young, urban poor, had actually voted ODM, regarding Raila, ‘the People’s President’, as far more sympathetic to their needs than the aloof Kibaki. But mobs don’t do nuance. Fury needs a precise shape and target if it is to find expression, and ethnicity provided that fulcrum.

[U]nder a system which decreed that all advancement was determined by tribe, such hostility was entirely rational. Had all Kenyans believed they enjoyed equal access to state resources, there would have been no explosion.

Nowhere was this dawning of ethnic self-awareness more sudden than in the slums, Kenya’s melting pots, where new frontiers coagulated like DNA strands, forming as suddenly on the ground as they had in people’s minds. The notion that urban youth would serve as midwives to the birth of a cosmopolitan, united nation looked like idealistic nonsense–the worst violence took place in places like Kibera and Mathare, and it was committed by youngsters.

In the space of only two months, Kenya had changed beyond recognition. Rolling back the migration trends of half a century, a process of self-segregation was under way.

‘The generation that harboured that kind of ethnic hatred was dying away,’ says John Kiriamiti. A former bank robber, he renounced crime to become a respectable newspaper publisher in Muranga, and now quails at the violence he once took in his stride. ‘Our children didn’t know about it. But they have understood it now, and it will take a long, long time to vanish.’


Quotes on the 2007-8 electoral crisis in Kenya from Michela Wrong’s It’s Our Turn to Eat (link), which I just finished. Most of the book follows the saga of Kibaki’s former anti-corruption adviser and whistleblower, John Githongo. It makes for a gripping narrative through which Wrong provides some nice insights on how an ethnic winner-take-all mentality has undermined Kenyan democratic politics and created pressures that erupted in the 2007-8 electoral crisis. The question naturally arises: what institutional reforms might allow a society to transcend a perilous inter-ethnic dynamic such as this? Are quotas or integrative institutions useful or harmful? This is part of some new research that Elisabeth King and I are currently undertaking. Watch this space for updates.


Advice from G. Imbens on design & analysis of RCTs

At the International Initiative for Impact Evaluation (3ie) conference in Cuernavaca this past week (link), Guido Imbens from Harvard University gave two lectures on standards for the design and analysis of randomized control trials (RCTs). I thought it was worth a rather lengthy post to describe what he covered. These are very important insights, and only a portion of what he discussed is posted to the web. The first lecture drew mostly from his book manuscript with Donald Rubin, while a paper associated with the second lecture is here (updated link). The summary below is based on my understanding of his lectures.

Imbens’s first lecture focused on analyzing data from RCTs using randomization-based methods. The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis (what Imbens referred to as “testing”) along with (2) a Neyman-type point estimate of the sample average treatment effect and confidence interval (what Imbens referred to as “estimation”). (See these past posts for some discussion of related points: Fisher style testing and Neyman style estimation.) Interestingly, and in a way that runs contrary to Rosenbaum’s proposed method of analysis (link), Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them. (This is a point that Peter Aronow and I have developed formally in a paper that hopefully will be published soon.) Thus, Imbens’s suggestion was that a rank-based test was a good choice for null hypothesis testing, owing to its insensitivity to outliers and relative power (i.e., its Pitman efficiency), but that estimation should be based on sample theoretic (Neyman-type) principles. In most practical cases, ordinary least squares (OLS) regression with robust standard errors produces estimates that are, in fact, justified on sample theoretic grounds (even if “randomization does not justify regression assumptions”, as Freedman famously noted).
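
To make the testing/estimation split concrete, here is a minimal sketch for a tiny completely randomized experiment with invented outcomes. For simplicity the permutation test uses the absolute difference in means as its statistic rather than the rank-based statistic recommended in the lecture, and the function names are hypothetical.

```python
import itertools
import math


def diff_in_means(y, treated):
    """Treated-minus-control difference in means; `treated` is a set of indices."""
    y1 = [yi for i, yi in enumerate(y) if i in treated]
    y0 = [yi for i, yi in enumerate(y) if i not in treated]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def fisher_exact_p(y, treated):
    """Two-sided permutation p-value under the sharp null of no effect.

    Enumerates every assignment with the same number of treated units.
    """
    observed = abs(diff_in_means(y, treated))
    assignments = list(itertools.combinations(range(len(y)), len(treated)))
    extreme = sum(
        1 for a in assignments
        if abs(diff_in_means(y, set(a))) >= observed - 1e-12
    )
    return extreme / len(assignments)


def neyman_estimate(y, treated):
    """Difference in means with the conservative Neyman standard error."""
    y1 = [yi for i, yi in enumerate(y) if i in treated]
    y0 = [yi for i, yi in enumerate(y) if i not in treated]
    m1, m0 = sum(y1) / len(y1), sum(y0) / len(y0)
    v1 = sum((yi - m1) ** 2 for yi in y1) / (len(y1) - 1)
    v0 = sum((yi - m0) ** 2 for yi in y0) / (len(y0) - 1)
    return m1 - m0, math.sqrt(v1 / len(y1) + v0 / len(y0))


# Invented outcomes: the first four units are treated.
y = [3, 5, 4, 6, 1, 2, 0, 2]
treated = {0, 1, 2, 3}

p_value = fisher_exact_p(y, treated)
estimate, std_error = neyman_estimate(y, treated)
print(p_value, estimate, std_error)
```

The p-value answers only whether the sharp null is plausible; the point estimate and standard error come from a separate calculation and are not obtained by inverting the test.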

Imbens devoted a good amount of time in the first lecture to special challenges in analyzing cluster randomized experiments. Cluster randomized trials present analysts with a choice between estimating a treatment effect defined at the level of individual units versus a treatment effect defined at the cluster level. Typically it is the former (unit-level) treatment effect that interests us. However, unbiased estimation is complicated by the fact that cluster-level assignment, combined with variation in cluster sizes, implies variation in unit-level treatment assignment propensities. (NB: What matters is the size of the cluster in the population, not the size of the sample from the cluster. Only under extremely artificial circumstances would cluster sizes ever be equal, so these problems are pretty much always relevant in cluster randomized trials.) One may use weights to account for these differences in assignment propensities, but this introduces other issues: simply weighting by relative cluster size introduces a scale invariance problem. Normalizing the weights removes this problem but introduces a ratio estimation problem that creates finite sample bias. (The ratio estimation problem arises because the normalization depends on the average cluster sizes in the treated and control groups, which is random.) However, the bias is typically small in moderate or large samples. The cluster level analysis has none of these problems, since aggregating to the cluster level results in a simple randomized experiment. Given these issues, Imbens’s recommendation was to do both a cluster-level analysis and unit-level analysis, with testing and power analysis focused on the cluster level analysis, and estimation carried out at both levels, using normalized weights for the unit-level analysis. (The issue has also received a nice, thorough treatment by Aronow and Middleton in a paper presented at MPSA last spring (link). They come up with other randomization-based strategies for analyzing cluster randomized trials.)
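
Here is a small sketch of one reading of the invariance problem described above. It assumes complete randomization of a fixed number of clusters, compares an unnormalized (Horvitz-Thompson style) estimator with a normalized (ratio) estimator, and checks what happens when a constant is added to every unit's outcome. The cluster sizes and totals are invented, the function names are hypothetical, and the lecture may have had a somewhat different formulation in mind.

```python
def unnormalized_estimate(totals, sizes, treated):
    """Horvitz-Thompson style estimate of the unit-level average effect.

    totals[c] is the sum of outcomes in cluster c, sizes[c] its number of
    units, treated[c] is 1 if cluster c was assigned to treatment.
    """
    n_clusters = len(sizes)
    n_treated = sum(treated)
    n_units = sum(sizes)
    t1 = sum(t for t, d in zip(totals, treated) if d)
    t0 = sum(t for t, d in zip(totals, treated) if not d)
    return (n_clusters / n_treated * t1
            - n_clusters / (n_clusters - n_treated) * t0) / n_units


def normalized_estimate(totals, sizes, treated):
    """Ratio estimate: difference in unit-level means between the two arms."""
    t1 = sum(t for t, d in zip(totals, treated) if d)
    t0 = sum(t for t, d in zip(totals, treated) if not d)
    n1 = sum(s for s, d in zip(sizes, treated) if d)
    n0 = sum(s for s, d in zip(sizes, treated) if not d)
    return t1 / n1 - t0 / n0


sizes = [10, 20, 30, 40]
totals = [20.0, 60.0, 30.0, 80.0]   # invented cluster outcome totals
treated = [1, 1, 0, 0]

# Add 10 to every unit's outcome: each cluster total rises by 10 * size.
shifted = [t + 10.0 * s for t, s in zip(totals, sizes)]

ht_before = unnormalized_estimate(totals, sizes, treated)
ht_after = unnormalized_estimate(shifted, sizes, treated)
ratio_before = normalized_estimate(totals, sizes, treated)
ratio_after = normalized_estimate(shifted, sizes, treated)
print(ht_before, ht_after)
print(ratio_before, ratio_after)
```

The normalized estimate is unchanged by the shift, but its denominators (the numbers of units in the treated and control arms) vary from one randomization to the next, which is the source of the ratio estimation bias.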

The second lecture was on design considerations, focusing primarily on (1) optimal design for cluster randomized trials and (2) approaches to “re-randomization” and generalized pre-treatment covariate balancing. On the design of cluster randomized trials, Imbens reviewed an ongoing debate in the statistical literature, in which (1) Snedecor/Cochran and Box/Hunter/Hunter have proposed that in cluster RCTs with small or moderate numbers of clusters, we might have good reason to avoid stratification or pair matching, while (2) Donner/Klar have proposed that some stratification is almost always useful but that pair matching creates too many analytic problems to be worth it, while most recently, (3) Imai/King/Nall have proposed that pair-matching should always be done when possible. Imbens ultimately comes out on the side of Donner/Klar. His take is that the very existence of the debate is based largely on the confusion between testing and estimation: stratification (with pair matching as the limit) always makes estimation more efficient, but under some situations may result in reduced power for testing. This may seem paradoxical, but remember that power is a function of how a design affects the variability in point estimates (efficiency, or $latex V(\hat \beta)$) relative to the effects of the design on the variability of the estimates of the variance of point estimates (that is, $latex V[\hat V(\hat \beta)]$). This is apparent when you think of power in terms of the probability of rejecting the null using a t-statistic, for which the point estimate is in the numerator and square root of the variance estimate is in the denominator. Noise in both the numerator and denominator affects power. Even though stratification will always reduce the variability of the point estimate, in some cases the estimate of the variance of this point estimate can become unstable, undermining power.

That being the case, clear and tractable instances under which stratification leads to a loss in power arise only in very rigidly defined scenarios. Imbens demonstrated a case with constant treatment effects, homoskedasticity, all normal data, and an uninformative covariate in which a t-test loses power from stratification. But loosening any of these conditions led to ambiguous results, as did the replacement of the t-test with permutation tests. These pathological cases are not a reliable basis for selecting a design. Rather, a more reliable basis is the fact that stratification always improves efficiency. The upshot is that some pre-treatment stratification (i.e., blocking) is always recommended, and this is true whether the trial is cluster randomized or unit randomized.

The question then becomes, how much stratification? Here, Imbens disagreed with Imai/King/Nall by proposing that pair-matching introduces more analytical problems than it is worth relative to lower-order stratification. The problem with pair matching is that the within-pair variance for the estimated treatment effect is unidentified (you need at least two treated and two control units within a stratum to identify the estimated treatment effect variance). Thus, one must use the least upper bound identified by the data, which is driven by the between-pair variance. This leads to an overconservative estimator that tosses out efficiency gains from pairing. It also prevents examination of heterogeneity that may be of interest. Thus, Imbens’s recommendation was stratification up to at least two-treated and two-control units or clusters per stratum.

(I asked Imbens how he would propose to carry out the stratification, especially when data are of mixed types and sparseness was a potential problem. His recommendation was dimension reduction. Thus, one could use predicted baseline values from a rich covariate model, clustering algorithms constrained by baseline values, or slightly more flexible clustering. The goal is to reduce the stratification problem to one or a few manageable dimensions.)

The lecture on design also covered some new (and rather inchoate) ideas on “re-randomization” and generalized covariate balancing. The problem that he was addressing was the situation where you perform a randomization, but then note problematic imbalances in the resulting experiment. Should you rerandomize? If so, what should be the principles to guide you? Imbens’s take was that, yes, you should rerandomize, but that it should be done in a systematic manner. The process should (1) define, a priori, rejection rules for assignments, (2) construct the set of all possible randomizations (or an arbitrarily large sample of all possible randomizations), (3) apply the rejection rule, and then (4) perform the actual treatment assignment by randomizing within this set of acceptable randomizations. This type of “restricted randomization” typically uses covariate balance criteria to form the rejection rules; as such, it generalizes the covariate balancing objective that typically motivates stratification.
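
The four steps can be sketched as follows, assuming a single invented covariate, a fixed number of treated units, and a rejection rule based on the absolute difference in covariate means. The threshold `max_gap` and the function names are hypothetical choices made for illustration.

```python
import itertools
import random


def covariate_gap(x, treated_idx):
    """Absolute treated-minus-control difference in covariate means."""
    treated = set(treated_idx)
    x1 = [xi for i, xi in enumerate(x) if i in treated]
    x0 = [xi for i, xi in enumerate(x) if i not in treated]
    return abs(sum(x1) / len(x1) - sum(x0) / len(x0))


def acceptable_assignments(x, n_treated, max_gap):
    """Steps 1-3: fix the rule, enumerate all assignments, apply the rule."""
    everything = itertools.combinations(range(len(x)), n_treated)
    return [a for a in everything if covariate_gap(x, a) <= max_gap]


def restricted_randomization(x, n_treated, max_gap, seed=None):
    """Step 4: draw the actual assignment uniformly from the acceptable set."""
    accepted = acceptable_assignments(x, n_treated, max_gap)
    if not accepted:
        raise ValueError("rejection rule leaves no acceptable assignments")
    return random.Random(seed).choice(accepted)


# Invented baseline covariate for eight units; four will be treated.
x = [1, 2, 3, 4, 5, 6, 7, 8]
accepted = acceptable_assignments(x, n_treated=4, max_gap=0.5)
assignment = restricted_randomization(x, n_treated=4, max_gap=0.5, seed=0)
print(len(accepted), "of 70 assignments are acceptable")
print("treated units:", assignment)
```

Printing the size of the acceptable set is a useful habit, since randomization-based inference is later carried out over that set and a very strict rule can leave too few assignments to support a meaningful test.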

Some complications arise however. There is no clear basis on which one should define a minimum balance criterion. One could seek to maximize the balance that would obtain in an experiment. Doing so, one would effectively partition the sample into two groups that are maximally balanced, and randomly select one of the two partitions to be the treatment group. But in this case, the experiment would reduce to a cluster randomized experiment on only two clusters! There would be a strong external validity cost, and testing based on permutation would be degenerate. I asked Imbens about this, and his recommendation was that the rejection rules used to perform restricted randomization should maintain a “reasonable” degree of randomness in the assignment process. I think the right solution here is still an open question.

Imbens’s lectures are part of 3ie’s efforts toward standardization in randomized impact evaluations—standardization that greatly facilitates systematic reviews and meta-analyses of interventions, which is another of 3ie’s core activities (link). For those involved in RCTs and field experiments in development, I highly recommend that you engage 3ie’s work.
