# (technical) imputation, ipw, causal inference (Snowden et al, 2011, with disc)

In the advance access pages of the American Journal of Epidemiology, Jonathan M. Snowden, Sherri Rose, and Kathleen M. Mortimer have a nice tutorial on what they refer to as “G-computation” for causal inference with observational data (ungated link). An average causal effect for a binary treatment can be defined as the average of individual level differences between the outcome that obtains when one is in treatment versus in control. Because people are either in treatment or control, one of these two “potential” outcomes is unobserved, or missing (within subjects designs do not overcome this, because the ordering of treatment assignment is itself another dimension of treatment). Given this “missing data” problem, G-computation refers to fitting models to available data that allow you to impute (i.e., predict) unobserved counterfactual values. You can then use this complete set of counterfactual values to estimate various types of causal effects. The idea isn’t so new or groundbreaking, but many theoretical insights have been elucidated only recently. Snowden et al’s presentation focuses on effects that average over the entire population.
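The imputation logic can be sketched in a few lines. This is an illustrative simulation of my own (variable names and data-generating process are assumptions, not from Snowden et al): fit an outcome model on the observed data, predict each subject's outcome under treatment and under control, and average the individual-level differences.

```python
# A minimal sketch of g-computation for an average treatment effect,
# using simulated data (all names and the data-generating process are mine).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
w = rng.normal(size=n)                       # confounder
a = rng.binomial(1, 1 / (1 + np.exp(-w)))    # treatment depends on w
y = 2 * a + a * w + w + rng.normal(size=n)   # outcome, with effect modification

# Step 1: fit the outcome regression (the "Q-model") on observed data.
X = np.column_stack([a, w, a * w])
q_model = LinearRegression().fit(X, y)

# Step 2: impute both potential outcomes for every subject by setting
# the treatment indicator to 1 and then 0.
y1 = q_model.predict(np.column_stack([np.ones(n), w, w]))       # a = 1
y0 = q_model.predict(np.column_stack([np.zeros(n), w, 0 * w]))  # a = 0

# Step 3: average the individual-level differences.
ate = np.mean(y1 - y0)
print(ate)  # close to 2, the true average effect in this simulation
```

Note that the imputation step averages over the effect modification by w rather than ignoring it, which is exactly the property discussed below.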

These authors don’t cite it in their paper, but I think the most sophisticated application of this approach for a cross-sectional study is Jennifer Hill’s “Bayesian Nonparametric Modeling for Causal Inference” study (gated link). Hill uses the magical BART algorithm to fit a response surface and generate the imputations, from which various flavors of causal effect might be constructed (okay, full disclosure—Hill was one of my grad school teachers, but hey, BART is pretty cool). I understand that there has been a fair amount of application, or at least beta-testing, of such counterfactual imputation methods in longitudinal studies as well, although I don’t have references handy.

This approach is especially appealing when you anticipate lots of measurable effect modification that you want to average over in order to get average treatment effects. Actually, I think Snowden et al’s article does a good job of demonstrating how, in some cases, it’s not classic confounding and omitted variable bias per se that is the major concern, but rather effect modification and effect heterogeneity (i.e., interaction effects) associated with variables that also affect treatment assignment. Traditional regression is clumsy in dealing with that. As far as I know, conventional social science teaching, rooted as it is in constant effects models, does not have a catchy name for this kind of bias; maybe we can call it “heterogeneity bias.” Another thing that makes this kind of bias special relative to the usual kinds of confounding is that, as far as I understand, imputation-based strategies (like g-computation) that try to correct for it may in fact take advantage of measured heterogeneity associated with post-treatment variables. That is one of the reasons that these methods have appeal for longitudinal studies. (On this point, I’ll refer you to a little tutorial that I’ve written on a related set of methods—augmented inverse propensity weighting for attrition and missing data problems (link).)

Stijn Vansteelandt and Niels Keiding provide an invited commentary (gated link) on Snowden et al’s paper, and they make some really interesting points that I wanted to highlight. First, they note that imputation-based strategies such as g-computation have a long history in association with the concept of “standardization.” More important are two points that they make later in their commentary. The first is a point that Vansteelandt has made elsewhere, discussing the similarities and differences between imputation/standardization and inverse probability weighting:

The IPTW [inverse probability of treatment] approach is not commonly used in practice because of the traditional reliance on outcome-regression-based analyses, which tend to give more precise estimates. Its main virtue comes when the confounder distribution is very different for the exposed and unexposed subjects (i.e., when there is near violation of the assumption of the experimental treatment assignment), for then the predictions made by the G-computation approach may be prone to extrapolate the association between outcome and confounders from exposed to unexposed subjects, and vice versa. The ensuing extrapolation uncertainty is typically not reflected in confidence intervals for model-based standardized effect measures based on traditional outcome regression models, and thus the IPTW approach may give a more honest reflection of the overall uncertainty (provided that the uncertainty resulting from estimation of the weights is acknowledged) (19). A further advantage of the IPTW approach is that it does not require modeling exposure effect modification by covariates and may thus ensure a valid analysis, even when effect modification is ignored.
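For reference, the IPTW estimator the quote describes can be sketched as follows. This is my own illustrative simulation (a Hájek-style ratio version, one common implementation; variable names are assumptions), weighting each observed outcome by the inverse probability of the treatment actually received:

```python
# A minimal sketch of an IPTW (inverse probability of treatment weighting)
# estimator on simulated data (data-generating process and names are mine).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20000
w = rng.normal(size=n)                      # confounder
a = rng.binomial(1, 1 / (1 + np.exp(-w)))   # treatment depends on w
y = 2 * a + w + rng.normal(size=n)          # outcome

# Fit the propensity model (treatment on covariates) and get fitted values.
ps_model = LogisticRegression().fit(w.reshape(-1, 1), a)
ps = ps_model.predict_proba(w.reshape(-1, 1))[:, 1]

# Weighted (Hajek) means of the observed outcomes under each treatment level:
mu1 = np.average(y[a == 1], weights=1 / ps[a == 1])
mu0 = np.average(y[a == 0], weights=1 / (1 - ps[a == 0]))
ipw_ate = mu1 - mu0
print(ipw_ate)  # close to 2, though typically noisier than outcome regression
```

Notice that no outcome model is fit at all, which is why the estimator does not need the effect modification to be modeled correctly.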

I think this is an exceptionally important point, making clear that the apparent “inefficiency” of IP(T)W relative to imputation-based methods is, in some sense, illusory. Vansteelandt and Keiding also discuss one approach to combining imputation and IPW in order to get the best of both worlds:

We here propose a compromise that combines the benefits of G-computation/model-based standardization and of the IPTW approach. Its implementation is not more difficult than the implementation of these other approaches. As in the IPTW approach, the first step involves fitting a model of the exposure on relevant covariates; this would typically be a logistic regression model. The fitted values from this model express the probability of being exposed and are commonly called “propensity scores.” They are used to construct a weight for each subject, which is 1 divided by the propensity score if the subject is exposed and 1 divided by 1 minus the propensity score if the subject is unexposed. The second step involves fitting a model, the Q-model, for the outcome on the exposure and relevant covariates but using the aforementioned weights in the fitting procedure (e.g., using weighted least squares regression). Once estimated, the implementation detailed in the article by Snowden et al. (4) is followed; that is, counterfactual outcomes are predicted for each observation under each exposure regimen by plugging a 1 and then subsequently a 0 into the fitted regression model to obtain predicted counterfactual outcomes. Finally, differences (or ratios) between the average predicted counterfactual outcomes corresponding to different exposure regimens are calculated to arrive at a standardized mean difference (or ratio) (see reference 19 for a similar implementation in the context of attributable fractions). We refer to this compromise approach as doubly robust standardization. Here, the name doubly robust expresses that doubly robust standardized effect measures have 2 ways to give the right answer: when either the Q-model or the propensity score model is correctly specified, but not necessarily both.
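The quoted recipe is mechanical enough to sketch directly. Here is my own illustrative implementation on simulated data (variable names and the simulation are assumptions, not from the commentary): fit the propensity model, build the weights, fit the weighted Q-model, then standardize as in Snowden et al.

```python
# A sketch of "doubly robust standardization" following the quoted steps,
# on simulated data (data-generating process and names are mine).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n = 5000
w = rng.normal(size=n)                      # confounder
a = rng.binomial(1, 1 / (1 + np.exp(-w)))   # exposure depends on w
y = 2 * a + w + rng.normal(size=n)          # outcome

# Step 1: propensity score model (exposure on covariates), then weights:
# 1/ps for the exposed, 1/(1 - ps) for the unexposed.
ps = LogisticRegression().fit(w.reshape(-1, 1), a).predict_proba(
    w.reshape(-1, 1))[:, 1]
wt = np.where(a == 1, 1 / ps, 1 / (1 - ps))

# Step 2: fit the Q-model (outcome on exposure and covariates) by
# weighted least squares, using the inverse probability weights.
X = np.column_stack([a, w])
q_model = LinearRegression().fit(X, y, sample_weight=wt)

# Step 3: plug in a = 1 and then a = 0 to predict counterfactual outcomes,
# average each set, and take the difference (the standardized mean difference).
y1 = q_model.predict(np.column_stack([np.ones(n), w]))
y0 = q_model.predict(np.column_stack([np.zeros(n), w]))
dr_ate = np.mean(y1 - y0)
print(dr_ate)  # close to 2, the true effect in this simulation
```

The only change from plain g-computation is the `sample_weight` argument in the Q-model fit; that single weighting step is what buys the second chance at consistency.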

This approach has been demonstrated elsewhere—e.g., a recent paper by Vansteelandt and co-authors in the journal Methodology (ungated version; gated published version). I am intrigued by this because it differs from the manner in which I have implemented doubly robust estimators that combine weighting and imputation (again, see link). I wonder if there is a difference in practice.

# race, belonging, & responses to adversity (Walton et al, forthcoming)

Findings from a paper by Walton and Cohen (link), forthcoming in Science, have very intriguing implications for how differences in racial groups’ past life experiences affect how they interpret present adversity, with consequences for motivation and success:

“We all experience small slights and criticisms in coming to a new school” said Greg Walton, an assistant professor of psychology whose findings are slated for publication in the March 18 edition of Science. “Being a member of a minority group can make those events have a larger meaning,” Walton said. “When your group is in the minority, being rejected by a classmate or having a teacher say something negative to you could seem like proof that you don’t belong, and maybe evidence that your group doesn’t belong either. That feeling could lead you to work less hard and ultimately do less well.”

The paper presents results from a social experiment in which,

Those in the treatment group read surveys and essays written by upperclassmen of different races and ethnicities describing the difficulties they had fitting in during their first year at school. The subjects in the control group read about experiences unrelated to a sense of belonging…The test subjects in the treatment group were then asked to write essays about why they thought the older college students’ experiences changed. The researchers asked them to illustrate their essays with stories of their own lives, and then rewrite their essays into speeches that would be videotaped and could be shown to future students. The point was to have the test subjects internalize and personalize the idea that adjustments are tough for everyone.

Outcomes:

The researchers tracked their test subjects during their sophomore, junior and senior years. While they found the social-belonging exercise had virtually no impact on white students, it had a significant impact on black students….[G]rade point averages of black students who participated in the exercise went up by almost a third of a grade between their sophomore and senior years. And 22 percent of those students landed in the top 25 percent of their graduating class, while only about 5 percent of black students who didn’t participate in the exercise did that well. At the same time, half of the black test subjects who didn’t take part in the exercise were in the bottom 25 percent of their class. Only 33 percent of black students who went through the exercise did that poorly….[T]he black students who were in the treatment group reported a greater sense of belonging…They also said they were happier and were less likely to spontaneously think about negative racial stereotypes. And they seemed healthier: 28 percent said they visited a doctor recently, as compared to 60 percent in the control group.

Of course we need to be careful in drawing conclusions about the “effects” of race, for reasons that have been discussed at length by proponents of the “manipulability” theory of causation (link, a theory that I find persuasive). In brief, race was not subject to experimental manipulation here. Our willingness to believe this interpretation of the results is based on plausible theoretical claims, but does not cleanly arrive as a result of the experimental design. Nonetheless, the results are quite suggestive of how one’s experience as a member of a stigmatized group can affect how one interprets adversity.

HT: Kim Yi Dionne (blog), The Situationist (link).

# hidden confounding in survey experiments

I had the opportunity to participate in a fun seminar via Skype with faculty and students at Uppsala University’s department of peace and conflict research. We were discussing exciting new avenues for using experimental methods to study microfoundations of conflict and IR theories. The discussion was led by Allan Dafoe (Berkeley, visiting at Uppsala), who is doing really interesting work on reputation and strategic interaction (link).

An interesting point on “hidden confounding” in survey experiments came up that I don’t think gets enough play in analyses of survey experiments, so I thought I’d relay it here as a reference and also to see if others have any input. A common approach in a survey experiment is to provide subjects with hypothetical scenarios. The experimental treatments then consist of variations on the content of the scenarios.

What makes this kind of research so intriguing is that it would seem that you can obtain exogenous variation in circumstances that rarely obtains in the real world. Thus, if your experiment involves a scenario about an international negotiation over a dispute, you could vary, say, the regimes from which the negotiators come in a manner that does not occur frequently in the real world.

The problem is that subjects come to a survey experiment with prior beliefs about “what things go with what”—that is, about how salient features correlate. In our example, people will tend to associate regime types with things like national wealth or region. In that case, by manipulating the negotiators’ regime types in the experiment, you are implicitly changing people’s beliefs about other features of the countries from which the negotiators come. You can try to hold these things “constant”—e.g., by having one treatment where negotiator A comes from a “rich democracy” and another where negotiator A comes from a “rich dictatorship”—but to the extent that you are creating a scenario that departs from what typically occurs in the real world, you might be causing the subject to wonder whether we are talking about some “unusual” circumstance. If so, the subject might apply a different evaluative framework than what the subject would apply to “usual” circumstances. Thus, you are obtaining a causal estimate that is dependent on the frame of reference, which may not be generalizable.

It’s a bit thorny, so what are solutions? Ironically, it seems to me that one solution would be to focus the experiment on treatments that are “plausibly exogenous.” One could focus on conditions that respond easily to choices, and where choices in either direction are conceivable. Or, one could focus the experiment on things that can vary randomly—like weather, most famously. I find this ironic because it seems that the survey experiment doesn’t get us very far from what we attempt to do with natural experiments. It would seem that the sweet spot for survey experiments would be for things that we are pretty sure could occur as a natural experiment, but either haven’t occurred often enough or haven’t been measured, in which case we can’t just study the natural experiment directly. Applying this rule would greatly limit the areas of application for survey experiments, but I think this formula would result in survey experiments that have more credible causal interpretations.

(By the way, Allan clued me into a discussion of this very point in a current working paper by Michael Tomz and Jessica Weeks: link.)

UPDATE: Allan provided this initial reaction:

I actually think the problem with survey experiments is a bit worse than you describe. It’s not just that confounding can be avoided in survey experiments by focussing on those factors that are plausibly manipulable; one has to vary factors that are in the population typically uncorrelated with other factors of interest, given the scenario. That is, one wants that the respondents believe Pr(Z|X1)=Pr(Z|X2) where X1 and X2 are two values of the treatment condition, and Z is any other factor of potential relevance that is not a consequence of treatment. For example, the decision of whether the US should stay in Afghanistan (X1) is plausibly manipulable and could plausibly go either way; Obama could decide to leave (X2). But even though such a counterfactual is plausible and could involve a hypothetical manipulation, we are unlikely to believe that Pr(Z|X1)=Pr(Z|X2), where Z could be the domestic support for war, or the strength of the US economy, or the resilience of the Taliban. So perhaps this implies that the only treatments that will not generate information leakage are either (1) those that are exogenous to begin with in the world (which are thus relatively easy to study using observational data), or (2) those that provide a compelling hypothetical natural experiment to account for the variation. So in this sense—perhaps I am actually just restating your main point—survey experiments only generate clear causal inferences if the key variation arises from a credible (hypothetical) natural experiment.