In the advance access pages of the American Journal of Epidemiology, Jonathan M. Snowden, Sherri Rose, and Kathleen M. Mortimer have a nice tutorial on what they refer to as “G-computation” for causal inference with observational data (ungated link). An average causal effect for a binary treatment can be defined as the average of individual-level differences between the outcome that obtains when one is in treatment and the outcome that obtains when one is in control. Because each person is in either treatment or control, one of these two “potential” outcomes is unobserved, or missing (within-subject designs do not overcome this, because the ordering of treatment assignment is itself another dimension of treatment). Given this “missing data” problem, G-computation refers to fitting models to the available data that allow you to impute (i.e., predict) the unobserved counterfactual values. You can then use this complete set of counterfactual values to estimate various types of causal effects. The idea isn’t new or groundbreaking, but many theoretical insights have been elucidated only recently. Snowden et al.’s presentation focuses on effects that average over the entire population.
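To make the impute-then-average logic concrete, here is a minimal sketch in Python with NumPy. It is my own illustration, not code from Snowden et al.: the simulated data, the linear outcome model with a treatment-by-covariate interaction, and the function name `g_computation_ate` are all assumptions chosen for simplicity.

```python
import numpy as np


def g_computation_ate(y, t, x):
    """Estimate the average treatment effect by G-computation.

    Fits a linear outcome model with a treatment-by-covariate interaction
    (an assumed functional form), then imputes both potential outcomes
    for every unit and averages their difference.
    """
    def design(t_col):
        # Columns: intercept, treatment, covariate, treatment x covariate
        return np.column_stack([np.ones_like(x), t_col, x, t_col * x])

    beta, *_ = np.linalg.lstsq(design(t), y, rcond=None)
    y1_hat = design(np.ones_like(x)) @ beta    # everyone set to treatment
    y0_hat = design(np.zeros_like(x)) @ beta   # everyone set to control
    return float(np.mean(y1_hat - y0_hat))


# Simulated observational data (purely illustrative).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                          # measured confounder
p_treat = 1.0 / (1.0 + np.exp(-x))              # treatment depends on x
t = (rng.random(n) < p_treat).astype(float)
# Individual effect is 2 + x, so the true average effect is 2 (E[x] = 0).
y = 1.0 + 0.5 * x + (2.0 + x) * t + rng.normal(size=n)

ate_hat = g_computation_ate(y, t, x)
naive = y[t == 1].mean() - y[t == 0].mean()     # ignores confounding
print(f"G-computation: {ate_hat:.2f}   naive difference: {naive:.2f}")
```

Because the outcome model here happens to match the simulated data-generating process, the imputed estimate should land near 2 while the naive difference in means overshoots. With a misspecified outcome model that guarantee goes away.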

These authors don’t cite it in their paper, but I think the most sophisticated application of this approach for a cross-sectional study is Jennifer Hill’s “Bayesian Nonparametric Modeling for Causal Inference” study (gated link). Hill uses the magical BART (Bayesian Additive Regression Trees) algorithm to fit a response surface and generate the imputations, from which various flavors of causal effect might be constructed (okay, full disclosure: Hill was one of my grad school teachers, but hey, BART *is* pretty cool). I understand that there has been a fair amount of application, or at least beta-testing, of such counterfactual imputation methods in longitudinal studies as well, although I don’t have references handy.

This approach is especially appealing when you anticipate lots of measurable effect modification that you want to average over in order to get average treatment effects. Actually, I think Snowden et al.’s article does a good job of demonstrating how, in some cases, it’s not classic confounding and omitted variable bias per se that is the major concern, but rather effect modification and effect heterogeneity (i.e., interaction effects) associated with variables that also affect treatment assignment. Traditional regression is clumsy in dealing with that. As far as I know, conventional social science teaching, rooted as it is in constant effects models, does not have a catchy name for this kind of bias; maybe we can call it “heterogeneity bias.” Another thing that makes this kind of bias special relative to the usual kinds of confounding is that, as far as I understand, imputation-based strategies (like G-computation) that try to correct for it may in fact take advantage of measured heterogeneity associated with *post*-treatment variables. That is one of the reasons that these methods have appeal for longitudinal studies. (On this point, I’ll refer you to a little tutorial that I’ve written on a related set of methods: augmented inverse propensity weighting for attrition and missing data problems (link).)
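A toy simulation shows what this “heterogeneity bias” can look like. Everything in the sketch below is an assumption made for illustration (a single binary effect modifier, the treatment probabilities, the effect sizes). A regression that adjusts for the modifier additively is compared with standardization over the modifier’s distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40000
x = (rng.random(n) < 0.5).astype(float)        # binary effect modifier
p_treat = np.where(x == 1, 0.9, 0.5)           # x also drives treatment
t = (rng.random(n) < p_treat).astype(float)
tau = np.where(x == 1, 3.0, 1.0)               # effect: 1 if x = 0, 3 if x = 1
y = 0.5 * x + tau * t + rng.normal(size=n)     # true ATE = 0.5*1 + 0.5*3 = 2

# Constant-effects regression: y on intercept, t, and x (no interaction).
A = np.column_stack([np.ones(n), t, x])
coef_additive = float(np.linalg.lstsq(A, y, rcond=None)[0][1])

# Standardization: stratum-specific differences in means, averaged over
# the population distribution of x.
strata = (0.0, 1.0)
effects = {
    v: y[(x == v) & (t == 1)].mean() - y[(x == v) & (t == 0)].mean()
    for v in strata
}
ate_standardized = float(sum(effects[v] * np.mean(x == v) for v in strata))

print(f"additive regression coefficient: {coef_additive:.2f}")
print(f"standardized estimate:           {ate_standardized:.2f}")
```

Note that x is fully “controlled for” in the additive regression, so this is not omitted variable bias. The regression coefficient weights each stratum by how much treatment varies within it, which in this setup favors the x = 0 stratum and pulls the estimate toward roughly 1.5, while standardization recovers something close to 2.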

Stijn Vansteelandt and Niels Keiding provide an invited commentary (gated link) on Snowden et al.’s paper, and they make some really interesting points that I wanted to highlight. First, they note that imputation-based strategies such as G-computation have a long history in association with the concept of “standardization.” More important are two points that they make later in their commentary. The first is a point that Vansteelandt has made elsewhere, discussing the similarities and differences between imputation/standardization and inverse probability weighting:

The IPTW [inverse probability of treatment] approach is not commonly used in practice because of the traditional reliance on outcome-regression-based analyses, which tend to give more precise estimates. Its main virtue comes when the confounder distribution is very different for the exposed and unexposed subjects (i.e., when there is near violation of the assumption of the experimental treatment assignment), for then the predictions made by the G-computation approach may be prone to extrapolate the association between outcome and confounders from exposed to unexposed subjects, and vice versa. The ensuing extrapolation uncertainty is typically not reflected in confidence intervals for model-based standardized effect measures based on traditional outcome regression models, and thus the IPTW approach may give a more honest reflection of the overall uncertainty (provided that the uncertainty resulting from estimation of the weights is acknowledged) (19). A further advantage of the IPTW approach is that it does not require modeling exposure effect modification by covariates and may thus ensure a valid analysis, even when effect modification is ignored.

I think this is an exceptionally important point, making clear that the apparent “inefficiency” of IP(T)W relative to imputation-based methods is, in some sense, illusory. Vansteelandt and Keiding also discuss one approach to combining imputation and IPW in order to get the best of both worlds:

We here propose a compromise that combines the benefits of G-computation/model-based standardization and of the IPTW approach. Its implementation is not more difficult than the implementation of these other approaches. As in the IPTW approach, the first step involves fitting a model of the exposure on relevant covariates; this would typically be a logistic regression model. The fitted values from this model express the probability of being exposed and are commonly called “propensity scores.” They are used to construct a weight for each subject, which is 1 divided by the propensity score if the subject is exposed and 1 divided by 1 minus the propensity score if the subject is unexposed. The second step involves fitting a model, the Q-model, for the outcome on the exposure and relevant covariates but using the aforementioned weights in the fitting procedure (e.g., using weighted least squares regression). Once estimated, the implementation detailed in the article by Snowden et al. (4) is followed; that is, counterfactual outcomes are predicted for each observation under each exposure regimen by plugging a 1 and then subsequently a 0 into the fitted regression model to obtain predicted counterfactual outcomes. Finally, differences (or ratios) between the average predicted counterfactual outcomes corresponding to different exposure regimens are calculated to arrive at a standardized mean difference (or ratio) (see reference 19 for a similar implementation in the context of attributable fractions). We refer to this compromise approach as doubly robust standardization. Here, the name doubly robust expresses that doubly robust standardized effect measures have 2 ways to give the right answer: when either the Q-model or the propensity score model is correctly specified, but not necessarily both.
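The steps in the quoted passage can be sketched in a few lines of Python with NumPy. This is my own reading of the recipe, not code from Vansteelandt and Keiding. The simulated data, the helper names (`fit_logistic`, `q_design`, `standardized_difference`), and the deliberately misspecified additive Q-model are all assumptions made for illustration.

```python
import numpy as np


def fit_logistic(X, t, n_iter=25):
    """Logistic regression fitted by Newton-Raphson; returns coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (t - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def q_design(t, x):
    # Deliberately additive Q-model: intercept, exposure, covariate.
    # It omits the exposure-by-covariate interaction present in the data.
    return np.column_stack([np.ones_like(x), t, x])


def standardized_difference(y, t, x, weights):
    """Fit the Q-model by weighted least squares, then standardize."""
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(q_design(t, x) * sw[:, None], y * sw, rcond=None)
    y1_hat = q_design(np.ones_like(x), x) @ beta    # plug in exposure = 1
    y0_hat = q_design(np.zeros_like(x), x) @ beta   # plug in exposure = 0
    return float(np.mean(y1_hat) - np.mean(y0_hat))


# Simulated data (illustrative): the effect is 2 + x, so the true ATE is 2.
rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-(x - 1.0)))).astype(float)
y = 1.0 + 0.5 * x + (2.0 + x) * t + rng.normal(size=n)

# Step 1: propensity scores from a logistic model, then the weights.
X_ps = np.column_stack([np.ones_like(x), x])
ps = 1.0 / (1.0 + np.exp(-(X_ps @ fit_logistic(X_ps, t))))
w = np.where(t == 1, 1.0 / ps, 1.0 / (1.0 - ps))

# Steps 2-4: weighted Q-model fit, counterfactual prediction, difference.
dr_estimate = standardized_difference(y, t, x, w)
unweighted = standardized_difference(y, t, x, np.ones(n))
print(f"doubly robust: {dr_estimate:.2f}   unweighted Q-model: {unweighted:.2f}")
```

In this setup the propensity model is correctly specified and the Q-model is not, so the weighted fit should stay near 2 while the unweighted fit of the same Q-model drifts upward. The double robustness of the weighted fit relies on the Q-model containing an intercept and the exposure term, which it does here.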

This approach has been demonstrated elsewhere, e.g., in a recent paper by Vansteelandt and co-authors in the journal *Methodology* (ungated version, gated published version). I am intrigued by this because it differs from the manner in which I have implemented doubly robust estimators that combine weighting and imputation (again, see link). I wonder if there is a difference in practice.

Cyrus, glad you enjoyed the paper! I think this is the first time I’ve seen my work on a blog. I’ve done quite a bit of causal inference teaching as a graduate student, and students always seemed surprised at how relatively easy it is to get a basic understanding of g-computation/MLE. My coauthors approached me with the idea to write a tutorial-type paper on the topic as they had experienced similar things. So that was the original intent of the paper: not necessarily to advocate for a certain method, but to provide an accessible introduction and cursory explanation of the method.

We discuss some of the issues you highlight from Vansteelandt and Keiding in our reply to their invited commentary (ungated bit.ly/erJ22B). In response to the first point you highlight, extrapolation/positivity violation is an issue for all estimators, g-computation/MLE included. IPTW can have serious issues with inefficiency since it is not a substitution estimator; it therefore ignores global constraints, harming finite sample efficiency (particularly when the data has sparsity issues). MLE is a substitution estimator; however, it suffers from a nonoptimal bias-variance tradeoff for the parameter of interest. (Using a targeted updating step, as in targeted MLE methodology, reduces the bias of the nontargeted MLE.) The doubly robust standardization Vansteelandt and Keiding discuss can be found in Kang & Schafer 2007 and Robins et al. 2007 (see our reply link above for full citations). An upcoming chapter first authored by Jas Sekhon in our forthcoming Targeted Learning book includes additional Kang & Schafer simulations with more estimators, including other doubly robust estimators (it reproduces their simulations exactly and adds modifications with increased positivity violations). As these simulations indicate, and the theory backs up, not all doubly robust estimators are created equal! An expanded version of this chapter is also under review at a journal, and assuming it is accepted, I’d be happy to come back and leave you a link if you’re interested in reading it, especially since the book doesn’t come out until June.

Since intent can be hard to assess online, I will explicitly state that my comments are meant only to contribute to the discussion and I hope they would not be misinterpreted in any other way! If I misunderstood something you wrote I apologize in advance. (Can you tell this is the first time I am replying to a blog post?) Sherri