A little rant on the sad state of knowledge about post-treatment bias: for some reason I still see a lot of people using control strategies (typically, regression) that condition on post-treatment outcomes, that is, outcomes intermediate between the treatment and the endpoint outcome of interest. I have heard people who do so say that this is somehow necessary to show that the “effects” they estimate in the reduced-form regression of the endpoint outcome on treatment are not spurious. Of course this is incorrect. Showing that the relationship “goes away” after controlling for the intermediate outcome does not indicate that the effect is spurious. It could just as well be that the treatment affects the endpoint outcome mostly through the intermediate outcome.
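To make that last point concrete, here is a minimal simulation of my own (not from the original demonstration): the treatment’s entire effect runs through the intermediate outcome, so conditioning on the intermediate outcome drives the treatment coefficient to zero even though the effect is entirely real.

```r
set.seed(1)
N <- 1000
z <- rbinom(N, 1, .5)   # randomized treatment
d <- z + rnorm(N)       # intermediate outcome: carries all of z's effect
y <- d + rnorm(N)       # endpoint outcome: depends on z only through d

coef(lm(y ~ z))["z"]      # total effect of z, close to the true value of 1
coef(lm(y ~ z + d))["z"]  # close to 0 after conditioning on d, yet nothing is spurious
```

The effect “goes away” in the second regression only because the post-treatment control soaks it up, not because it was spurious.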
I have also heard people say that controlling for intermediate, post-treatment outcomes is somehow “conservative” because controlling for the post-treatment outcome “will only take away from the association” between the treatment and the outcome. Of course, this is also incorrect. Controlling for a post-treatment variable can easily be anti-conservative, producing a coefficient on the treatment that is substantially larger than the actual treatment effect. This happens when the intermediate outcome exhibits a “suppression” effect, for example, when the treatment has a negative association with the intermediate outcome, but the intermediate outcome then positively affects the endpoint outcome. Here is a straightforward demonstration (done in R):
```r
N <- 200
z <- rbinom(N, 1, .5)   # binary treatment
ed <- rnorm(N)
d <- -z + ed            # intermediate (post-treatment) outcome
ey <- rnorm(N)
y <- z + d + ey         # endpoint outcome
print(coef(summary(lm(y ~ z))), digits = 2)
```

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.0049       0.14   0.035     0.97
z            -0.1109       0.20  -0.555     0.58
```

```r
print(coef(summary(lm(y ~ z + d))), digits = 2)
```

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.078      0.093   -0.84  4.0e-01
z              1.034      0.149    6.95  5.3e-11
d              1.046      0.064   16.23  3.6e-38
```

In the example above, z is the treatment variable, y is the endpoint outcome, and d is an intermediate outcome. (The data generating process resembles a binomial assignment experiment.) By construction, the total causal effect of z on y is zero: the direct effect (+1) is exactly offset by the effect transmitted through d (-1 times +1). The first regression properly estimates this total effect, which is indistinguishable from 0. The second regression shows what goes wrong when we control for the post-treatment intermediate outcome: the coefficient on z jumps to about 1, with a very low p-value!
UPDATE
A question I received offline was along the lines of, “What if you control for the post-treatment variable and your effect estimate doesn’t change? Surely this strengthens the case that what you’ve found is not spurious.” I don’t think that is correct. The case for having a well-identified effect estimate rests only on having properly addressed pre-treatment confounding. Showing that a post-treatment variable does not alter the estimate has no bearing on whether this has been achieved. Thus, post-treatment conditioning is pretty much useless for demonstrating that a causal relationship is not spurious.
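A quick sketch of my own makes the point: in the simulation below, an unobserved pre-treatment confounder u drives both treatment and outcome, the true treatment effect is zero, and the post-treatment variable d has no effect on y. Controlling for d leaves the estimate essentially unchanged, yet the estimate is badly biased all the same, so the stability of the estimate tells us nothing about confounding.

```r
set.seed(2)
N <- 2000
u <- rnorm(N)                 # unobserved pre-treatment confounder
z <- rbinom(N, 1, plogis(u))  # treatment depends on u
d <- z + rnorm(N)             # post-treatment variable with no effect on y
y <- u + rnorm(N)             # true effect of z on y is zero

b1 <- coef(lm(y ~ z))["z"]      # biased well away from zero by u
b2 <- coef(lm(y ~ z + d))["z"]  # nearly identical; stability proves nothing
c(b1, b2)
```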
The one case where post-treatment conditioning provides some causal content is mediation analysis. But there, exclusion-restriction or effect-homogeneity assumptions have to hold; otherwise the mediation analysis may produce misleading results. On these points, I suggest looking at this very clear paper by Green, Ha, and Bullock (ungated preprint). A more elaborate treatment (though not quite as intuitive in its presentation) is this paper by Imai, Keele, Tingley, and Yamamoto (working paper).