Tracy L. Nolen and Michael G. Hudgens have a new paper posted to JASA’s preprint website (gated link, ungated preprint) on randomization inference in situations where intermediate post-treatment outcomes are important in defining causal effects. Their motivating example is one where we want to see how a medical treatment affects people’s recovery from an infection, but infection status is something that is itself affected by the treatment. Other examples where post-treatment outcomes are important are estimating causal effects under noncompliance and related instrumental variables methods (classic paper link), as well as the “truncation by death” situation (link), for which causal effects are only meaningful for certain endogenously revealed subpopulations. In these cases, principal strata refer to subpopulations that are distinguished by intermediate potential outcomes. The key contribution here is to develop exact tests for principal strata estimation. The authors want to use exact tests, rather than asymptotic-frequentist or Bayesian approaches, because exact tests have better type-I/type-II error performance in small samples, and many principal strata situations involve making inferences on small subgroups of possibly already-small subject pools.

To formalize their argument a bit, let $Z$ refer to a subject’s treatment status, $S$ refer to a subject’s infection status (observed after treatment), and $Y(S, Z)$ refer to a subject’s outcome given infection and treatment statuses. We are interested in the effect of treatment on progress after infection:

$$E[Y(S=1, Z=1) - Y(S=1, Z=0)].$$

(Clearly this estimand is only meaningful for those that could be infected under either condition.) But,

$$E[Y(S=1, Z=1)] \neq E[Y_i \mid S_i = 1, Z_i = 1]$$

and

$$E[Y(S=1, Z=0)] \neq E[Y_j \mid S_j = 1, Z_j = 0]$$

for $i$ in treated and $j$ in control, because $S$ is endogenous to $Z$. Thus, the expression,

$$E[Y_i \mid S_i = 1, Z_i = 1] - E[Y_j \mid S_j = 1, Z_j = 0]$$

for $i$ in treated and $j$ in control does not estimate the effect of interest. In terms of principal strata, $i$ is an element in a sample from the mixed population of people for whom $S = 1$ only when $Z = 1$ (the “harmed” principal stratum) or $S = 1$ irrespective of $Z$ (the “always infected” principal stratum), while $j$ is an element in a sample from the mixed population of people for whom $S = 1$ only when $Z = 0$ (“protected”) or $S = 1$ irrespective of $Z$ (“always infected”). ~~The two mixed populations are thus different and it is reasonable to expect that treatment effects also differ across these two subpopulations. For example, imagine that “the harmed” are allergic to the treatment but otherwise very healthy, and so the treatment not only causes the infection but leads to an allergic reaction that catastrophically interferes with their bodies’ responses to infection. In contrast, suppose the “protected” are in poor health; their infection status may respond to treatment, but their general prognosis is unaffected and is always bad. Finally, suppose the “always infected” do not respond to treatment in their outcomes either. Here, on average, the treatment is detrimental due to the allergic response among “the harmed”. But if one estimates a treatment effect by comparing these two subpopulations, one may find that the treatment is on average benign. This is a made-up example, but does not seem so far-fetched at all.~~ [**Update**: Upon further reflection, I realize that the preceding illustration that appeared in the original post had a problem: it failed to appreciate that the causal effect of interest here is only properly defined for members of the “always infected” population. The point about bias still holds, but it arises because one is not simply taking a difference in means between treated and control “always infected” groups, but rather between the two mixed groups described above. 
The problem, then, is to find a way to isolate the comparison between treated and control “always infected” groups, removing the taint introduced from the presence of the “harmed” subgroup among the treated and the “protected” subgroup among the control. This is interesting, because it is precisely the *opposite* of what one would want to isolate in a LATE IV analysis. Nonetheless, the identifying condition is the same — as discussed below, it hinges on monotonicity.]
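To make the bias concrete, here is a small made-up population, sketched in Python. Nothing in it comes from the paper: the stratum sizes, the outcome values, and the names `Unit`, `make_population`, `always_infected_effect`, and `naive_contrast` are all assumptions chosen for illustration. Each unit carries its infection status and outcome under both treatment and control, so we can compare the effect among the “always infected” with what the naive infected-versus-infected comparison would recover on average.

```python
from statistics import mean
from typing import NamedTuple, Optional


class Unit(NamedTuple):
    stratum: str
    s1: int               # infection status if treated
    s0: int               # infection status if control
    y1: Optional[float]   # post-infection outcome if treated (None if not infected)
    y0: Optional[float]   # post-infection outcome if control (None if not infected)


def make_population():
    # Hypothetical population; all numbers are invented for illustration.
    return (
        [Unit("always infected", 1, 1, 5.0, 5.0)] * 4   # no treatment effect
        + [Unit("protected", 0, 1, None, 9.0)] * 4      # infected only under control
        + [Unit("harmed", 1, 0, 2.0, None)] * 2         # infected only under treatment
        + [Unit("never infected", 0, 0, None, None)] * 2
    )


def always_infected_effect(pop):
    # The effect is only defined for units infected under both conditions.
    ai = [u for u in pop if u.s1 == 1 and u.s0 == 1]
    return mean(u.y1 - u.y0 for u in ai)


def naive_contrast(pop):
    # What the infected-treated vs. infected-control comparison targets:
    # a contrast between two different mixtures of strata.
    treated_infected = [u.y1 for u in pop if u.s1 == 1]   # always infected + harmed
    control_infected = [u.y0 for u in pop if u.s0 == 1]   # always infected + protected
    return mean(treated_infected) - mean(control_infected)


if __name__ == "__main__":
    pop = make_population()
    print(always_infected_effect(pop))  # 0.0
    print(naive_contrast(pop))          # -3.0
```

In this toy population the effect among the “always infected” is zero, yet the naive contrast is -3.0, purely because the “harmed” and “protected” units sit on opposite sides of the comparison.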

The authors construct an exact test for the null hypothesis of no treatment effect within a given principal stratum under a *monotonicity* assumption that states that the treatment can only affect infection status in one direction (essentially, this is the same “no defier” monotonicity assumption that Angrist, Imbens, and Rubin use to identify the LATE IV estimator). This rules out the possibility of anyone being in the “harmed” group. The assumption thus allows you to bound the number of people in each of the remaining principal strata (“always infected”, “protected”, and “never infected”). Then, an exact test can be carried out that computes the maximum exact test p-value under all principal stratum assignments consistent with these bounds. The analysis can assess consequences of violations of monotonicity through a sensitivity analysis: the proportion of “harmed” can be fixed by the analysts and the test re-computed to assess robustness.
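The maximization step can be sketched as follows. This is a brute-force illustration under my own simplifying assumptions, not the authors’ implementation: monotonicity is taken to mean that every infected treated subject is “always infected”, while each infected control subject is either “always infected” or “protected”; the test statistic is an absolute difference in means; and the admissible counts of “always infected” controls are passed in by the analyst through a hypothetical `m_candidates` argument (each count must be at least 1) rather than derived from the bounds in the paper.

```python
from itertools import combinations
from statistics import mean


def exact_p_value(treated, control):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(treated) + list(control)
    n, n_t = len(pooled), len(treated)
    observed = abs(mean(treated) - mean(control))
    hits = total = 0
    # Enumerate every way of relabeling n_t of the pooled units as treated.
    for idx in combinations(range(n), n_t):
        chosen = set(idx)
        t = [pooled[i] for i in chosen]
        c = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(mean(t) - mean(c)) >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / total


def max_p_value(y_treated_infected, y_control_infected, m_candidates):
    """Largest exact p-value over candidate 'always infected' control subsets.

    m_candidates: hypothetical analyst-supplied counts (each >= 1) of infected
    controls who could belong to the 'always infected' stratum.
    """
    worst = 0.0
    for m in m_candidates:
        # Each size-m subset is one principal stratum assignment to consider.
        for subset in combinations(y_control_infected, m):
            worst = max(worst, exact_p_value(y_treated_infected, subset))
    return worst


if __name__ == "__main__":
    treated = [4.0, 5.0, 6.0]
    control = [0.0, 1.0, 2.0, 3.0]
    print(max_p_value(treated, control, m_candidates=[3]))
```

The enumeration grows combinatorially, so this only runs in reasonable time for the very small subgroups that motivate exact tests in the first place.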

An alternative approach used in the current literature is what is called the “burden of illness” approach (BOI). BOI collapses intermediate and endpoint outcomes into a single index and then carries out an ITT analysis on this index. The authors find that their exact test on principal strata has substantially more power than BOI ITT analysis. The authors also show that Rosenbaum (2002) style covariate adjustment can be applied (regress outcomes on covariates, perform exact test with residuals), and the usual inverted exact test confidence intervals can be used, both with no added complication.
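The covariate adjustment step (regress outcomes on covariates, then run the exact test on the residuals) can be sketched like this. It is a minimal illustration, assuming a single covariate, a hand-rolled least-squares fit that ignores treatment status, and an absolute difference in mean residuals as the test statistic; the function names are mine, not Rosenbaum’s or the authors’.

```python
from itertools import combinations
from statistics import mean


def ols_residuals(y, x):
    """Residuals from a one-covariate least-squares fit of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def adjusted_exact_p_value(y, x, z):
    """Exact permutation p-value computed on covariate-adjusted residuals.

    y: outcomes, x: a single covariate, z: 0/1 treatment indicators.
    """
    resid = ols_residuals(y, x)  # the fit does not use z
    n = len(y)
    treated_idx = [i for i in range(n) if z[i] == 1]

    def stat(idx):
        chosen = set(idx)
        t = [resid[i] for i in chosen]
        c = [resid[i] for i in range(n) if i not in chosen]
        return abs(mean(t) - mean(c))

    observed = stat(treated_idx)
    # Residuals stay fixed; only the treatment labels are permuted.
    stats = [stat(idx) for idx in combinations(range(n), len(treated_idx))]
    return sum(s >= observed - 1e-12 for s in stats) / len(stats)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    z = [1, 1, 1, 0, 0, 0]
    y = [xi + 10.0 * zi for xi, zi in zip(x, z)]
    print(adjusted_exact_p_value(y, x, z))
```

Because the residuals are computed once and held fixed across permutations, the test remains a plain randomization test; the regression only serves to soak up outcome variation explained by the covariate.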

It’s a nice paper, related to some work that some colleagues and I are currently doing on randomization inference. Exact tests are fine for null hypothesis testing, but I am not at all sold on constant-effects-based inverted exact tests for confidence intervals. Certainly for moderate or large samples, there is no reason at all to use such tests, which can miss the mark badly. Maybe for small samples though you don’t really have a choice.
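For concreteness, the kind of constant-effects inversion I have in mind can be sketched as below. This is an illustration under stated assumptions, not the procedure from the paper: it assumes a constant additive effect, a difference-in-means statistic, and a hypothetical analyst-chosen `grid` of candidate effect values.

```python
from itertools import combinations
from statistics import mean


def exact_p_value(treated, control):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(treated) + list(control)
    n, n_t = len(pooled), len(treated)
    observed = abs(mean(treated) - mean(control))
    hits = total = 0
    for idx in combinations(range(n), n_t):
        chosen = set(idx)
        t = [pooled[i] for i in chosen]
        c = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(mean(t) - mean(c)) >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / total


def inverted_interval(treated, control, grid, alpha=0.05):
    """Range of constant additive effects tau that the exact test does not reject.

    For each candidate tau, subtract it from the treated outcomes; under the
    constant-effects model this recovers control potential outcomes, so the
    sharp null of no effect can be tested on the adjusted data.
    Returns the hull (min, max) of the retained grid values, or None if the
    test rejects every value on the grid.
    """
    kept = [
        tau for tau in grid
        if exact_p_value([y - tau for y in treated], control) > alpha
    ]
    return (min(kept), max(kept)) if kept else None


if __name__ == "__main__":
    treated = [11, 12, 13, 14]
    control = [1, 2, 3, 4]
    print(inverted_interval(treated, control, grid=range(0, 21)))
```

Everything in the interval leans on the step that subtracts a single `tau` from every treated outcome, which is exactly where the constant-effects assumption enters.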

Mark M. Fredrickson: I’d like to better understand this comment:

Is your argument that we should be using a different model that connects potential outcomes Yc and Yt? E.g. Yt = tau * Yc. Are you looking to capture more than the location shift of the additive model (Yt = tau + Yc)?

Alternatively, I could also see this being a critique of the test statistic employed. The difference of means on the outcome is a common choice, but rank-based statistics might be preferable, or a measure of spread instead of location.

To me, this is the attraction of randomization inference: the opportunity to mix and match models of effects with test statistics that best suit the data.

Cyrus (post author): Thanks for the comment, Mark. Let me unpack what I meant. I am just talking about estimating differences in means (location shift). For me the problem is with the constant effects assumption. It is usually unjustified, and when you use it to construct a confidence interval, what you are doing mathematically is equivalent to invoking a homoskedasticity assumption. If there is any non-trivial effect heterogeneity, these intervals can have poor coverage. Basically Freedman’s critique of homoskedastic regression variance estimates will apply almost exactly, even though it may feel like this estimator is somehow different. (When the data are highly skewed, the equivalence to homoskedasticity is not quite right, but the intervals are still going to be off.) In large samples, the Neyman conservative variance estimator or White’s heteroskedasticity-robust variance estimator (they are, essentially, the same, as discussed in a previous post) are consistent for the true randomization-induced variance, and therefore the confidence intervals will have correct coverage.

I haven’t thought about how this all might apply to other kinds of effects estimators (e.g. multiplicative effects), and it might be interesting to do so. But typically (and I am sort of an Angrist and Pischke devotee on this count) the difference in means is the most relevant estimand.