I am part of a few research projects and initiatives involving the development of pre-analysis plans. This includes involvement in the EGAP metaketas (link) as well as some of my own research projects. Two questions that researchers frequently struggle to address are “what kind of multiple comparisons adjustments do we need?” and “what if we are unsure about the nature of the effects that will arise and we want to allow for exploration?” Here are some thoughts on these questions:
First, I recommend that you read these two very good blog posts on how to assess multiple comparisons issues by Daniel Lakens (link) and Jan Vanhove (link). The core insight from these posts is that multiple comparisons adjustments are intended to control error rates in relation to the specific judgments that you want to make on the basis of your analysis.
Second, my understanding is that multiple comparisons adjustments and two-sided tests are ways of establishing appropriate standards precisely for exploratory analyses. For example, multiple comparisons adjustments (or index creation [link]) can come into play when, ex ante, available information leaves you undecided about which outcomes might reveal an effect, but you are willing, ex post, to take a significant estimate on any outcome as being indicative of an effect. Similarly, two-sided tests come into play when, ex ante, available information leaves you undecided as to whether an effect is positive or negative, but you are willing, ex post, to take a significant negative estimate as indicative of a negative effect and a significant positive estimate as indicative of a positive effect. There is nothing wrong with being honest about such undecidedness in a pre-analysis plan, and there is nothing about pre-specification that precludes such exploration. Rather, pre-specification allows you to think through an inferential strategy that is ex post reliable in terms of error rates.
Some discussions that followed my posting these notes made me think it would be useful to give a toy example that helps to think through some of the issues highlighted above. So here goes:
Suppose that you have a study that will produce hypothesis tests on treatment effect estimates for two outcomes, A and B. So, we have two significance tests. (Most of the points made here would generalize to more tests or groupings of tests.) What kind of multiple testing adjustment would be needed? It depends on the judgments that you want to make. Here are three scenarios:
- My primary goal with the experiment is to see whether the treatment does anything. If it does, I will proceed with further research on this treatment, although the direction of that research depends on whether I find effects on neither outcome, on A only, on B only, or on both. I selected outcomes A and B based on pretty loose ideas about how the treatment might work.
- My selection of outcomes A and B is based on the fact that there is a community of scholars quite invested in whether there is an effect on A, and another community of scholars invested in whether there is an effect on B. My research design allows me to test these two things. The conclusion regarding the outcome-A effect will inform how research proceeds on the “A” research program, and, distinct from that, the conclusion regarding the outcome-B effect will inform how research proceeds on the “B” program.
- Outcome A is the main outcome of interest. However, depending on whether there is an effect on A, I am also interested in B, as a potential mechanism.
These three scenarios each suggest a different way of treating the multiple outcomes. The judgment in scenario 1 depends on whether either A or B is significant, and therefore requires a multiple testing adjustment across the two outcomes so as to control this joint error rate. The two judgments in scenario 2 are independent of each other. As such, the error rate for each judgment is just the error rate of the corresponding individual test, and no multiple comparisons adjustment is called for. The sequence of judgments in scenario 3 suggests a “sequential testing plan,” along the lines of those discussed by Rosenbaum in Chapter 19 of his Design of Observational Studies. (Hat tip to Fernando Martel @fmg_twtr for this.)
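To make the error rates concrete, here is a rough simulation sketch of the three scenarios when the treatment truly affects neither outcome. The assumptions are mine, for illustration only: the two test statistics are independent standard normal draws under the null, the tests are two-sided at the 0.05 level, Bonferroni stands in for "a multiple testing adjustment" in scenario 1, and the sequential plan in scenario 3 is the simplest possible one (test B only if A is rejected). The function name and dictionary keys are made up for this example.

```python
import random
from statistics import NormalDist


def simulate_error_rates(n_sims=200_000, alpha=0.05, seed=1):
    """Estimate false-positive rates for each scenario under the global null.

    Assumption (illustrative only): the test statistics for outcomes A and B
    are independent standard normal draws when there is no true effect.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Two-sided critical value for a single test at level alpha.
    z_single = std_normal.inv_cdf(1 - alpha / 2)
    # Bonferroni for two tests: each two-sided test is run at level alpha / 2.
    z_bonferroni = std_normal.inv_cdf(1 - alpha / 4)

    counts = {
        "s1_unadjusted": 0,   # "treatment does something" if A or B significant
        "s1_bonferroni": 0,   # same judgment, Bonferroni-adjusted
        "s2_program_A": 0,    # judgment for the "A" research program only
        "s2_program_B": 0,    # judgment for the "B" research program only
        "s3_sequential": 0,   # any false rejection when B is tested only after A
    }

    for _ in range(n_sims):
        z_a = abs(rng.gauss(0.0, 1.0))
        z_b = abs(rng.gauss(0.0, 1.0))
        sig_a = z_a > z_single
        sig_b = z_b > z_single

        # Scenario 1: a false positive on either outcome triggers the judgment.
        counts["s1_unadjusted"] += sig_a or sig_b
        counts["s1_bonferroni"] += (z_a > z_bonferroni) or (z_b > z_bonferroni)

        # Scenario 2: each judgment rests on its own test.
        counts["s2_program_A"] += sig_a
        counts["s2_program_B"] += sig_b

        # Scenario 3: B is examined only if A is rejected, so any false
        # rejection requires a false rejection of A first.
        rejected = []
        if sig_a:
            rejected.append("A")
            if sig_b:
                rejected.append("B")
        counts["s3_sequential"] += len(rejected) > 0

    return {name: count / n_sims for name, count in counts.items()}


if __name__ == "__main__":
    for name, rate in simulate_error_rates().items():
        print(f"{name}: {rate:.4f}")
```

Under these assumptions the unadjusted scenario 1 judgment should come out wrong about 1 - 0.95^2 = 9.75% of the time, while the Bonferroni version, each scenario 2 judgment, and the sequential plan should all stay at or just under 5%. With correlated test statistics the unadjusted rate would differ, so treat the numbers as illustrative rather than general.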
The upshot is that it is not the number of outcomes that matters in and of itself. Rather, it is the nature of the judgments that one wants to make with tests on these outcomes that determines the adjustment needed. The goal is to control the error rate for the judgment that you are making. I get the sense that the confusion over adjustments comes from confusion over what people want to do with the results of an analysis.