Notes on multiple comparisons and pre-specifying exploratory analyses

I am part of a few research projects and initiatives involving the development of pre-analysis plans. This includes involvement in the EGAP metaketas (link) as well as some of my own research projects. Some questions that researchers frequently struggle to address include, “what kind of multiple comparisons adjustments do we need?” and “what if we are unsure about the nature of the effects that will arise and we want to allow for exploration?”. Here are some thoughts in relation to these questions:

First, I recommend that you read these two very good blog posts on how to assess multiple comparisons issues by Daniel Lakens (link) and Jan Vanhove (link). The core insight from these posts is that multiple comparisons adjustments are intended to control error rates in relation to the specific judgments that you want to make on the basis of your analysis.

Second, my understanding is that multiple comparisons adjustments and two-sided tests are ways of establishing appropriate standards precisely for exploratory analyses. For example, multiple comparisons adjustments, or index creation (link), can come into play when, ex ante, available information leaves you undecided about which outcomes might reveal an effect, but you are willing, ex post, to take a significant estimate on any outcome as being indicative of an effect. Similarly, two-sided tests come into play when, ex ante, available information leaves you undecided as to whether an effect is positive or negative, but you are willing, ex post, to take a significant negative estimate as indicative of a negative effect and a significant positive estimate as indicative of a positive effect. There is nothing wrong with being honest about such undecidedness in a pre-analysis plan, and there is nothing about pre-specification that precludes such exploration. Rather, pre-specification allows you to think through an inferential strategy that is ex post reliable in terms of error rates.

Update (6/5/17):

Some discussions that followed my posting these notes made me think it would be useful to give a toy example that helps to think through some of the issues highlighted above. So here goes:

Suppose that you have a study that will produce hypothesis tests on treatment effect estimates for two outcomes, A and B. So, we have two significance tests. (Most of the points made here would generalize to more tests or groupings of tests.) What kind of multiple testing adjustment would be needed? It depends on the judgments that you want to make. Here are three scenarios:

  1. My primary goal with the experiment is that I want to see if the treatment does anything, and if it does, I will proceed with further research on this treatment, although the direction of the research depends on whether I find effects on neither, A, B, or both. I selected outcomes A and B based on pretty loose ideas about how the treatment might work.
  2. My selection of outcomes A and B is based on the fact that there is a community of scholars quite invested in whether there is an effect on A, and another community of scholars invested in whether there is an effect on B. My research design allows me to test these two things. The conclusion regarding the outcome-A effect will inform how research proceeds on the “A” research program, and, distinct from that, the conclusion regarding the outcome-B effect will inform how research proceeds on the “B” program.
  3. Outcome A is the main outcome of interest. However, depending on whether there is an effect on A, I am also interested in B, as a potential mechanism.

These three scenarios each suggest a different way of treating the multiple outcomes. The judgment in scenario 1 depends on whether A or B is significant, and therefore requires multiple testing adjustment for the two outcomes so as to control this joint error rate. The two judgments in scenario 2 are independent of each other. As such, the error rate for each independent judgment depends only on the error rate for the corresponding individual test, and no multiple comparisons adjustment is called for. The sequence of judgments in scenario 3 suggests a “sequential testing plan,” along the lines of those discussed by Rosenbaum in Chapter 19 of his Design of Observational Studies. (Hat tip to Fernando Martel @fmg_twtr for this.)
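To make the three treatments concrete, here is a minimal Python sketch of the decision rule each scenario implies, given a p-value for each outcome. Everything in it is an assumption for illustration: the function names are hypothetical, the 0.05 level is arbitrary, Bonferroni is just one of several adjustments that could serve in scenario 1, and the fixed-sequence rule in scenario 3 is one simple kind of sequential plan rather than necessarily the one Rosenbaum describes.

```python
ALPHA = 0.05  # assumed nominal significance level


def scenario_1(p_a, p_b, alpha=ALPHA):
    """One joint judgment ("does the treatment do anything?").

    Any significant outcome counts as evidence, so the two tests share the
    error budget. Bonferroni (alpha / 2 per test) is used here as one
    simple choice of adjustment.
    """
    threshold = alpha / 2
    return {"A": p_a <= threshold, "B": p_b <= threshold}


def scenario_2(p_a, p_b, alpha=ALPHA):
    """Two separate judgments for two separate research programs.

    Each test is evaluated at the full level, with no adjustment.
    """
    return {"A": p_a <= alpha, "B": p_b <= alpha}


def scenario_3(p_a, p_b, alpha=ALPHA):
    """Fixed-sequence plan: A is the main outcome, B is examined only after A.

    B is tested (at the full level) only if A is rejected first.
    """
    reject_a = p_a <= alpha
    reject_b = reject_a and p_b <= alpha
    return {"A": reject_a, "B": reject_b}


if __name__ == "__main__":
    # Same hypothetical p-values, three different sets of conclusions.
    p_a, p_b = 0.20, 0.03
    print("scenario 1:", scenario_1(p_a, p_b))
    print("scenario 2:", scenario_2(p_a, p_b))
    print("scenario 3:", scenario_3(p_a, p_b))
```

With these hypothetical p-values, B counts as significant only under scenario 2: in scenario 1 it misses the adjusted threshold of 0.025, and in scenario 3 it is never examined because A was not rejected.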

The upshot is that it is not the number of outcomes that matters in and of itself; rather, it is the nature of the judgments that one wants to make with tests on these outcomes that determines the adjustment needed. The goal is to control the error rate for the judgment that you are making. I get the sense that the confusion over adjustments comes from confusion over what people want to do with the results of an analysis.
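As a rough check on what "controlling the error rate for the judgment" means in the toy example, the arithmetic below computes the chance of wrongly concluding that "the treatment does something" when there is truly no effect on either outcome. It assumes the two tests are independent, which is a simplification (outcomes from the same study are usually correlated), and the helper name is hypothetical.

```python
def prob_any_false_rejection(per_test_alpha, n_tests):
    """Chance that at least one of n independent true-null tests rejects."""
    return 1 - (1 - per_test_alpha) ** n_tests


alpha = 0.05  # assumed nominal level

# Scenario-1 judgment with no adjustment: each test run at 0.05.
unadjusted = prob_any_false_rejection(alpha, 2)

# Scenario-1 judgment with Bonferroni: each test run at 0.05 / 2.
bonferroni = prob_any_false_rejection(alpha / 2, 2)

print(f"unadjusted: {unadjusted:.4f}")  # 0.0975
print(f"bonferroni: {bonferroni:.4f}")  # 0.0494
```

So for the joint judgment, the unadjusted error rate is nearly double the nominal 5 percent, while the adjusted version stays just under it. For each of the separate judgments in scenario 2, the relevant error rate is simply the 5 percent of the corresponding single test. Under dependence between the tests the unadjusted figure would differ, though by the union bound it cannot exceed 10 percent.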


Beliefs that Don’t Self-Correct

Some people hold beliefs that are false according to the most rigorous, current scientific wisdom. Take, for example, anti-vaccine types’ beliefs about the risks of vaccines (e.g. autism risk).

What’s funny is that at a societal level, we find ourselves in seemingly intractable debates over these beliefs, as if they were moral issues. It’s funny because, from a material rational perspective, at some point and at some level the beliefs should be self-correcting.

What are some explanations? One possibility is externalities. As we are all now acutely aware, in democratic systems these false beliefs can aggregate into policies that threaten even those who hold the correct beliefs. But this works the other way around too. So long as the false beliefs are held by an electoral minority, the electoral majority protects them from their foolishness.

Externalities are sometimes more intrinsic to the issue at hand. With vaccines, current scientific wisdom holds that autism risks are negligible whereas benefits in terms of protection from other diseases are substantial. This particular case is also complicated by “herd immunity,” in that you are affected by your neighbor’s vaccination decision. If the fools reside next to sophisticates, then, again, the fools are protected.

Or, it may be that the costs are borne by future generations, and so do not feed back directly to those taking the consequential decisions. Climate change has this feature: it’s the fools’ children or grandchildren who will suffer. (Although, probably, the more immediate problem is that it is the children of others in faraway places that will suffer the fools’ lack of concern.)

Generally, the externalities explanation relies on the standard public goods logic: investing in learning about rigorous scientific findings is a public good, in which case we should expect widespread underinvestment.

The externalities case is not so tight, though. It seems the US hosts an electoral majority of climate change deniers, although one could attribute this to the intergenerational and interregional externalities. But for the more intrinsic, “herd immunity” kinds of issues, the story is not so clear either. For example, my understanding is that anti-vaccine types tend to cluster in their social interactions (home-schooling and whatnot) and therefore “own” and “neighbors’” actions will tend to be highly correlated.

To fill the gaps from the externalities story, here are some other things we might consider:

  • Intrinsic complexity, such that only advanced scientific methods can penetrate these issues and we cannot appeal to direct experience, in which case beliefs depend on some degree of faith.
  • The existence of actors who have vested, immediate interests against the truth and who seek to manipulate the situation.
  • Coordination and team-signaling dynamics similar to those I discussed with regard to “lies, dupes, and shit tests” (.htm)

Notes on methodological individualism, ontology, and social science theorizing

From Sugden’s JEL review of Epstein’s The Ant Trap (.html):

The current “consensus view” [in philosophy] recognizes a distinction, first proposed by Lukes (1968), between explanatory individualism and ontological individualism. Explanatory individualism maintains that social facts are best explained in terms of facts about individuals and their interactions, while ontological individualism maintains that social facts are exhaustively determined by facts about individuals and their interactions. The consensus view is that explanatory individualism is a contestable claim about the most useful methodology for social science; it might be true that many social facts are best explained individualistically, but there are no good grounds for treating nonindividualistic explanations as unacceptable in principle. In contrast, the consensus view takes ontological individualism to be true—indeed, trivially true, a set of what Lukes (1968, p. 20) called “banal propositions.”

You could take the ontological point further to propose that facts about individuals are exhaustively determined by their finer moving parts (cells, all the way down to atoms), but this does not affect judgments of explanatory individualism. So don’t confuse the ontological point with evaluations of explanatory power.

Epstein thinks that these (and other) considerations do indeed support a ‘maybe, maybe not’ position towards explanatory individualism…. But he challenges the second part of the consensus view, that social properties are nothing over and above the properties of individuals.

Hmmm…so Epstein takes issue with social scientists’ emphasis on methodological individualism.

[Epstein’s] new way of thinking [“social ontology”] begins with the recognition that the actions of groups do not necessarily supervene on [that is, reduce immediately to,] the actions of group members. That allows us to understand that particular kinds of groups (legislatures being an example) can be set up to achieve particular purposes, and that thousands of years of sociality have endowed human beings with strategies for “improving the design of groups, helping to ensure that they accomplish their purposes” (pp. 234–35). In the case of a legislature, the [group-level] factors that can be manipulated [while keeping the individuals constant] include its rules for aggregating votes and the rules by which its members are elected.

Sugden is not convinced, at least not convinced that social scientists’ emphasis on methodological individualism has caused them to overlook such possibilities:

Of course, Epstein is right about this. But if these are new ideas for social ontology, they are not new for social science.

So even if social scientists’ principles are not stated in a complete fashion, it is not clear that there has been an important consequence to that. This leads Sugden to wonder, does progress in social science really require that ontological questions like this are sorted out?

If [a] counterfeit dollar is not detectably different from the real thing, both types of bills will serve interchangeably as a common medium of exchange. A natural way of modeling this scenario—say, as part of an explanation of changes in the general price level in an economy with forgery—would be to use a model in which paper money is a homogeneous commodity, produced both by the Bureau of Engraving and Printing and by the Mafia. In a model of this kind, the concept of “money” does not correspond with…ontological accounts of what money really is [emph. added]. But in deciding to use this modeling strategy, the modeler does not need to engage in ontological analysis of the true nature of money. The rationale for amalgamating the two types of bills into a common category comes from economic reasoning about the properties of trade and from an understanding of what the model is designed to do.

Lies, Dupes, and Shit Tests

Leaders sometimes tell outrageous lies. Maybe they are channeled through “fake news” sites. Of course fake news has been around forever — e.g., we’ve long spoken of “government mouthpieces” and propaganda. How could leaders get away with such lies?

A great read on this is a working paper by Andrew Little: .pdf Here is a summary: .html. Andrew’s theory supposes that some segment of the population are dupes — gullible enough to fall for the lies. Of course if most people are dupes, there is not much of a story to tell (though not to say it isn’t accurate). But the world also contains sophisticates who can see through the lies. The dupes could nonetheless induce sophisticates to express belief in the lie (even if privately the sophisticates think differently). The reason is that the sophisticates share a sense of the need for consensus. So they will go along, not because they believe or even like the lie, but because they would simply prefer to be part of the consensus.

A complementary way to think about this does not require there to be any dupes at all. I call this the “shit test” explanation. Suppose everyone is a sophisticate. But for at least some of these sophisticates, it is important to be in good standing with the leader. Then the leader can say something outrageous, and this serves as a shit test: the leader can use it to check people’s commitment to the leader. If you are truly devoted, you will “swallow the shit.” The more outrageous the lie, the better it works as a shit test (although maybe the leader cannot push it too far). The way people respond therefore conveys information about whether they are with the leader or not. The leader’s allies might also beat other people over the head with the shit: are you with us or against us? You would expect separation in people’s expressed support for the lie on the basis of their degree of attachment to the leader or the degree to which they feel compelled to go along with the leader’s allies. An implication of the shit test theory is that a mirror phenomenon is also possible: when the leader speaks truths, those who are alienated from the leader may have reason to deny those truths.


Election questions

  1. R votes were about what they were in the past. What we really need to know is whether these are almost entirely people who have chosen R in the past. If yes, then the big question for Rs is “why would they accept him?” If it’s lots of new R votes, compensating for lots of Rs who didn’t vote for R again, then the question is “why these new Rs for this election?”

  2. D votes are down relative to the past. What we really need to know is whether this is primarily because lots who voted D in the past either stayed home and didn’t vote (whether by choice or because of suppression) or voted third party. If yes, then the big question is “why didn’t they vote for D again?”

  3. If neither of the above accounts for what happened, then the implication is that a non-negligible share of people who had voted D in the past were actually comparing what R and D had to offer, and at least in a select set of districts in a select set of states, chose R. Then the big question is “why would they switch?”

My current belief, based on results (vote shares, vote share swings, and vote totals) and my own understanding about voters, is that 3 is unimportant, asking about the relative appeal of the R vs D candidates is irrelevant, and better understanding would come from examining why Rs do what they do as Rs and Ds as Ds. But my belief about this would change if I saw individual-level survey data or voter file data suggesting that 3 is in fact important.