An alternative to conventional journals and peer review: the “proceedings” model

Agonizing over peer review is a perennial theme in conversations among scholars. I have given this some thought, and in this attached document, I propose an alternative “proceedings” model for publication in political science, my home discipline: [PDF]

A point that I make in the document is that, in other disciplines like computer science, proceedings-type publications are the highest-prestige outlets, and conventional journals are considered second-tier. So, there is nothing essential about conventional journals for granting prestige.

Something implicit in this model, though I do not make it explicit, is that there is ample scope for scholars to be entrepreneurial in organizing new events, perhaps even one-off events or short-term series, that generate new proceedings outlets. An overarching governing body (like an APSA section) could serve to “certify” such proceedings. This would be an alternative to the “special issues” of journals that are sometimes arranged to serve a similar purpose, but that tend to be bogged down unnecessarily by the hurdles of conventional publication processes.

P.S.: For those interested in models of publication alternative to the conventional closed-review, closed-access formats, here are two to consider:

  • NIPS Proceedings (link) are the peer-reviewed proceedings of the annual Neural Information Processing Systems conference, a major forum for advances in machine learning. Note that papers are posted along with their reviews.
  • Theoretical Economics (link) is an open-access journal focusing on economic theory, published by the Econometric Society. Note that they host using Open Journal Systems software from the Public Knowledge Project (link).

Theoretical models and RCTs

In my research, I typically try to inform decisions on the design of policies. Sometimes this amounts to a binary “adopt” / “do not adopt” decision, but usually it is more complicated than that. To the extent that it is, I would like to have an experiment that sets me up to inform the more complicated decision. This often requires that I pose some kind of theoretical model that relates a range of options for policy inputs to outcomes of interest.

To the fullest extent possible, I would like my experimental design to deliver estimates of the key parameters in the model with minimal additional assumptions needed at the analysis stage. That is, I want as many as possible of the identifying assumptions that I need to be guaranteed by my design. This is, in essence, the approach developed by Chassang et al. in the context of treatments that work only if recipients put in some effort to make them work. Here is a link to the paper: [published] [ungated PDF]. In this approach, the model informs the design in an ex ante manner. Ex post, after the experimental data are in, we can just estimate some simple conditional means to get the parameters of interest. A la Rubin (link), design trumps analysis; the revision is that it is model-informed design.

Now, sometimes I cannot do everything that I want in my design. For example, suppose my theoretical model suggests a potentially nonlinear relationship between inputs and outcomes. Suppose as well that I can only assign treatment to a few points on the potential support of the inputs (maybe even just two points). Then, I may need to do more with the analysis to get a sense of what outcomes might look like in areas of the support of the inputs where I have no direct evidence. This would be important if we want to propose optimal policies over the full support of input levels. We could take as an example this approach by Banerjee et al., who try to estimate the optimal allocation of police posts to reduce drunk driving: [ungated PDF].
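To make the extrapolation problem concrete, here is a toy sketch (with hypothetical numbers, not taken from any of the papers above): if treatment is assigned at only two input levels, the experimental data cannot distinguish between functional forms that agree on the support but disagree sharply off it, so any claim about optimal policy at unassigned levels leans on a modeling assumption.

```python
# Hypothetical example: mean outcomes observed at only two assigned
# input levels (say, number of police posts).
x0, y0 = 0.0, 0.0
x1, y1 = 4.0, 2.0

# Linear model through the two points: y = y0 + b * (x - x0)
b = (y1 - y0) / (x1 - x0)

def linear(x):
    """Straight line through the two observed points."""
    return y0 + b * (x - x0)

# A concave model that also fits both points exactly: y = a * sqrt(x)
a = y1 / x1 ** 0.5

def concave(x):
    """Square-root curve that also passes through both points."""
    return a * x ** 0.5

# Both models fit the experimental evidence perfectly...
print(linear(4.0), concave(4.0))    # 2.0 2.0

# ...but imply very different outcomes at an unassigned level x = 16,
# and hence different conclusions about where returns diminish.
print(linear(16.0))    # 8.0
print(concave(16.0))   # 4.0
```

The two models are observationally equivalent given the design, so choosing between them (and thus choosing a policy off the support) requires theory, not data.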

(These issues are central to a working group that Leonard Wantchekon and I are now running for NYC-area economists and political scientists. We had our first event last week at Princeton and it was great! This post is inspired by the thought-provoking talks given by Erik Snowberg, Brendan Kline, Pierre Nguimpeu, and Ethan Bueno de Mesquita at that event.)


Two tales about a null result

(See the bottom for a bit of a plot twist.)

Tale 1

A researcher has some intuitions about how to test a theoretical proposition using what appears to be a nice natural experiment. Before discussing things with colleagues and without putting much more thought into it, the researcher pulls together some data and does some analyses. The idea is just to see whether there are any interesting patterns to pursue further. The researcher tries out different outcome measures. The researcher also tries out different specifications that come to mind based on deeper, iterative reflection on the problem. All results come back noisy, with no clear patterns—it’s a null result. The researcher declares that either the proposition or the natural experiment has flaws that were not apparent at the outset.

Tale 2

A researcher has some intuitions about how to test a theoretical proposition using what appears to be a nice natural experiment. The researcher discusses the idea with some colleagues, who agree that this has potential as a natural experiment and also addresses a question in a research program that draws a lot of interest. The researcher thinks harder about the theory and the natural experiment, finding some subtleties that make the analysis of the natural experiment more informative about the theoretical proposition. The researcher writes up the theoretical analysis giving rise to the proposition, the various tests that the proposition implies, and the data analysis plan, and presents them for comment at workshops and conferences. The researcher receives suggestions and generally positive feedback about the value of pursuing the project. This is somewhat unusual — in the past, ideas have been shot down either at the more inchoate stage or even at workshops based on pre-analysis plans. So the researcher is excited to carry out the analysis. All results come back noisy, with no clear patterns—it’s a null result. The researcher declares that either the proposition or the natural experiment has flaws that were not apparent at the outset.


How do these tales end?

Probably, for tale 1, the researcher just abandons the project. Should there be a record available of this? I suppose so, insofar as the absence of a record is “file drawering.” But the project was exploratory and informal, so I do not think a full-on journal publication would be warranted. Rather, what would be nice would be a place to deposit logs of such exploratory analyses along with an abstract and keywords. Then, others interested in pursuing similar ideas would have a place to search and learn from others’ attempts. The researcher could add a line to the CV in a section called “research logs.”

Tale 2 is different. It is the process leading up to the null that makes it different. Given the “peer review” in the pre-analysis stage, the findings should be of interest to that community of researchers. The null finding merits a full-on journal publication.

Plot twist

Nothing about the results being “null” in the above affects the logic of the story. Indeed, I focused only on null results here because I wanted a story that shows how we can feel good about a null result. That’s the second tale. However, if we made the results “significant” instead, it does not follow that one should, as a result of that change, now “feel good” in the first tale.


Notes on multiple comparisons and pre-specifying exploratory analyses

I am part of a few research projects and initiatives involving the development of pre-analysis plans. This includes involvement in the EGAP metaketas (link) as well as some of my own research projects. Some questions that researchers frequently struggle to address include, “what kind of multiple comparisons adjustments do we need?” and “what if we are unsure about the nature of the effects that will arise and we want to allow for exploration?”. Here are some thoughts in relation to these questions:

First, I recommend that you read these two very good blog posts on how to assess multiple comparisons issues by Daniël Lakens (link) and Jan Vanhove (link). The core insight from these posts is that multiple comparisons adjustments are intended to control error rates in relation to the specific judgments that you want to make on the basis of your analysis.

Second, my understanding is that multiple comparisons adjustments and two-sided tests are ways of establishing appropriate standards precisely for exploratory analyses. For example, multiple comparisons adjustments (or index creation — link) can come into play when, ex ante, available information leaves you undecided about which outcomes might reveal an effect, but you are willing, ex post, to take a significant estimate on any outcome as being indicative of an effect. Similarly, two-sided tests come into play when, ex ante, available information leaves you undecided as to whether an effect is positive or negative, but you are willing, ex post, to take a significant negative estimate as indicative of a negative effect and a significant positive estimate as indicative of a positive effect. There is nothing wrong with being honest about such undecidedness in a pre-analysis plan, and there is nothing about pre-specification that precludes such exploration. Rather, the pre-specification allows you to think through an inferential strategy that is ex post reliable in terms of error rates.
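As a concrete sketch of what such an adjustment does, here is a minimal pure-Python implementation of the Holm step-down procedure (the p-values are hypothetical; in practice one might use a packaged routine such as `statsmodels`’ `multipletests`). It controls the probability of at least one false rejection across the family of tests — exactly the error rate that matters when any significant outcome would be taken as indicative of an effect.

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment: controls the familywise error rate,
    i.e., the probability of at least one false rejection across tests."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

# Two outcomes, raw p-values from the two significance tests:
raw = [0.03, 0.20]
print(holm_adjust(raw))  # [0.06, 0.2]
```

Comparing each adjusted p-value to the nominal level (e.g., 0.05) then yields judgments whose joint error rate is controlled at that level.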

Update (6/5/17):

Some discussions that followed my posting these notes made me think it would be useful to give a toy example that helps to think through some of the issues highlighted above. So here goes:

Suppose that you have a study that will produce hypothesis tests on treatment effect estimates for two outcomes, A and B. So, we have two significance tests. (Most of the points made here would generalize to more tests or groupings of tests.) What kind of multiple testing adjustment would be needed? It depends on the judgments that you want to make. Here are three scenarios:

  1. My primary goal with the experiment is that I want to see if the treatment does anything, and if it does, I will proceed with further research on this treatment, although the direction of the research depends on whether I find effects on neither, A only, B only, or both. I selected outcomes A and B based on pretty loose ideas about how the treatment might work.
  2. My selection of outcomes A and B is based on the fact that there is a community of scholars quite invested in whether there is an effect on A, and another community of scholars invested in whether there is an effect on B. My research design allows me to test these two things. The conclusion regarding the outcome-A effect will inform how research proceeds on the “A” research program, and, distinct from that, the conclusion regarding the outcome-B effect will inform how research proceeds on the “B” program.
  3. Outcome A is the main outcome of interest. However, depending on whether there is an effect on A, I am also interested in B, as a potential mechanism.

These three scenarios each suggest a different way of treating the multiple outcomes. The judgment in scenario 1 depends on whether either A or B is significant, and therefore requires a multiple testing adjustment so as to control the joint error rate across the two tests. The two judgments in scenario 2 are independent of each other; as such, the error rate for each judgment depends only on the error rate for its individual test — no multiple comparisons adjustment is called for. The sequence of judgments in scenario 3 suggests a “sequential testing plan,” along the lines of those discussed by Rosenbaum in Chapter 19 of his Design of Observational Studies. (Hat tip to Fernando Martel @fmg_twtr for this.)
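A quick simulation makes the contrast between scenarios 1 and 2 concrete. The setup is a toy one of my own (two independent z-tests with both nulls true): without adjustment, the scenario-1 judgment errs at roughly 1 − (1 − 0.05)² ≈ 0.0975, while a Bonferroni adjustment restores 0.05; each scenario-2 judgment is already at 0.05 on its own.

```python
# Toy Monte Carlo: two independent two-sided z-tests, both nulls true.
import random

random.seed(0)
Z_05 = 1.96      # two-sided critical value at alpha = 0.05
Z_025 = 2.2414   # two-sided critical value at alpha = 0.025 (Bonferroni)
N_SIMS = 200_000

joint_err_raw = joint_err_bonf = err_a = 0
for _ in range(N_SIMS):
    z_a = random.gauss(0, 1)  # test statistic for outcome A
    z_b = random.gauss(0, 1)  # test statistic for outcome B
    # Scenario 1 judgment errs if *either* unadjusted test rejects:
    joint_err_raw += abs(z_a) > Z_05 or abs(z_b) > Z_05
    # Same judgment with Bonferroni (each test at alpha/2):
    joint_err_bonf += abs(z_a) > Z_025 or abs(z_b) > Z_025
    # Scenario 2: the "A" community only ever looks at its own test:
    err_a += abs(z_a) > Z_05

print(joint_err_raw / N_SIMS)   # ~0.0975: inflated without adjustment
print(joint_err_bonf / N_SIMS)  # ~0.05: controlled by the adjustment
print(err_a / N_SIMS)           # ~0.05 already: no adjustment needed
```

The same logic extends to more outcomes: the relevant question is always which combination of rejections triggers a given judgment.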

The upshot is that it is not the number of outcomes that matters in and of itself; rather, it is the nature of the judgments that one wants to make with tests on these outcomes that determines the adjustment needed. The goal is to control the error rate for the judgment that you are making. I get the sense that the confusion over adjustments comes from confusion over what people want to do with the results of an analysis.


Beliefs that Don’t Self-Correct

Some people hold beliefs that are false according to the most rigorous, current scientific wisdom. Take, for example, anti-vaccine types’ beliefs about the risks of vaccines (e.g. autism risk).

What’s funny is that at a societal level, we find ourselves in seemingly intractable debates over these beliefs, as if they were moral issues. It’s funny because, from a material rational perspective, at some point and at some level the beliefs should be self-correcting.

What are some explanations? One possibility is externalities. As we are all now acutely aware, in democratic systems these false beliefs can aggregate into policies that threaten even those who hold the correct beliefs. But this works the other way around too. So long as the false beliefs are held by an electoral minority, the electoral majority protects them from their foolishness.

Externalities are sometimes more intrinsic to the issue at hand. With vaccines, current scientific wisdom holds that autism risks are negligible whereas benefits in terms of protection from other diseases are substantial. This particular case is also complicated by “herd immunity,” in that you are affected by your neighbor’s vaccination decision. If the fools reside next to sophisticates, then, again, the fools are protected.

Or, it may be that the costs are borne by future generations, and so do not feed back directly to those taking the consequential decisions. Climate change has this feature: it’s the fools’ children or grandchildren who will suffer. (Although, probably, the more immediate problem is that it is the children of others in faraway places that will suffer the fools’ lack of concern.)

Generally, the externalities explanation relies on the standard public goods logic: investing in learning about rigorous scientific findings is a public good, in which case we should expect widespread underinvestment.

The externalities case is not so tight, though. It seems the US hosts an electoral majority of climate change deniers, although one could attribute this to the intergenerational and interregional externalities. But for the more intrinsic, “herd immunity” kinds of issues, the story is not so clear either. For example, my understanding is that anti-vaccine types tend to cluster in their social interactions (home-schooling and whatnot) and therefore “own” and “neighbors’” actions will tend to be highly correlated.

To fill the gaps from the externalities story, here are some other things we might consider:

  • Intrinsic complexity, such that only advanced scientific methods can penetrate these issues and we cannot appeal to direct experience, in which case beliefs depend on some degree of faith.
  • Existence of vested immediate interests that are set against the truth and seek to manipulate the situation.
  • Similar coordination and team-signaling dynamics as I discussed with regard to “lies, dupes, and shit tests” (.htm).