Toward a norm of results-free peer review and “ex ante science”

Vox recently posted an article on “problems facing science” (link). A panel of 270 scientists from across a range of disciplines chimed in. A major theme, and arguably the biggest problem identified after issues related to accessing grants, was that “bad incentives” undermine scientific integrity. Specifically, these bad incentives arise because publication and grant decisions tend overwhelmingly to be based on assessments of whether research results are “exciting.” Vox also reported that the “fix” for this problem, as suggested by many of the panelists, was for editors and reviewers to “put a greater emphasis on rigorous methods and processes rather than splashy results.”

Recently, Comparative Political Studies hosted a special issue dedicated to applying a results-free review process (link). The editors of this special issue concluded that the process promoted attention to “theoretical consistency and substantive importance.” It introduced some complications too, such as questions about how to handle statistically insignificant results and how to accommodate research designs other than experiments or certain types of observational templates. But generally, they concluded that the process “exceeded our expectations.”

These two articles reference other detailed arguments promoting the idea of review based on whether hypotheses are well motivated and methods rigorously applied. I have also elaborated on why I think this kind of “ex ante science” is a good idea (link1 link2). The principles of “ex ante science” are to evaluate the value of applied empirical research contributions on the basis of whether the empirical analyses are well motivated in substantive or theoretical terms, whether the empirical methods are tightly derived from the substantive motivation, and whether the proposed empirical methods are robust. One avoids referencing results in judging the value of the contribution.

Here I want to suggest something that we can start doing immediately to promote this goal: voluntary commitment by journal reviewers to evaluate manuscripts on the basis of principles of ex ante science. Journal editors give reviewers discretion to apply their judgment in evaluating a manuscript. This grants a license to those interested in promoting the principles of ex ante science to do just that.

Here are some operational guidelines. As a reviewer you could begin by masking results prior to starting to read a manuscript. Then, you could structure your review so that it addresses the questions pertaining to the principles stated above.

Let’s take it even further, in the interest of promoting a norm of reviews based on principles of ex ante science. To resolve any ambiguity about your commitment to these principles as a reviewer, make it explicit. Reviews could begin with a declaration along the lines of “This review is based on assessments of whether or not the empirical analyses are well motivated and the empirical methods robust. Results were masked in judging the merits of the manuscript.”


Inverse covariance weighting versus factor analysis

These are two ways to take a bunch of variables that are supposed to measure common latent factors and reduce them to a single index or a few indices. What is the difference? I get the question fairly often, so I thought I’d put this post up.

The two approaches do different things. Inverse covariance weighting applies an assumption that there is one latent trait of interest, and constructs an optimal weighted average on the basis of that assumption. Factor analysis tries to partial out an array of orthogonal latent factors.

An intuitive way to think of it is like this:

Suppose you have data that consists of three variables: College Math Grade, Math GRE, and Verbal GRE. The two math variables will be highly correlated, and the verbal variable will be somewhat correlated with the math scores.

The inverse covariance weighted average of these three variables would result in an index that gives about 25% weight to each math score and then 50% weight to the verbal score. It “rewards” the verbal score for providing new information that the math scores don’t. The resulting index could be interpreted as a “general scholastic aptitude” index.
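To make the weighting concrete, here is a minimal sketch in Python. The correlation values (0.9 between the two math variables, 0.3 between each math variable and the verbal variable) are made up for illustration, and the function name icw_weights is hypothetical. It assumes the usual form of the weights, proportional to the row sums of the inverse covariance matrix and normalized to sum to one.

```python
import numpy as np

# Hypothetical correlation matrix (illustrative values, not estimated from data).
# Variable order: College Math Grade, Math GRE, Verbal GRE.
corr = np.array([
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.3],
    [0.3, 0.3, 1.0],
])

def icw_weights(cov):
    """Return weights proportional to the row sums of the inverse of cov."""
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)  # solves cov @ raw = 1, i.e. inverse(cov) @ 1
    return raw / raw.sum()            # normalize so the weights sum to one

weights = icw_weights(corr)
for name, w in zip(["College Math Grade", "Math GRE", "Verbal GRE"], weights):
    print(f"{name}: {w:.2f}")
```

With these made-up correlations the weights come out at roughly 0.26, 0.26, and 0.48, close to the 25/25/50 split described above. The split moves toward exactly 25/25/50 as the math correlation approaches one and the verbal correlations approach zero.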

A factor analysis of these three variables would yield two orthogonal factors. The first would give almost 50% weight to each math variable and almost zero weight to the verbal variable; the second would give almost zero weight to each math variable and almost 100% weight to the verbal variable. So you would get a “pure math” factor and a “pure verbal” factor.

Which one is better? It depends on the goals of your analysis.

I discuss this a bit more in my lecture on “measurement” in the quant field methods class (see links at top right). There is some R code there to play around with these concepts too.


Reasons for experiments in policy research that have little to do with statistics

We have all heard the various statistical reasons for giving experimental evidence special consideration in policy research. For example, JPAL has nice resources (link) covering points such as the need to balance unobserved confounders.

But when speaking to those involved in designing and implementing policies, I also point to two considerations that are not really statistical so much as sociological:

1. Putting manipulability to the test

As it happens, a randomized experiment is not necessarily the most efficient way to obtain a consistent estimate of a causal effect. See, e.g., Kasy’s research (link) or this recent discussion on the Development Impact blog (link). Of course, the non-randomized alternatives do share with RCTs the fact that treatments are manipulated and therefore not the products of endogenous selection. It is such manipulation, not whether or not it is applied according to randomization, that is the essence of an “experiment.”

We have the famous quote from Box (1966, link):

To find out what happens to a system when you interfere with it you have to interfere with it (not just passively observe it).

This I would say is the essential argument in favor of experimentation for policy research.

Whether one or another intervention is likely to be more effective depends both on the relevant mechanisms driving outcomes and, crucially, whether the mechanisms can be meaningfully affected through intervention. It is in addressing the second question that experimental studies are especially useful. Various approaches, including both qualitative and quantitative, are helpful in identifying important mechanisms that drive outcomes. But experiments can provide especially direct evidence on whether we can actually do anything to affect these mechanisms — that is, experiments put “manipulability” to the test.

The successful use of experiments in policy research typically requires drawing on insights from other research on relevant mechanisms. This other research defines debates about what policy makers should do and how they should do it. Experiments have a distinct role in such debates by clarifying what is materially possible.

Related to this is replicability. What is nice about an experiment is that, in principle, you should have before you a recipe for recreating an effect. Context-dependence means that replicability may sometimes be elusive in practice. One could measure scientific success in the ability to fashion complete recipes (including the contextual conditions) for replicating effects. Observational studies are often deficient in this regard because we cannot control where and when we get the variation in treatments of interest. We are left to wonder whether we have really mastered what the observational data imply about causal effects. It’s possible that we have not mastered it at all and have merely tricked ourselves (or others!) into believing that certain causal effects are evident. With a complete experimental recipe we can test it.

2. Deep engagement

Experimental evaluations of policies or programs are prospective. As such they typically require deep engagement between researchers and implementers in processes of policy formulation, beneficiary selection, and site selection. Compare this to an ex post analysis. In an ex post analysis, such details are often lost. It is for good reason then that you often hear from practitioners that ex post evaluators did not understand “what really went on” in the program. They weren’t there from the beginning. In my experience, this is much less the case for experimental studies. Working prospectively, the researcher is there operating alongside implementation. The experimental method typically defines beneficiary selection.

Finally, constructing the experiment requires programmatic goals to be made concrete. Such concreteness is needed for defining interventions crisply and devising outcome measures. In my experience, implementing partners have found it useful to go through the process of making interventions and outcomes concrete. Often, doing so for the purposes of an experimental evaluation was the first time they had to think so precisely about interventions and outcomes. It is a good disciplining device. It helps to make clear what is really at stake.

Of course these two factors should be taken alongside some of the limitations of experiments, which mix statistical and sociological considerations. Experiments face timescale challenges, since we can often only sustain experimental variation in treatment for so long, whether due to ethical or program cycle reasons. They also face spatial scale challenges: it can be impractical to develop well-powered experiments for macro-level institutions or programs that cover large areas. Finally, there is the logistical complexity of experiments, given all the up-front decisions that they require. (I do not include external validity as a distinct challenge for experiments, as external validity is an issue that all research typically faces.)

Nonetheless, it is useful to have these ideas articulated to see how experimentation is about a lot more than balance on unobservables.


Philosophical foundations for design-based inference

“[Here are] two questions about ravens:

  • The general raven question: What is the proportion of blackness among ravens?
  • The specific raven question: Is it the case that 100 percent of ravens are black?

Consider a particular observation of a white shoe. Does it tell us anything about the raven color? It depends on what procedure the observation was part of. If the white shoe was encountered as part of a random sample of nonblack things, then it is evidence. It is just one data point, but it is a nonblack thing that turned out not to be a raven. It is part of a sample that we can use to answer the specific question (though not the general question), and work out whether there are nonblack ravens. But if the very same white shoe is encountered in a sample of nonravens, it tells us nothing. The observation is now part of a procedure that cannot answer either question.

The same is true with observations of black ravens. If we see a black raven in a random sample of ravens, it is informative. It is just one data point, but it is part of a sample that can answer our questions. But the same black raven tells us nothing about our two raven questions if it is encountered in a sample of black things; there is no way to use such a sample to answer either question. The role of procedures is fundamental; an observation is only evidence if it is embedded in the right kind of procedure.”
(Godfrey-Smith 2003, pp. 215-216).

This passage on the “procedural naturalism” view of science is from Peter Godfrey-Smith’s book-length survey of current debates in the philosophy of science, Theory and Reality (amazon).

When you write a pre-analysis plan, this is how you should be thinking: ask whether the procedure that generates your observations can answer the question at all. And you should be relating it to theoretical propositions, which is what the two raven questions stand in for.
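The point in the quoted passage can be sketched as a toy check in Python. Everything here is hypothetical: two made-up “worlds” that differ only in whether white ravens exist, and four sampling procedures, each defined by the set of objects it could draw from. A procedure can bear on the specific raven question only if the set it draws from differs between the two worlds.

```python
from collections import Counter

def make_world(white_ravens):
    """Hypothetical population of (kind, color) objects."""
    return ([("raven", "black")] * 50
            + [("raven", "white")] * white_ravens
            + [("shoe", "black")] * 100
            + [("shoe", "white")] * 200)

# Each procedure is defined by which objects it could ever draw.
PROCEDURES = {
    "sample of ravens": lambda kind, color: kind == "raven",
    "sample of nonblack things": lambda kind, color: color != "black",
    "sample of nonravens": lambda kind, color: kind != "raven",
    "sample of black things": lambda kind, color: color == "black",
}

def sampling_frame(world, rule):
    """Count the objects that a given procedure could draw from a world."""
    return Counter(obj for obj in world if rule(*obj))

all_black = make_world(white_ravens=0)   # every raven is black
some_white = make_world(white_ravens=5)  # a few white ravens exist

# A procedure is informative here only if its frame differs across worlds.
distinguishes = {
    name: sampling_frame(all_black, rule) != sampling_frame(some_white, rule)
    for name, rule in PROCEDURES.items()
}
for name, informative in distinguishes.items():
    print(f"{name}: {'can' if informative else 'cannot'} tell the worlds apart")
```

The samples of ravens and of nonblack things draw from sets that differ between the two worlds, so observations from them can count as evidence. The samples of nonravens and of black things draw from identical sets in both worlds, so the same white shoe or black raven found that way tells us nothing. A real random sample from an informative frame could still miss the white ravens by chance, which is a question of power rather than of procedure.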