# Why positive theorists should love preregistration & ex ante science

Over at the IQSS Social Science Statistics blog, Richard Nielsen had a great post on pre-registration (link). He writes,

In response to a comment by Chris Blattman, the Givewell blog has a nice post with “customer feedback” for the social sciences. Number one on the wish-list is pre-registration of studies to fight publication bias — something along the lines of the NIH registry for clinical trials. I couldn’t agree more. I especially like that Givewell’s recommendations go beyond the usual call for RCT registration to suggest that we should also be registering observational studies.

As Richard notes, much of the interest in pre-registration is to reduce the publication bias that most certainly afflicts us (evidence from Gerber and Malhotra here). As a result, like in medicine, most published research findings are probably false (link). Some more arguments in favor of pre-registration to control publication bias for both clinical trials and observational studies are given by Neuroskeptic (link).

What I want to propose is that pre-registration, and the “ex ante science” ideal on which it is based, is great for positive theorists, especially for those who want to do mostly or even only positive theory. How so? For two reasons. First, theory and substantive insights are what editorial boards would need to judge the quality and relevance of research questions and hypotheses. Second, ex ante science provides a steadier stream of puzzles that theorists can delight in trying to work out.

## A central role in publication decision processes

Obviously, the quality and relevance of hypotheses ought to be taken into account in judging whether a submission is publication worthy. For a causal or descriptive analysis, identification alone is insufficient to make a study worth one’s time. We want to know if the questions being asked are important and whether the hypotheses are coherent. Consider a journal publication decision mechanism that applies the logic of ex ante science. A proposed submission comes in with hypotheses and research design spelled out, but no analysis has actually been done. Based on the hypotheses and design, the editorial board makes an accept, reject, or revise-and-resubmit decision. This ultimate accept/reject decision is made publicly, pre-registering the accepted study. It is similar to what happens these days with grant proposals and IRB reviews, but it is explicitly tied to a public publication commitment. Once the data are gathered and results processed, the paper is checked against the accepted and publicly registered hypotheses and design. So long as it conforms, it is published irrespective of the results.

Theorists would play a crucial role in the initial decision. If the study examines the effect of a policy, on what basis should we expect any meaningful effects? Are the hypotheses compelling? Will the results do much to affect our understanding of an important problem? Strong deductive reasoning and substantive familiarity with the problem are what shape our priors, and that is exactly what we need to make this call.

## A steadier stream of meaningful puzzles

Popperian epistemology proposes that knowledge advances through falsification, but the current pro-“significance” bias means that this happens too infrequently. We end up continuing to give credibility to theoretical propositions that empirical researchers are too afraid to falsify. We fail to offer theorists all the puzzles that they deserve. Ex ante science will provide a steadier stream of puzzles, meaning more work and more fun for those who want to focus mostly on theory, and more meaningful interaction between theory and empirical research.

# New paper by Casey, Glennerster, and Miguel is real progress in studying community driven development

At the Yale field experiments workshop today, Katherine Casey from Brown University (soon to be Stanford GSB) presented a brilliantly executed study by herself, Rachel Glennerster, and Edward Miguel evaluating the impact of a community driven development (CDD) program on local public goods and community social institutions in rural Sierra Leone. Here is a link to the working paper (PDF). I think this paper is a must-read for those interested not only in decentralization and democratization of rural social institutions in poor countries, but also in field experiments, policy analysis, and causal inference more generally. In fact, I would suggest that if you have an interest in any of these things, you stop what you are doing (including reading this post) and look carefully at their paper right now. Here is the abstract:

Although institutions are believed to be key determinants of economic performance, there is limited evidence on how they can be successfully reformed. The most popular strategy to improve local institutions in developing countries is “community driven development” (CDD). This paper estimates the impact of a CDD program in post-war Sierra Leone using a randomized experiment and novel outcome measures. We find positive short-run effects on local public goods provision, but no sustained impacts on fund-raising, decision-making processes, or the involvement of marginalized groups (like women) in local affairs, indicating that CDD was ineffective at durably reshaping local institutions.

They indicate that, for the most part, these results are consistent with what other CDD studies have produced, raising serious questions about donors’ presumptions that CDD programs can really affect local social institutions. In a recent review of CDD impact evaluations, my co-authors and I found the same thing (see here, gated link). Given the centrality of CDD programs in current development programming, this comes as a call to reflect a bit on why things might not be going as we would like.

For those who don’t really care that much about CDD, there are four methodological aspects of this paper that are simply terrific and therefore warrant that you read it:

1. They very effectively address the multiple outcomes, multiple comparisons, and associated “data dredging” problems that have plagued research on CDD in particular (see again our review essay) and pretty much every recent analysis of a field experiment that I have read. Their approach involves a few steps, with the last step being the most innovative. The steps are, first, articulating a clear set of core hypotheses and registering these (via a Poverty Action Lab evaluation registry) before the onset of the program; second, grouping outcome indicators as the bases of tests for these hypotheses; third, pre-specifying and registering their econometric models; and, finally, using seemingly-unrelated regressions (SUR, link) to produce standard errors on individual outcomes while taking into account dependence across indicators, and then using omnibus mean-effects tests to obtain a single standardized effect and p-value for each core hypothesis. For example, to test the hypothesis that the program would increase lasting social capital, they have about 40 measures. The SUR produces dependence-adjusted standard errors on each of these outcomes, and then the omnibus mean-effects test allows them to combine the results from these individual regressions to present a single standardized effect and p-value for the social capital hypothesis. That’s a huge step forward for analyses of field experiments. Effect synthesis and omnibus testing like this need to become much more routine in our statistical practice (see here for a recent post on omnibus tests of covariate balance).

2. Their hypotheses are motivated by a clear theoretical model that formalizes what the authors understand as being donors’ and the Bank’s thinking about how CDD affects community-level social dynamics. The model explains which constraints and costs they hypothesize as being alleviated such that the program might improve public goods and, potentially, social capital outcomes. This really shores up one’s confidence in the results of the empirical analysis, because it is clear how the hypotheses were established before the analysis.

3. Apropos of some recent discussion over at the World Bank Development Impact blog (link), they study outcomes measured both during the program cycle and some time afterward, to assess programmatic effects on provision of public goods and downstream effects on social capital.

4. To measure effects on social capital, they created minimally intrusive performance measures based on “structured group activities” that closely resemble real-world situations in which collective problem solving would be required. For example, a social capital measure was based on the offer of a matching grant to communities, with the only condition to receive the grant being that the community had to coordinate to come up with matching funds and put in for the grant. In the event, they found that only about half of communities overall were able to take up the matching grant, and the treatment effect on this take-up rate was effectively zero.
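To make the mean-effects idea in item 1 concrete, here is a minimal sketch in R on simulated data. It is not the authors' code: the data, the number of outcomes (`K = 5`), and all variable names are assumptions made for illustration. It also uses the simple index version of the test (standardize each outcome by the control group's mean and standard deviation, average within unit, regress the index on treatment) rather than a full SUR system. When every equation has the same regressors, the coefficient on the index equals the average of the per-outcome standardized effects, and its standard error reflects the correlation across outcomes.

```r
set.seed(42)
n <- 400                      # hypothetical number of communities
K <- 5                        # hypothetical number of outcome indicators
treat <- rbinom(n, 1, 0.5)    # randomized treatment indicator

# Simulated outcomes that share a community-level shock, so they are
# correlated; the true treatment effect is zero for every outcome.
shock <- rnorm(n)
Y <- sapply(1:K, function(k) 0.6 * shock + rnorm(n))

# Standardize each outcome using the control group's mean and SD.
ctrl <- treat == 0
Zs <- sapply(1:K, function(k) (Y[, k] - mean(Y[ctrl, k])) / sd(Y[ctrl, k]))

# Per-outcome standardized effects (one regression per indicator).
per_outcome <- sapply(1:K, function(k) coef(lm(Zs[, k] ~ treat))["treat"])

# Mean-effects index: average the standardized outcomes within each unit,
# then run a single regression to get one effect and one p-value.
index <- rowMeans(Zs)
fit <- summary(lm(index ~ treat))
mean_effect <- coef(fit)["treat", "Estimate"]
p_value <- coef(fit)["treat", "Pr(>|t|)"]

print(round(per_outcome, 3))
print(c(mean_effect = mean_effect, p_value = p_value))
```

The point of the design is that a reader sees one pre-specified test per hypothesis instead of `K` separate p-values to pick among.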

Katherine indicated that for them, the null result on social capital effects was the most important take-away point. This provoked a salient question during the Q&A: how will journal editors react to the fact that the core finding of the paper is a null result on a hypothesis derived from a theory that was motivated only because it seemed to characterize what donors and Bank program staff thought would happen? As a political scientist, I am sympathetic to this concern. I can imagine the cranky political science journal editor saying, “Aw, well, this was a stupid theory anyway. Why should I publish a null result on an ill-conceived hypothesis? Why aren’t they testing a better theory that actually explains what’s going on? I mean, why don’t they use the data to prove the point that they want to make and teach us what is really going on?” Reactions like this, which I do hear fairly often, are in direct tension with ex ante science, and essentially beg researchers to do post-hoc analysis. Hopefully publishing norms in economics won’t force the authors to spoil what is a great paper and probably the most well-packaged, insightful null result I’ve ever read.

# (technical) Multivariate randomization balance tests in small samples

In their 2008 Statistical Science article, Hansen and Bowers (link) propose randomization-based omnibus tests for covariate balance in randomized and observational studies. The omnibus tests allow you to test formally whether differences across all the covariates resemble what might happen in a randomized experiment. Prior to their paper, most researchers tested balance one covariate at a time, making ad hoc judgments about whether apparent imbalance in one or another covariate suggested deviations from randomization. What’s nice about Hansen and Bowers’s approach is that it systematizes such judgments into a single test.

To get the gist of their approach, imagine a simple randomized experiment on $latex N$ units in which $latex M = N/2$ units are assigned to treatment. [CDS: note, this was corrected from the original; results in this post are for a balanced design, although the Hansen and Bowers paper considers arbitrary designs.] Suppose that for each unit $latex i$ we record, prior to treatment, a $latex P$-dimensional covariate, $latex x_i = (x_{i1},\ldots,x_{iP})'$. Let $latex \mathbf{X}$ refer to the $latex N$ by $latex P$ matrix of covariates for all units. Define $latex d_p$ as the difference in the mean values of covariate $latex x_p$ between the treated and control groups, and let $latex \mathbf{d}$ refer to the vector of these differences in means. By random assignment, $latex E(d_p)=0$ for all $latex p=1,\ldots,P$, and $latex Cov(\mathbf{d}) = (N/M^2)S(\mathbf{X})$, where $latex S(\mathbf{X})$ is the usual sample covariance matrix [CDS: see update below on the unbalanced case]. Then, we can compute the statistic $latex d^2 = \mathbf{d}'Cov(\mathbf{d})^{-1} \mathbf{d}$. Hansen and Bowers explain that, in large samples, randomization implies that this statistic will be approximately chi-square distributed with degrees of freedom equal to $latex rank[Cov(\mathbf{d})]$ (Hansen and Bowers 2008, 229). The proof relies on standard sampling theory results.

These results from the setting of a simple randomized experiment allow us to define a test for covariate balance in cases where the data are not from an experiment, but rather from a matched observational study, or where the data were from an experiment, but we might worry that there were departures from randomization that lead to confounding. In Hansen and Bowers’s paper, the test that they define relies on the large sample properties of $latex d^2$. Thus, the test consists of computing $latex d^2$ for the sample at hand, and computing a p-value against the limiting $latex \chi^2_{rank[Cov(\mathbf{d})]}$ distribution that should obtain under random assignment.
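As a rough illustration of the large-sample test just described, the following R sketch computes $latex d^2$ and its chi-square p-value for one simulated data set. It is a sketch for the simple, completely randomized design only, not Hansen and Bowers's own software and not the code linked from this post; the simulated covariates and all variable names are assumptions.

```r
set.seed(123)
N <- 10          # total units
M <- N / 2       # units assigned to treatment
P <- 2           # number of covariates

X <- matrix(rnorm(N * P), nrow = N, ncol = P)     # simulated covariates
z <- sample(rep(c(1, 0), times = c(M, N - M)))    # complete randomization

# Vector of treated-minus-control differences in covariate means.
d <- colMeans(X[z == 1, , drop = FALSE]) - colMeans(X[z == 0, , drop = FALSE])

# Randomization covariance of d. This is the general form; it reduces to
# (N / M^2) * S(X) when M = N / 2.
V <- (N / (M * (N - M))) * cov(X)

# Omnibus statistic and its approximate (chi-square) p-value.
d2 <- as.numeric(t(d) %*% solve(V) %*% d)
df <- qr(V)$rank
p_approx <- pchisq(d2, df = df, lower.tail = FALSE)

print(c(d2 = d2, df = df, p_approx = p_approx))
```

Using `solve(V)` assumes the covariance matrix has full rank; with collinear covariates a generalized inverse would be needed instead.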

I should note that in Hansen and Bowers’s paper, they focus not on the case of a simple randomized experiment, but rather on cluster- and block-randomized experiments. It makes the math a bit uglier, but the essence is the same.

The question I had was, what is the small sample performance of this test? In small samples we can use $latex d^2$ to define an exact test. Does it make more sense to use the exact test? To address these questions, I performed some simulations on data that were more or less well behaved. These included simulations with two normal covariates, one gamma and one log-normal covariate, and two binary covariates. (For the binary covariates case, I couldn’t use a binomial distribution, since this sometimes led to cases with all 0’s or 1’s. Thus, I fixed the number of 0’s and 1’s for each covariate and randomly scrambled them over simulations.) In the simulations, the total number of units was 10, and half were randomly assigned to treatment. Note that this implies 252 different possible treatment profiles.
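With 10 units and 5 treated, the exact test can be computed by brute-force enumeration of all 252 assignments. The sketch below shows one way to do that in R for a single simulated data set with skewed covariates. It is illustrative only: the helper `d2_stat`, the gamma shape parameter, and the seed are assumptions, and it is not the code linked from this post.

```r
set.seed(123)
N <- 10
M <- N / 2

# One gamma and one log-normal covariate (parameters chosen arbitrarily).
X <- cbind(rgamma(N, shape = 1), rlnorm(N))

# Hypothetical helper: d^2 for a given set of treated unit indices.
d2_stat <- function(X, treated) {
  N <- nrow(X)
  M <- length(treated)
  d <- colMeans(X[treated, , drop = FALSE]) - colMeans(X[-treated, , drop = FALSE])
  V <- (N / (M * (N - M))) * cov(X)
  as.numeric(t(d) %*% solve(V) %*% d)
}

# The "observed" assignment and its statistic.
observed <- sort(sample(N, M))
d2_obs <- d2_stat(X, observed)

# Enumerate every possible treated set; each column is one assignment.
assignments <- combn(N, M)
d2_all <- apply(assignments, 2, function(idx) d2_stat(X, idx))

# Exact p-value: share of assignments with a statistic at least as large.
p_exact <- mean(d2_all >= d2_obs - 1e-12)

# Chi-square approximation for comparison (assumes full-rank covariance).
p_approx <- pchisq(d2_obs, df = ncol(X), lower.tail = FALSE)

print(c(n_assignments = ncol(assignments), p_exact = p_exact, p_approx = p_approx))
```

Enumeration is only feasible for small designs; for larger ones a random sample of assignments would be the usual substitute.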

The results of the simulations are shown in the figure below. The top row is for the normal covariates, the second row for the gamma and log-normal covariates, and the bottom row for the binary covariates. I’ve graphed the histograms for the approximate and exact p-values in the left column; we want to see a uniform distribution. In the middle column is a scatter plot of the two p-values with a reference 45-degree line; we want to see them line up on the 45-degree line. In the right column, I’ve plotted the distribution of the computed $latex d^2$ statistic against the limiting $latex \chi^2_{rank[Cov(\mathbf{d})]}$ distribution; we want to see that they agree.

When the data are normal, the approximation is quite good, even with only 10 units. In the two latter cases, the approximation does not fare well, as the test statistic distribution deviates substantially from what would be expected asymptotically. The rather crazy-looking patterns that we see in the binary covariates case are due to the fact that only a small, discrete number of difference-in-means values are possible. Presumably in large samples this would smooth out.

What we find overall, though, is that the approximate p-value tends to be biased toward 0.5 relative to the exact p-value. Thus, when the exact p-value is large, the approximation is conservative, but as the exact p-value gets small, the approximation becomes anti-conservative. This is most severe in the skewed (gamma and log-normal) covariates case. In practice, one may have no way of knowing whether the relevant underlying covariate distribution is better approximated as normal or skewed. Thus, it would seem that one would always want to use the exact test in small samples.

Code demonstrating how to compute Hansen and Bowers’s approximate test, the exact test, as well as code for the simulations and graphics is here (multivariate exact code).

## Update

The general expression for $latex Cov(\mathbf{d})$ covering an unbalanced randomized design is,

$latex Cov(\mathbf{d}) = \frac{N}{M(N-M)}S(\mathbf{X})$.
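The general expression can be checked numerically by enumerating every assignment in a small unbalanced design and comparing the covariance of $latex \mathbf{d}$ across assignments with the formula. The R sketch below does this for an assumed design with 8 units and 3 treated; the covariates and names are made up for illustration.

```r
set.seed(7)
N <- 8    # total units (assumed for illustration)
M <- 3    # treated units, so the design is unbalanced

X <- cbind(rnorm(N), rexp(N))    # two arbitrary covariates

# Difference-in-means vector for every possible treated set.
assignments <- combn(N, M)
D <- t(apply(assignments, 2, function(idx) {
  colMeans(X[idx, , drop = FALSE]) - colMeans(X[-idx, , drop = FALSE])
}))

# Randomization covariance of d: since E(d) = 0, this is the average
# outer product of d over all equally likely assignments.
cov_enumerated <- crossprod(D) / nrow(D)

# Formula from the text, with S(X) the usual sample covariance matrix.
cov_formula <- (N / (M * (N - M))) * cov(X)

print(round(cov_enumerated, 6))
print(round(cov_formula, 6))
```

The two matrices should agree up to floating-point error, and setting `M <- N / 2` recovers the balanced-design factor $latex N/M^2$.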

# (technical) Post-treatment bias can be anti-conservative!

A little rant on the sad state of knowledge about post-treatment bias: For some reason I still see a lot of people using control strategies (typically, regression) that use post-treatment outcomes that are intermediate between the treatment and endpoint outcome of interest. I have heard people who do so say that this is somehow necessary to show that the “effects” that they estimate in the reduced form regression of treatment on endpoint outcome are not spurious. Of course this is incorrect. To show the relationship “goes away” after controlling for the intermediate outcome does not indicate that the effect is spurious. It could just as well be that the treatment affects the endpoint outcome mostly through the intermediate outcome.
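A small simulation illustrates the point. In this R sketch (simulated data; the effect sizes, sample size, and seed are arbitrary assumptions), the treatment z affects y only through the intermediate outcome d. The effect is real, yet it "goes away" once d is controlled for.

```r
set.seed(2024)
N <- 5000
z <- rbinom(N, 1, 0.5)    # randomized treatment
d <- z + rnorm(N)         # treatment raises the intermediate outcome
y <- d + rnorm(N)         # z affects y only through d; total effect is 1

# Reduced-form regression recovers the total effect of z (about 1).
b_short <- unname(coef(lm(y ~ z))["z"])

# Conditioning on the intermediate outcome drives the z coefficient
# toward 0, even though the effect of z on y is not spurious.
b_long <- unname(coef(lm(y ~ z + d))["z"])

print(round(c(short = b_short, long = b_long), 2))
```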

I have also heard people say that controlling for intermediate, post-treatment outcomes is somehow “conservative” because controlling for the post-treatment outcome “will only take away from the association” between the treatment and the outcome. Of course, this is also incorrect. Controlling for a post-treatment variable can easily be anti-conservative, producing a coefficient on the treatment that is substantially larger than the actual treatment effect. This happens when the intermediate outcome exhibits a “suppression” effect, for example, when the treatment has a negative association with the intermediate outcome, but the intermediate outcome then positively affects the endpoint outcome. Here is a straightforward demonstration (done in R):



N <- 200
z <- rbinom(N,1,.5)   # randomized binary treatment
ed <- rnorm(N)
d <- -z + ed          # intermediate outcome: treatment lowers d
ey <- rnorm(N)
y <- z + d + ey       # endpoint outcome: direct effect of z plus effect of d
print(coef(summary(lm(y~z))), digits=2)
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.0049       0.14   0.035     0.97
#> z            -0.1109       0.20  -0.555     0.58
print(coef(summary(lm(y~z+d))), digits=2)
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   -0.078      0.093   -0.84  4.0e-01
#> z              1.034      0.149    6.95  5.3e-11
#> d              1.046      0.064   16.23  3.6e-38

In the example above, z is the treatment variable, y is the endpoint outcome, and d is an intermediate outcome. (The data generating process resembles an experiment with binomial assignment.) By construction the total effect of z on y is zero: the direct effect (+1) is exactly offset by the indirect effect through d (-1 times +1). The first regression properly estimates this causal effect, and the estimate is indistinguishable from 0. The second regression shows the problem that arises when controlling for a post-treatment intermediate outcome: now the coefficient on z is about 1, with a very low p-value!
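A single draw could be a fluke, so it can help to repeat the exercise many times. The R sketch below wraps the same data generating process in a function (`one_draw` is a name introduced here for illustration, and the 500 replications and the seed are arbitrary choices) and averages the two coefficients on z across replications.

```r
set.seed(99)

# One simulated experiment using the same data generating process as above.
one_draw <- function(N = 200) {
  z <- rbinom(N, 1, 0.5)
  d <- -z + rnorm(N)
  y <- z + d + rnorm(N)
  c(short = unname(coef(lm(y ~ z))["z"]),       # no post-treatment control
    long  = unname(coef(lm(y ~ z + d))["z"]))   # controls for d
}

# Rows are "short" and "long"; columns are replications.
draws <- replicate(500, one_draw())
avg <- rowMeans(draws)

print(round(avg, 2))
```

Across replications the short regression should center on the true total effect of 0, while the regression that conditions on d should center on the direct effect of 1.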

## Update

A question I received offline was along the lines of “what if you control for the post-treatment variable and your effect estimate doesn’t change? Surely this strengthens the case that what you’ve found is not spurious.” I don’t think that is correct. The case for having a well identified effect estimate is based only on having properly addressed pre-treatment confounding. Showing that a post-treatment variable does not alter the estimate has no bearing on whether this has been achieved. Thus, post-treatment conditioning is pretty much useless for demonstrating that a causal relation is not spurious.
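To see why a stable estimate proves little, consider a simulated case (an R sketch; the confounder `u`, the functional forms, and the seed are all assumptions for illustration) in which the treatment has no effect at all, an unobserved pre-treatment confounder drives both treatment and outcome, and a post-treatment variable is added as a control.

```r
set.seed(11)
N <- 5000
u <- rnorm(N)                   # unobserved pre-treatment confounder
z <- rbinom(N, 1, plogis(u))    # treatment uptake depends on u (not randomized)
d <- z + rnorm(N)               # post-treatment variable with no effect on y
y <- u + rnorm(N)               # true effect of z on y is zero

# Both regressions omit u, so both are confounded.
b_short <- unname(coef(lm(y ~ z))["z"])
b_long <- unname(coef(lm(y ~ z + d))["z"])

print(round(c(short = b_short, long = b_long), 2))
```

Here the coefficient on z should barely move when d is added, yet both estimates are far from the true effect of zero: stability under post-treatment conditioning says nothing about whether pre-treatment confounding has been handled.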

The one case where post-treatment conditioning provides some causal content is mediation analysis. But there, exclusion restriction or effect-homogeneity assumptions have to hold; otherwise the mediation analysis may produce misleading results. On these points, I suggest looking at this very clear paper by Green, Ha, and Bullock (ungated preprint). A more elaborate paper (though not quite as intuitive in its presentation) is this one by Imai, Keele, Tingley, and Yamamoto (working paper).