Should you use frequentist standard errors with causal estimates on population data? Yes.

Suppose you are studying the effects of some policy adopted at the state level in the United States, and you are using data from all 50 states to do it. Well,

When a researcher estimates a regression function with state level data, why are there standard errors that differ from zero? Clearly the researcher has information on the entire population of states. Nevertheless researchers typically report conventional robust standard errors, formally justified by viewing the sample as a random sample from a large population. In this paper we investigate the justification for positive standard errors in cases where the researcher estimates regression functions with data from the entire population. We take the perspective that the regression function is intended to capture causal effects, and that standard errors can be justified using a generalization of randomization inference. We show that these randomization-based standard errors in some cases agree with the conventional robust standard errors, and in other cases are smaller than the conventional ones.

From a new working paper on “Finite Population Causal Standard Errors” by the econometrics all-star team of Abadie, Athey, Imbens, and Wooldridge (updated link): link.

I have been to a few presentations of papers like this where someone in the audience thinks they are making a smart comment by noting that the paper uses population data, and so the frequentist standard errors “don’t really make sense.” Abadie et al. show that such comments are often misguided, arising from a confusion over how causal inference differs from descriptive inference. Sure — there is no uncertainty as to what is the value of the regression coefficient for this population given the realized outcomes. But the value of the regression coefficient is not the same as the causal effect.

To understand the difference, it helps to define causal effects precisely. A causal effect for a given unit in the population is most coherently defined as a comparison between the outcome observed under a given treatment (the “state level policy” in the example above) and the outcome that would obtain were that same unit given another treatment. It is useful to imagine this schedule of treatment-value-specific outcomes as an array of “potential outcomes.” Population average causal effects take the average of the unit level causal effects in a given population.

Now, suppose that there is some random (at least with respect to what the analyst can observe) process through which units in the population are assigned treatment values. Maybe this random process occurred because a bona fide randomized experiment was run on the population, or maybe it was the result of “natural” stochastic processes (that is, not controlled by the analyst). Then, for each unit we only get to observe the potential outcome associated with the treatment received, and not the other “counterfactual” potential outcomes associated with the other possible treatments. As such, we cannot construct the population average causal effect directly, since doing so would require computing each of the unit-level causal effects. So, we have to estimate the population average causal effect using the incomplete potential outcomes data available to us. If the random treatment assignment process had turned out differently, the estimate we would obtain could very well differ too (since there would be a different set of observed and unobserved potential outcomes). Even though we have data from everyone in the population, we lack the full schedule of potential outcomes that would allow us to compute causal effects without uncertainty.
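To make the argument concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything taken from the Abadie et al. paper: the 50 units, the normally distributed potential outcomes, the complete randomization of 25 units to treatment, and the function name `simulate_randomization_distribution`. The potential outcomes are held fixed (there is no sampling from a larger population), and only the treatment assignment is redrawn.

```python
import random
import statistics


def simulate_randomization_distribution(n_units=50, n_treated=25,
                                        n_draws=2000, seed=0):
    """Re-randomize treatment over a FIXED population of potential outcomes."""
    rng = random.Random(seed)

    # Hypothetical potential outcomes, fixed once for every unit.
    y0 = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
    y1 = [y + 1.0 + rng.gauss(0.0, 0.5) for y in y0]

    # The population average causal effect is a fixed number,
    # but it needs both columns, which we never observe together.
    true_ate = statistics.fmean([b - a for a, b in zip(y0, y1)])

    units = list(range(n_units))
    estimates = []
    for _ in range(n_draws):
        # One realization of the assignment process.
        treated = set(rng.sample(units, n_treated))
        # Only one potential outcome per unit is revealed.
        obs_treated = [y1[i] for i in units if i in treated]
        obs_control = [y0[i] for i in units if i not in treated]
        estimates.append(statistics.fmean(obs_treated)
                         - statistics.fmean(obs_control))
    return true_ate, estimates


true_ate, estimates = simulate_randomization_distribution()
print("population average effect:", round(true_ate, 3))
print("mean of estimates:", round(statistics.fmean(estimates), 3))
print("std. dev. of estimates:", round(statistics.stdev(estimates), 3))
```

The spread of `estimates` is strictly positive even though every unit in the population is used in every draw. That spread is the uncertainty a standard error is meant to describe.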

As it turns out, the random treatment assignment process is directly analogous to the random sampling of potential outcomes, in which case we can use standard sample theoretic results to quantify our uncertainty and compute standard errors for our effect estimates. Furthermore, as a happy coincidence, such sample theoretic standard errors are algebraically equivalent to, or conservatively approximated by, the “robust” standard errors that are common in current statistical practice. (This previous sentence was revised so that its meaning is now clearer.) It’s a point that Peter Aronow and I made in our 2012 paper on using regression standard errors for randomized experiments (link), and a point that Winston Lin develops even further (link; also see Berk’s links in the comments below to Winston’s great discussion of these results). Abadie et al. take this all one step further by indicating that this framework for inference makes sense for observational studies too.
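The “conservative” part can be checked numerically in the simplest case, a difference in means under complete randomization. The sketch below is an illustration under assumed data, not the derivation in any of the papers cited here. It uses the classical Neyman result that the randomization variance of the difference in means is S1²/n1 + S0²/n0 − Sτ²/N, where the S² terms are finite-population variances of the treated potential outcomes, the control potential outcomes, and the unit-level effects. The usual unequal-variance estimator, s1²/n1 + s0²/n0, stands in for the robust regression variance in this two-group case. The function names and the simulated population are hypothetical.

```python
import random
import statistics


def neyman_variance_estimate(treated_outcomes, control_outcomes):
    """Usual unequal-variance estimator for a difference in means."""
    return (statistics.variance(treated_outcomes) / len(treated_outcomes)
            + statistics.variance(control_outcomes) / len(control_outcomes))


def true_randomization_variance(y0, y1, n_treated):
    """Neyman's formula; needs the full schedule of potential outcomes."""
    n_units = len(y0)
    n_control = n_units - n_treated
    effects = [b - a for a, b in zip(y0, y1)]
    return (statistics.variance(y1) / n_treated
            + statistics.variance(y0) / n_control
            - statistics.variance(effects) / n_units)


rng = random.Random(1)
n_units, n_treated = 50, 20

# Hypothetical fixed population with heterogeneous unit-level effects.
y0 = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
y1 = [a + rng.gauss(1.0, 1.0) for a in y0]

true_var = true_randomization_variance(y0, y1, n_treated)

# Average the feasible estimator over many re-randomizations.
draws = []
for _ in range(5000):
    treated = set(rng.sample(range(n_units), n_treated))
    obs_treated = [y1[i] for i in range(n_units) if i in treated]
    obs_control = [y0[i] for i in range(n_units) if i not in treated]
    draws.append(neyman_variance_estimate(obs_treated, obs_control))
avg_estimated_var = statistics.fmean(draws)

print("true randomization variance:", round(true_var, 4))
print("average estimated variance:", round(avg_estimated_var, 4))
```

The gap between the two numbers is the Sτ²/N term, which cannot be estimated because no unit reveals both potential outcomes. When effects are constant across units that term is zero and the two quantities agree in expectation.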

Now, some of you might know of Rosenbaum’s work (e.g. this) and think this has all already been said. That’s true, to a point. But whereas Rosenbaum’s randomization inference makes use of permutation distributions for making probabilistic statements about specific causal hypotheses, Abadie et al.’s randomization inference allows one to approximate the randomization distribution of effect estimates without fixing causal hypotheses a priori. (See more on this point in this old blog post, especially in the comments: link).
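For contrast, a permutation test in the Fisher and Rosenbaum style fixes a specific causal hypothesis first. A minimal sketch follows, assuming the sharp null that treatment had no effect on any unit and using made-up outcome data. Under that null the observed outcomes would be unchanged by any other assignment, so relabeling the units traces out the exact reference distribution of the test statistic. The function names and the numbers are illustrative only.

```python
import random
import statistics


def diff_in_means(outcomes, treated_flags):
    treated = [y for y, t in zip(outcomes, treated_flags) if t]
    control = [y for y, t in zip(outcomes, treated_flags) if not t]
    return statistics.fmean(treated) - statistics.fmean(control)


def permutation_p_value(outcomes, treated_flags, n_perms=5000, seed=0):
    """Two-sided Monte Carlo p-value for the sharp null of no effect."""
    rng = random.Random(seed)
    observed = abs(diff_in_means(outcomes, treated_flags))
    flags = list(treated_flags)
    at_least_as_extreme = 0
    for _ in range(n_perms):
        # Under the sharp null, outcomes stay put and only labels move.
        rng.shuffle(flags)
        if abs(diff_in_means(outcomes, flags)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_perms + 1)


# Made-up data: first six units treated, last six control.
outcomes = [3.1, 2.8, 3.6, 3.3, 2.9, 3.4, 1.9, 2.2, 1.7, 2.4, 2.0, 2.1]
treated_flags = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

p_value = permutation_p_value(outcomes, treated_flags)
print("permutation p-value:", p_value)
```

Note what this does not deliver: a standard error for the effect estimate. Getting an interval out of it means inverting the test over a family of hypothesized effects, which is where additional assumptions come in.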

4 Replies to “Should you use frequentist standard errors with causal estimates on population data? Yes.”

Hi Cyrus,

Great post, thanks. Just FYI for your readers, Winston blogged about his paper linked above here (http://blogs.worldbank.org/impactevaluations/node/847) and here (http://blogs.worldbank.org/impactevaluations/node/849).

Thanks for the reminder, Berk — I’ve now noted this above.

Cyrus and Berk, thanks for your kind words about my blog posts (which Berk did a lot to help improve) and paper.

I like Cyrus’s phrase “happy coincidence” in “as a happy coincidence, such sample theoretic standard errors are algebraically equivalent or even somewhat conservative relative to the ‘robust’ standard errors …”. Writing to a friend recently, I called it a divine coincidence, stealing a phrase from Olivier Blanchard and my ex-classmate Jordi Gali. To limit the length of my paper and blog posts, I didn’t try to give intuition for this coincidence. But it’s basically “two wrongs make a right”, and for anyone unfamiliar with it, I recommend two expositions of the simplest case (the difference in means from a completely randomized experiment):

Freedman, Pisani, and Purves, Statistics, 4th ed., pp. 508-511 and footnotes
Reichardt and Gollob, “Justifying the use and increasing the power of a t test for a randomized experiment with a convenience sample”, Psychological Methods, Vol 4(1), Mar 1999, 117-128

On Rosenbaum’s (and R.A. Fisher’s) permutation-test style of randomization inference: Permutation inference has the advantage of not relying on large-sample approximations, but as Cyrus and others have written, when permutation tests are inverted to construct confidence intervals, the validity of those CIs can depend on additional assumptions. People who accept this point sometimes argue that it’s still fine to use classical permutation inference to test the “strong” null hypothesis that treatment had no effect on anyone. However, it’s not clear to me that we should be satisfied with mere validity of a test (in the technical sense of correct Type I error rate). Fay and Proschan have a very useful (but technical) discussion of desirable properties for tests (section 3) that concludes, “Most reasonable tests are at least PAV [pointwise asymptotically valid], asymptotically strictly unbiased, and monotonic.” (I need to get back to other things, so I won’t try to explain here what an unbiased test is.)

In sections 1.2 and 4.3 of my thesis, I gave a more informal discussion (with simulations and pointers to other people’s papers) of the Wilcoxon-Mann-Whitney rank sum test (a test recommended by Rosenbaum, Imbens, and others). I wrote that the rank sum test “is valid for the strong null, but it is sensitive to certain kinds of departures from the strong null and not others. For example, it is more likely to reject the null when treatment narrows the spread of the outcome distribution and there are more treated than control patients, or when treatment widens the spread and there are more control than treated patients. It is less likely to reject when the opposite is true. These properties complicate the test’s interpretation and are probably not well-known to most of its users.”
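The asymmetry described above can be seen in a small simulation. This is a rough sketch under assumptions of its own, not a reproduction of the thesis simulations: outcomes are drawn independently from normal distributions with the same center and different spreads (so treatment changes the spread but not the average), group sizes are 60 and 20, and the rank sum test uses the usual normal approximation without a tie correction. The function names are hypothetical.

```python
import math
import random


def rank_sum_rejects(treated, control, z_crit=1.96):
    """Two-sided rank sum test via the normal approximation (no ties)."""
    n_t, n_c = len(treated), len(control)
    labelled = sorted([(v, 1) for v in treated] + [(v, 0) for v in control])
    # Sum of the ranks held by treated units in the pooled ordering.
    w = sum(rank for rank, (_, is_treated) in enumerate(labelled, start=1)
            if is_treated)
    mean_w = n_t * (n_t + n_c + 1) / 2
    var_w = n_t * n_c * (n_t + n_c + 1) / 12
    return abs(w - mean_w) / math.sqrt(var_w) > z_crit


def rejection_rate(n_treated, n_control, sd_treated, sd_control,
                   n_sims=2000, seed=0):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        treated = [rng.gauss(0.0, sd_treated) for _ in range(n_treated)]
        control = [rng.gauss(0.0, sd_control) for _ in range(n_control)]
        rejections += rank_sum_rejects(treated, control)
    return rejections / n_sims


# Treatment narrows the spread and there are more treated units.
rate_narrow_large = rejection_rate(60, 20, sd_treated=0.25, sd_control=1.0)
# Treatment narrows the spread and there are fewer treated units.
rate_narrow_small = rejection_rate(20, 60, sd_treated=0.25, sd_control=1.0)

print("narrow spread, larger treated group:", rate_narrow_large)
print("narrow spread, smaller treated group:", rate_narrow_small)
```

With a nominal 5% level, the first configuration rejects more often than 5% and the second less often, even though the centers of the two distributions are identical in both.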

After graduating, I came across some very useful critiques of rank tests by Thomas Lumley, who highlights a different issue. The slides at the second link also have very thoughtful comments on teaching statistics.

http://notstatschat.tumblr.com/post/63237480043/rock-paper-scissors-wilcoxon-test

http://cbe.anu.edu.au/media/2753826/transitive-anu.pdf

http://faculty.washington.edu/tlumley/vanderbilt-seminar.pdf

Thanks, Winston. The Lumley material is quite interesting. I’ve never seen a discussion of the intransitivity of tests based on ordinal statistics (e.g., ranks), but it makes perfect sense. I see it as another reason to put stock in methods geared toward estimation, as it is in doing so that one operates on the scale necessary to make the kinds of tradeoffs that resolve such intransitivity problems.
