Should you use frequentist standard errors with causal estimates on population data? Yes.

Suppose you are studying the effects of some policy adopted at the state level in the United States, and you are using data from all 50 states to do it. Well, consider this:

When a researcher estimates a regression function with state level data, why are there standard errors that differ from zero? Clearly the researcher has information on the entire population of states. Nevertheless researchers typically report conventional robust standard errors, formally justified by viewing the sample as a random sample from a large population. In this paper we investigate the justification for positive standard errors in cases where the researcher estimates regression functions with data from the entire population. We take the perspective that the regression function is intended to capture causal effects, and that standard errors can be justified using a generalization of randomization inference. We show that these randomization-based standard errors in some cases agree with the conventional robust standard errors, and in other cases are smaller than the conventional ones.

From a new working paper on “Finite Population Causal Standard Errors” by the econometrics all-star team of Abadie, Athey, Imbens, and Wooldridge (updated link): link.

I have been to a few presentations of papers like this where someone in the audience thinks they are making a smart comment by noting that the paper uses population data, and so the frequentist standard errors “don’t really make sense.” Abadie et al. show that such comments are often misguided, arising from a confusion over how causal inference differs from descriptive inference. Sure — there is no uncertainty as to what is the value of the regression coefficient for this population given the realized outcomes. But the value of the regression coefficient is not the same as the causal effect.

To understand the difference, it helps to define causal effects precisely. A causal effect for a given unit in the population is most coherently defined to be a comparison between the outcome observed under a given treatment (being the “state level policy” in the case of the example above) and what would obtain were that same unit to be given another treatment. It is useful to imagine this schedule of treatment-value-specific outcomes as an array of “potential outcomes.” Population average causal effects take the average of the unit-level causal effects in a given population.

Now, suppose that there is some random (at least with respect to what the analyst can observe) process through which units in the population are assigned treatment values. Maybe this random process occurred because a bona fide randomized experiment was run on the population, or maybe it was the result of “natural” stochastic processes (that is, not controlled by the analyst). Then, for each unit we only get to observe the potential outcome associated with the treatment received, and not the other “counterfactual” potential outcomes associated with the other possible treatments. As such, we cannot actually construct the population average causal effect directly. Doing so would require that we were able to compute each of the unit-level causal effects. So, we have to estimate the population average causal effect using the incomplete potential outcomes data available to us. If the results of the random treatment assignment processes had turned out differently, the estimate we would obtain could very well differ as well (since there would be a different set of observed and unobserved potential outcomes). Even though we have data from everyone in the population, we are lacking the full schedule of potential outcomes that would allow us to estimate causal effects without uncertainty.
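A quick simulation makes the point concrete. Everything below is hypothetical (a made-up population of 50 "states" with a constant unit-level effect, analyzed with a simple difference in means), but it shows how the estimate varies across realizations of the random assignment even though we have data on the entire population:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 50 "states" with a full schedule of
# potential outcomes (in practice we never see this full schedule).
n = 50
y0 = rng.normal(10, 2, size=n)   # outcome each unit would have if untreated
y1 = y0 + 1.5                    # outcome if treated (true effect = 1.5 for all)
true_ate = np.mean(y1 - y0)      # population average causal effect

# Re-run the random assignment many times: same population,
# different realized data, hence different estimates.
estimates = []
for _ in range(2000):
    d = np.zeros(n, dtype=bool)
    d[rng.choice(n, size=n // 2, replace=False)] = True  # randomize 25 to treatment
    y_obs = np.where(d, y1, y0)  # only one potential outcome observed per unit
    estimates.append(y_obs[d].mean() - y_obs[~d].mean())

print(true_ate)                  # exactly 1.5, by construction
print(np.std(estimates))        # positive: uncertainty despite "population" data
```

The nonzero spread of the estimates across re-randomizations is exactly the uncertainty that the standard error is supposed to capture.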

As it turns out, the random treatment assignment process is directly analogous to the random sampling of potential outcomes, in which case we can use standard sample theoretic results to quantify our uncertainty and compute standard errors for our effect estimates. Furthermore, as a happy coincidence, such sample theoretic standard errors are either algebraically equivalent to, or conservatively approximated by, the “robust” standard errors that are common in current statistical practice. It’s a point that Peter Aronow and I made in our 2012 paper on using regression standard errors for randomized experiments (link), and a point that Winston Lin develops even further (link; also see Berk’s links in the comments below to Winston’s great discussion of these results). Abadie et al. take this all one step further by indicating that this framework for inference makes sense for observational studies too.
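One well-known instance of this equivalence can be checked numerically (a sketch with simulated data, not the paper's derivation): for a binary treatment, the HC2 robust variance for the treatment coefficient in a regression of the outcome on a constant and the treatment coincides exactly with the Neyman design-based estimator, the treated-group sample variance over the treated count plus the control-group sample variance over the control count.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
d = np.zeros(n)
d[rng.choice(n, size=25, replace=False)] = 1.0
y = 2.0 * d + rng.normal(size=n)          # simulated experimental data

# Neyman (design-based) variance estimator for the difference in means:
s1 = y[d == 1].var(ddof=1)
s0 = y[d == 0].var(ddof=1)
v_neyman = s1 / 25 + s0 / 25

# HC2 robust variance for the slope in y ~ 1 + d, computed by hand:
# (X'X)^-1 X' diag(e_i^2 / (1 - h_ii)) X (X'X)^-1
X = np.column_stack([np.ones(n), d])
XtX_inv = np.linalg.inv(X.T @ X)
e = y - X @ (XtX_inv @ X.T @ y)           # OLS residuals
h = np.sum((X @ XtX_inv) * X, axis=1)     # leverage values h_ii
meat = X.T @ (X * (e**2 / (1 - h))[:, None])
v_hc2 = (XtX_inv @ meat @ XtX_inv)[1, 1]

print(np.isclose(v_neyman, v_hc2))        # the two estimators coincide
```

The exact agreement here is the algebraic equivalence referenced above; with covariates or unequal-probability designs, the robust estimators are instead conservative approximations.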

Now, some of you might know of Rosenbaum’s work (e.g. this) and think this has all already been said. That’s true, to a point. But whereas Rosenbaum’s randomization inference makes use of permutation distributions for making probabilistic statements about specific causal hypotheses, Abadie et al.’s randomization inference allows one to approximate the randomization distribution of effect estimates without fixing causal hypotheses a priori. (See more on this point in this old blog post, especially in the comments: link).
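Rosenbaum-style randomization inference against a fixed sharp null can be sketched in a few lines (simulated data; the sharp null here is that treatment has no effect for any unit, so the observed outcomes are held fixed and only the assignment is re-randomized):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
d = np.zeros(n, dtype=bool)
d[rng.choice(n, size=25, replace=False)] = True
y = np.where(d, 2.0, 0.0) + rng.normal(size=n)   # toy data with a real effect

# Under Fisher's sharp null of no effect for any unit, y is fixed and only
# the assignment is random: permute the assignment and recompute the statistic.
obs = y[d].mean() - y[~d].mean()
perm = []
for _ in range(5000):
    d_star = rng.permutation(d)
    perm.append(y[d_star].mean() - y[~d_star].mean())
p = np.mean(np.abs(np.array(perm)) >= abs(obs))  # two-sided randomization p-value
print(p)
```

The Abadie et al. approach instead targets the randomization distribution of the effect estimator itself, without conditioning on a specific sharp hypothesis like this one.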


One-year post at UC-Berkeley with Berkeley Initiative for Transparency in the Social Sciences

The fine CEGA folks at UC-Berkeley are recruiting a quant-savvy social science grad to work with them on an important research transparency initiative:

Interested in improving the standards of rigor in empirical social science research? Eager to collaborate with leading economists, political scientists and psychologists to promote research transparency? Wishing to stay abreast of new advances in empirical research methods and transparency software development? The Berkeley Initiative for Transparency in the Social Sciences (BITSS) is looking for a Program Associate to support the initiative’s evaluation and outreach efforts. The candidate is expected to engage actively with social science researchers to raise awareness of new and emerging tools for research transparency. Sounds like fun? Apply now!

More information is here: link.


Big data and social science: Mullainathan’s Edge talk

The embedded video links to an Edge talk with Sendhil Mullainathan on the implications of big data for social science. His thoughts come out of research he is doing with computer scientist Jon Kleinberg [website] applying methods for big data to questions in behavioral economics.

Mullainathan focuses on how inference is affected when datasets increase widthwise in the number of features measured—that is, increasing “K” (or “P” for you ML types). The length of the dataset (“N”) is, essentially, just a constraint on how effectively we can work with K. From this vantage point, the big data “revolution” is the fact that we can very cheaply construct datasets that are very deep in K. He proposes that with really big K, such that we have data on “everything,” we can switch to more “inductive” forms of hypothesis testing. That is, we can dump all those features into a machine learning algorithm to produce a rich predictive model for the outcome of interest. Then, we can test an hypothesis about the importance of some variable by examining the extent to which the model relies on that variable for generating predictions.
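To make the "inductive" testing idea concrete, here is a minimal sketch of my own construction (not Mullainathan and Kleinberg's actual procedure): fit a predictive model on many features, then score a variable by how much held-out predictive accuracy degrades when that variable is scrambled. A plain linear predictor stands in for a richer ML model:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 1000, 20                          # "big K" in miniature
X = rng.normal(size=(n, k))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # only the first feature matters

# Fit a predictive model on a training split; OLS on all K features is a
# stand-in for whatever ML algorithm one would actually use.
train, test = slice(0, 800), slice(800, None)
beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
base_mse = np.mean((y[test] - X[test] @ beta) ** 2)

def importance(j):
    """Rise in held-out MSE when feature j is randomly permuted."""
    Xp = X[test].copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature-outcome link
    return np.mean((y[test] - Xp @ beta) ** 2) - base_mse

print(importance(0))   # large: the model relies heavily on feature 0
print(importance(1))   # near zero: feature 1 is irrelevant to prediction
```

Note that this measures reliance for prediction, which is precisely what the criticisms below say is not the same thing as a causal effect.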

I see three problems with this approach. First, just like traditional null hypothesis testing, it is geared toward up-or-down judgments about “significance” rather than parameter (or “effect size”) estimation. That leaves the inductive approach just as vulnerable to fishing, p-hacking, and related problems that occur with current null hypothesis testing.* It also greatly limits what we really learn from an analysis (statistical significance is not substantive significance, and so on). Second, scientific testing is typically some form of causal inference, and yet the inductive-predictive approach that Mullainathan described in his talk is oddly blind to questions of causal identification. (To be fair, this is a point that Mullainathan admits in his talk.) The possibilities of post-treatment bias and bias amplification are two reasons that including more features does not always yield better results when doing causal inference (although bias amplification problems would typically diminish as one approaches having data on “everything”). Thus, without careful attention to post-treatment bias, for example, the addition of features in an analysis can lead you to conclude mistakenly that a variable of interest has no causal effect when in fact it does. The third problem goes along with a point that Daniel Kahneman makes toward the end of the video: the predictive strength of a variable relative to other variables is not an appropriate criterion for testing an hypothesized cause-effect relationship. But the inductive approach that Mullainathan describes would be based, essentially, on measuring relative predictive strength.

Nonetheless, the talk is thought provoking and well worth watching. I also found the comments by Nicholas Christakis toward the end of the talk to be very thoughtful.

*Zach raises a good question about this in the comments below. My reply basically agrees with him.


jobs: reducing gender inequality research consultancy with World Bank

Working in collaboration with various partners, [the consultant] will focus mainly on the design, implementation, and data analysis for a set of rigorous impact evaluation studies, also working to design innovative development interventions to address gender inequality.

See the attached terms of reference for more details including how to apply: [PDF].


Meta-analysis and effect synthesis: what, exactly, is the deal?

Suppose we have perfectly executed and perfectly consistent, balanced randomized controlled trials for a binary treatment applied to populations 1 and 2. Suppose that even the sample sizes are the same in each trial ($latex n$). We obtain consistent treatment effect estimates $latex \hat \tau_1$ and $latex \hat \tau_2$ from each, respectively, with consistent estimates of the asymptotic variances of $latex \hat \tau_1$ and $latex \hat \tau_2$ computed as $latex \hat v_1$ and $latex \hat v_2$, respectively. As far as asymptotic inference goes, suppose we are safe to assume that $latex \sqrt{n}(\hat \tau_1 - \tau_1) \overset{d}{\rightarrow} N(0, V_1)$ and $latex \sqrt{n}(\hat \tau_2 - \tau_2) \overset{d}{\rightarrow} N(0, V_2)$, with $latex n\hat v_1 \overset{p}{\rightarrow} V_1$ and $latex n\hat v_2 \overset{p}{\rightarrow} V_2$.* (This is pretty standard notation, where $latex \overset{d}{\rightarrow}$ is convergence in distribution, and $latex \overset{p}{\rightarrow}$ is convergence in probability, as the sample sizes for each experiment grow large.) Even with the same sample size in both populations, we may have that $latex V_1 > V_2$, because outcomes are simply noisier in population 1. Suppose this is the case.

A standard meta-analytical effect synthesis will compute a synthesized effect by taking a weighted average where the weights are functions, either in part or in their totality, of the inverses of the estimated variances. That is, the weights will be close or equal to $latex 1/\hat v_1$ and $latex 1/\hat v_2$. Of course, if $latex \tau_1 = \tau_2 = \tau$, then this inverse variance weighted mean is the estimator of $latex \tau$ with the smallest asymptotic variance. This is the classic minimum distance estimation result. The canonical econometrics reference for the optimality of the inverse variance weighted estimator for general problems is Hansen (1982) [link], although it is covered in any graduate econometrics textbook.
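Concretely, with made-up numbers for the two trials (the first estimate deliberately noisier, mirroring $latex V_1 > V_2$), the inverse variance weighted synthesis looks like this:

```python
import numpy as np

# Hypothetical trial results: (estimate, estimated variance) for each study.
tau_hat = np.array([0.40, 0.25])   # study 1 is the noisier one
v_hat = np.array([0.04, 0.01])

w = 1.0 / v_hat                              # inverse-variance weights
tau_pooled = np.sum(w * tau_hat) / np.sum(w) # weighted average of the estimates
v_pooled = 1.0 / np.sum(w)                   # variance of the pooled estimate
print(tau_pooled, np.sqrt(v_pooled))
```

The pooled estimate lands much closer to the precise study's 0.25 than to the noisy study's 0.40, which is exactly the behavior questioned below when the two underlying effects are not assumed equal.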

But what if there is no reason to assume $latex \tau_1 = \tau_2 = \tau$? Then, how should we interpret the inverse variance weighted mean, which for finite samples would tend to give more weight to $latex \hat \tau_2$? Perhaps one could interpret it in Bayesian terms. From a frequentist perspective though, which would try to relate this to stable population parameters, it seems to be interpretable only as “a good estimate of what you get when you compute the inverse variance weighted mean from the results of these two experiments,” which of course gets us nowhere.

Now, I know that meta-analysis textbooks talk about how, when it doesn’t make sense to assume $latex \tau_1 = \tau_2$, one should seek to explain the heterogeneity rather than produce synthesized effects. But the standard approaches for doing so rely on assumptions of conditional exchangeability; that is, replacing $latex \tau_1 = \tau_2$ with $latex \tau_1(x) = \tau_2(x)$, where these are effects for subpopulations defined by a covariate profile $latex x$. Then, we effectively apply the same minimum distance estimation logic, using inverse variance weighting to estimate the common $latex \tau(x)$, most typically with an inverse variance weighted linear regression on the components of $latex x$. The modeling assumptions are barely any weaker than what one assumes to produce the synthesized estimate. So does this really make any sense either?
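A sketch of that meta-regression step, with simulated study-level data and a single moderator (my own toy example; the "true" moderation model is linear by construction, which is precisely the strong assumption at issue):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical study-level data: J studies with effect estimates tau_hat_j,
# estimated variances v_hat_j, and one moderator x_j.
J = 30
x = rng.uniform(0, 1, size=J)
v_hat = rng.uniform(0.01, 0.1, size=J)
tau_hat = 0.2 + 0.5 * x + rng.normal(scale=np.sqrt(v_hat))

# Inverse variance weighted least squares of tau_hat on (1, x): the same
# minimum distance logic, now assuming tau_j(x) is one common linear
# function of x across studies.
X = np.column_stack([np.ones(J), x])
W = np.diag(1.0 / v_hat)
coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ tau_hat)
print(coef)   # close to the true (0.2, 0.5) only because the model is correct
```

If the linear-in-x exchangeability assumption fails, these coefficients inherit the same interpretive problems as the pooled estimate itself.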

It seems pretty clear to me that the meta-analysis literature is in need of a “credibility revolution” along the same lines as we’ve seen in the broader causal inference literature. That means (i) thinking harder about the estimands that are the focus of the analysis, (ii) entertaining an assumption of rampant effect heterogeneity, and (iii) understanding the properties and robustness of estimators under (likely) misspecification of the relationship between variables that characterize the populations we study (the $latex X_j$s for populations indexed by $latex j$) and the estimates we obtain from them (the $latex \hat \tau_j$’s).

*Edited based on Winston’s corrections!