Reflecting on some of the recent discussions of matching as a tool for causal analyses in social science (see here as well as this really nice commentary; hat tip, Chris Blattman), I wonder if it’s useful to make a distinction between “positive” versus “negative” causal identification.

### Clustering, unit level randomization, and inference (updated, Nov 5)

I wanted to look into the case where you have an experiment in which your units of analysis are naturally clustered (e.g., households in villages), but you randomize *within* clusters. The goal is to estimate a treatment effect in terms of difference in means, using design-based principles and frequentist methods for inference.

Randomization ensures that the difference in means is unbiased for the sample average treatment effect. Using only randomization as the basis for inference, the variance of this estimator is *not* identified for the sample, because it depends on the covariance between each unit’s treated and control potential outcomes, which are never observed together. But the usual sample estimators for the variance are conservative. If, however, the experiment is run on a random sample from an infinitely large population, then the standard estimators are unbiased for the mean and for the variance of the difference in means estimator applied to this population (see Neyman, 1990; Rubin, 1990; Imai, 2008; for finite populations, things are more complicated, and the infinite population assumption is often a reasonable approximation). As I understand it, these are the principles that justify the usual frequentist estimation techniques for inference on population level treatment effects in randomized experiments.
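To make the "unbiased but conservative" point concrete, here is a minimal sketch that enumerates every possible assignment in a tiny experiment. The potential outcomes are made-up numbers, the design is simple complete randomization (no clusters), and the function name `randomization_distribution` is my own; it is an illustration under those assumptions, not a general proof.

```python
from itertools import combinations
from statistics import mean, variance

# Hypothetical potential outcomes for N = 8 units (illustrative numbers only).
y0 = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 5.0, 7.0]  # outcome if assigned to control
y1 = [2.0, 2.0, 4.0, 3.0, 7.0, 5.0, 9.0, 8.0]  # outcome if assigned to treatment
n_treated = 4


def randomization_distribution(y0, y1, n_treated):
    """Enumerate every complete-randomization assignment and return a list of
    (difference in means, usual variance estimate) pairs, one per assignment."""
    n = len(y0)
    results = []
    for treated in combinations(range(n), n_treated):
        treated_set = set(treated)
        obs1 = [y1[i] for i in range(n) if i in treated_set]
        obs0 = [y0[i] for i in range(n) if i not in treated_set]
        diff = mean(obs1) - mean(obs0)
        # Usual estimator: s1^2 / n1 + s0^2 / n0, with (n - 1) sample variances.
        var_hat = variance(obs1) / len(obs1) + variance(obs0) / len(obs0)
        results.append((diff, var_hat))
    return results


results = randomization_distribution(y0, y1, n_treated)
diffs = [d for d, _ in results]
var_hats = [v for _, v in results]

# Sample average treatment effect, computable here only because both
# potential outcomes are (hypothetically) known for every unit.
sate = mean(t - c for t, c in zip(y1, y0))

expected_diff = mean(diffs)                                      # E[estimator]
true_variance = mean((d - expected_diff) ** 2 for d in diffs)    # Var[estimator]
expected_var_hat = mean(var_hats)                                # E[variance estimate]

print("SATE:", sate, "| mean of estimator:", expected_diff)
print("true variance:", true_variance)
print("expected variance estimate:", expected_var_hat)
```

Because the unit-level effects differ across units in this toy data, the expected variance estimate exceeds the true randomization variance. The gap is the variance of the unit-level effects divided by N, which is exactly the piece that depends on the unobservable covariance of potential outcomes.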

The question I had was, how should we account for dependencies in potential outcomes within clusters?

### Superpopulations and inference with “all the data”

Doug [Rivers, I presume?] added a nice comment on superpopulation versus (sampling) design-based approaches to inference in regression modeling (link). The comment responds to a post on Andy Gelman’s blog from over a year ago about what to do when one is trying to fit regressions to data on the “full population.”

One comment: Doug suggests that “the definition of the regression parameters is arbitrary (why not LAD or some other estimator applied to the population?) and it’s not obvious how to interpret the parameters.” My reading of Goldberger (reference) and Angrist and Pischke (reference) suggests that OLS is well motivated, based on the best linear approximation criterion. Going beyond the trivial case of a binary predictor variable (for which OLS is a convenient way to calculate mean differences), with a continuous predictor, linear approximation is an easy-to-interpret summary of the predictor’s relationship to the outcome variable. Along similar lines, logistic regression is well motivated as the maximum entropy estimator of a relationship to a binary outcome. So I am not sure that it is all that arbitrary.
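As a quick check on the binary-predictor case mentioned above, here is a small sketch showing that the OLS slope on a 0/1 predictor reproduces the difference in group means, and the intercept reproduces the mean of the 0 group. The data are made up, and `ols_slope_intercept` is a hypothetical helper written out by hand rather than taken from any library.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Simple one-predictor OLS: slope = S_xy / S_xx, intercept = y_bar - slope * x_bar."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Hypothetical data: x is a binary predictor, y is the outcome.
x = [0, 0, 0, 1, 1, 1, 1]
y = [2.0, 3.0, 4.0, 5.0, 7.0, 6.0, 8.0]

slope, intercept = ols_slope_intercept(x, y)

mean_x0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean_x1 = mean(yi for xi, yi in zip(x, y) if xi == 1)
mean_diff = mean_x1 - mean_x0

print("OLS slope:", slope, "| difference in means:", mean_diff)
print("OLS intercept:", intercept, "| mean of x == 0 group:", mean_x0)
```

With a continuous predictor the same slope formula gives the best linear approximation in the least-squares sense, which is the criterion the paragraph above appeals to.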

### More on matching and identification

At the Social Science Statistics blog, Richard Nielsen posts a response (link) to Chris Blattman’s recent post (link) on problems with the way people interpret what matching does.

### Tips on observational study research design

Here are slides from a talk I gave last week as part of a series hosted by the Applied Statistics Center at Columbia: PDF

The talk was intended for grad students working on dissertation research plans. The focus was on strategies for collecting data to analyze the effects of micro-level development policies. I tried to make a few points…