Tips on observational study research design

Here are slides from a talk I gave last week as part of a series hosted by the Applied Statistics Center at Columbia: PDF

The talk was intended for grad students working on dissertation research plans.  The focus was on strategies for collecting data to analyze the effects of micro-level development policies.  I tried to make a few points, including:

  1. If your research design uses matching to control for confounding (rather than, say, an instrumental variable), you should still have a verifiable source (or sources) of exogenous variation that can explain why two units with the same background characteristics may nonetheless differ in whether they were exposed to the program.  To simply match on the available variables and then claim that differences in program exposure can now be considered “random” is not convincing.  There may be a single exogenous source of variation, or there may be a collection of fortuitous accidents.  I gave as an example my study with Michael Gilligan and Eric Mvukiyehe on the effects of an ex-combatant reintegration program in Burundi (PDF).  There, the reason that some ex-combatants did not get the program was a bureaucratic dispute that caused one of the implementing NGOs to fail to deliver benefits.  The matching in that study was used to control for the “incidental” differences in the personal and community-level characteristics of ex-combatants who were designated to receive benefits from that particular NGO.  These thoughts echo some of what Chris Blattman has said recently about studies that fallaciously claim that matching somehow solves causal identification problems (link). I agree with Chris: matching is nothing more than a way to circumvent certain modeling assumptions.
  2. There is a lot of work that you can do before you hit the field to improve your data collection strategy.  This includes taking available data from studies comparable to the one that you are proposing and using them to design simulations that assess power or the robustness of various design alternatives.  There are tons of datasets out there that you can find on Google.
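
To make the second point concrete, here is a minimal sketch of a simulation-based power check.  It is only an illustration under stated assumptions: the `pilot` array is synthetic and stands in for outcomes taken from a comparable existing study, the treatment effect is assumed to be a constant additive shift, assignment is assumed to be simple and random, and the test is a difference in means with a normal approximation.  The function name `simulated_power` and all parameter values are hypothetical.

```python
import numpy as np

def simulated_power(pilot_outcomes, n_per_arm, effect, n_sims=2000, seed=0):
    """Share of simulated studies in which a difference in means is
    significant at the two-sided 5% level (normal approximation)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Resample outcomes from the existing data to mimic a fresh sample.
        control = rng.choice(pilot_outcomes, size=n_per_arm, replace=True)
        # Assumption: the program shifts every outcome by the same amount.
        treated = rng.choice(pilot_outcomes, size=n_per_arm, replace=True) + effect
        diff = treated.mean() - control.mean()
        se = np.sqrt(treated.var(ddof=1) / n_per_arm
                     + control.var(ddof=1) / n_per_arm)
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / n_sims

# Synthetic stand-in for outcome data from a comparable study.
pilot = np.random.default_rng(1).normal(loc=0.0, scale=1.0, size=500)

# Compare candidate sample sizes for an assumed effect of 0.25.
for n in (50, 100, 200, 400):
    print(n, simulated_power(pilot, n_per_arm=n, effect=0.25))
```

The same loop can be adapted to compare design alternatives (for example, clustered versus individual-level sampling) by changing how the simulated samples are drawn.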

There was a lively Q&A.  One person asked how we find those “verifiable sources of exogenous variation” and I answered that there is no recipe.  You come across them when you are working through the fine details of whatever you are studying.  You just need to be trained to recognize them.  Another person asked a technical question about how to integrate sampling weights into a matching estimator.  My reply was that if you are interested in estimating the effect of the treatment on the treated, you simply set for yourself the target of reweighting the “control” units so that they balance the weighted sample of “treated” units.
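
The reweighting idea in that last answer can be sketched in a few lines.  This is a simplified illustration, not the estimator from any particular study: it assumes exact matching on a single discrete covariate (`stratum`), and the function name `att_with_sampling_weights` and the toy data are hypothetical.  Within each stratum, control weights are rescaled so that they sum to the sampling-weighted total of the treated units in that stratum.

```python
import numpy as np

def att_with_sampling_weights(y, treated, stratum, sw):
    """Estimate the effect of treatment on the treated by reweighting
    controls to balance the sampling-weighted treated units by stratum."""
    y, treated, stratum, sw = map(np.asarray, (y, treated, stratum, sw))
    t = treated.astype(bool)
    # Treated units keep their sampling weights; controls start at zero,
    # so controls in strata with no treated units drop out.
    w = np.where(t, sw, 0.0).astype(float)
    for s in np.unique(stratum[t]):
        in_s = stratum == s
        target = sw[in_s & t].sum()   # weighted treated total in stratum
        base = sw[in_s & ~t]          # sampling weights of the controls
        if base.size == 0:
            raise ValueError(f"no control units in stratum {s}")
        # Rescale control weights so they sum to the treated total.
        w[in_s & ~t] = base * target / base.sum()
    att = np.average(y[t], weights=w[t]) - np.average(y[~t], weights=w[~t])
    return att, w

# Toy data: the outcome is a stratum baseline (1 or 5) plus 2 if treated.
stratum = np.array([0, 0, 0, 1, 1, 1, 1])
treated = np.array([1, 0, 0, 1, 1, 0, 0])
y = np.array([3.0, 1.0, 1.0, 7.0, 7.0, 5.0, 5.0])
sw = np.array([2.0, 1.0, 3.0, 1.0, 1.0, 4.0, 2.0])

att, w = att_with_sampling_weights(y, treated, stratum, sw)
print(att)
```

With continuous covariates or many covariates, the same target applies, but the weights would come from some other balancing procedure rather than exact strata.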

All in all a nice discussion.  Macartan Humphreys also presented, discussing some of the challenges of implementing a rigorous sampling design in places where ex ante information on population sizes, locations, etc. is sparse.