hidden confounding in survey experiments

I had the opportunity to participate in a fun seminar via Skype with faculty and students at Uppsala University’s department of peace and conflict research. We were discussing exciting new avenues for using experimental methods to study microfoundations of conflict and IR theories. The discussion was led by Allan Dafoe (Berkeley, visiting at Uppsala), who is doing really interesting work on reputation and strategic interaction (link).

An interesting point on “hidden confounding” in survey experiments came up that I don’t think gets enough play, so I thought I’d relay it here as a reference and also to see whether others have any input. A common approach in a survey experiment is to provide subjects with hypothetical scenarios. The experimental treatments then consist of variations on the content of the scenarios.

What makes this kind of research so intriguing is that it seems to let you obtain exogenous variation in circumstances that rarely obtains in the real world. Thus, if your experiment involves a scenario about an international negotiation over a dispute, you could vary, say, the regimes from which the negotiators come in a manner that occurs only infrequently in reality.

The problem is that subjects come to a survey experiment with prior beliefs about “what things go with what”—that is, about how salient features correlate. In our example, people will tend to associate regime types with things like national wealth or region. In that case, by manipulating the negotiators’ regime types in the experiment, you are implicitly changing people’s beliefs about other features of the countries from which the negotiators come. You can try to hold these things “constant”—e.g., by having one treatment where negotiator A comes from a “rich democracy” and another where negotiator A comes from a “rich dictatorship”—but to the extent that you are creating a scenario that departs from what typically occurs in the real world, you might be causing the subject to wonder whether we are talking about some “unusual” circumstance. If so, the subject might apply a different evaluative framework than what the subject would apply to “usual” circumstances. Thus, you are obtaining a causal estimate that is dependent on the frame of reference, which may not be generalizable.
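To make the leakage mechanism concrete, here is a toy simulation in Python (everything in it, from the effect sizes to the assumption that respondents infer "wealth" from regime type, is invented purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Treatment D: the scenario says the negotiator comes from a
    # democracy (1) versus a dictatorship (0).
    D = rng.integers(0, 2, n)

    # Hidden confounding: respondents privately infer the country's
    # wealth W from its regime type, because the two correlate in the
    # real world (probabilities invented for illustration).
    W = rng.binomial(1, np.where(D == 1, 0.7, 0.3))

    # Suppose evaluations Y respond to inferred wealth (coefficient 0.5)
    # while the true direct effect of regime type is only 0.2.
    Y = 0.2 * D + 0.5 * W + rng.normal(0, 1, n)

    # The naive difference in means attributes the leakage to treatment:
    # roughly 0.2 + 0.5 * (0.7 - 0.3) = 0.4, double the direct effect.
    print(Y[D == 1].mean() - Y[D == 0].mean())

The point is that random assignment of the scenario text does not imply random assignment of everything the scenario text leads respondents to believe.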

It’s a bit thorny, so what are solutions? Ironically, it seems to me that one solution would be to focus the experiment on treatments that are “plausibly exogenous.” One could focus on conditions that respond easily to choices, and where choices in either direction are conceivable. Or, one could focus the experiment on things that can vary randomly—like weather, most famously. I find this ironic because it seems that the survey experiment doesn’t get us very far from what we attempt to do with natural experiments. It would seem that the sweet spot for survey experiments would be for things that we are pretty sure could occur as a natural experiment, but either haven’t occurred often enough or haven’t been measured, in which case we can’t just study the natural experiment directly. Applying this rule would greatly limit the areas of application for survey experiments, but I think this formula would result in survey experiments that have more credible causal interpretations.

(By the way, Allan clued me into a discussion of this very point in a current working paper by Michael Tomz and Jessica Weeks: link.)

UPDATE: Allan provided this initial reaction:

I actually think the problem with survey experiments is a bit worse than you describe. It’s not enough to avoid confounding in survey experiments by focusing on factors that are plausibly manipulable; one also has to vary factors that are, in the population, typically uncorrelated with other factors of interest, given the scenario. That is, one wants respondents to believe that Pr(Z|X1) = Pr(Z|X2), where X1 and X2 are two values of the treatment condition and Z is any other factor of potential relevance that is not a consequence of treatment. For example, the decision of whether the US should stay in Afghanistan (X1) is plausibly manipulable and could plausibly go either way; Obama could decide to leave (X2). But even though such a counterfactual is plausible and could involve a hypothetical manipulation, we are unlikely to believe that Pr(Z|X1) = Pr(Z|X2), where Z could be domestic support for the war, or the strength of the US economy, or the resilience of the Taliban. So perhaps this implies that the only treatments that will not generate information leakage are either (1) those that are exogenous to begin with in the world (and thus relatively easy to study using observational data), or (2) those that provide a compelling hypothetical natural experiment to account for the variation. So in this sense—perhaps I am actually just restating your main point—survey experiments only generate clear causal inferences if the key variation arises from a credible (hypothetical) natural experiment.

matching with multilevel data, discussing some strategies

Nyasha, a PhD candidate from the Netherlands, writes,

I am evaluating a food aid program for HIV/AIDS afflicted families and individuals in Zambia. This is the data I have:
  1. 4 zones were selected in an urban area. This purposive selection was based on HIV prevalence data; these 4 zones appear to have slightly higher rates than others.
  2. Proxy means testing was then used to select households in these 4 zones. The selection criteria were based on a combination of household and individual characteristics.
  3. I have data from 200 “treated” households from the 4 selected zones and data from 200 similar “control” households from 4 zones that were not selected.
  4. Within these households, I am interested in assessing outcomes at the individual level. I have personal medical data for HIV patients (300 observations), household consumption data (400 observations), and personal labour supply data for everyone in each household (1935 observations). The first chapter of my thesis looks at the patient-level data, the second looks at the household consumption data, and the last looks at the individual labour supply data.
A few questions:
  • How can I proceed with creating propensity scores? I would like to analyse the program’s impact on both household-level and individual-level outcomes. Do I use a propensity score model defined at the household level when I am assessing individual outcomes? Is it justified to use three different propensity score models, one for each level of analysis?
  • How do I incorporate the geographical selection in the matching? When I include the zone dummies in the logit model (psmatch2), I have a separation problem, with 6 of the dummies being dropped from the model. Should I continue to use this logit model? Or should I completely drop the zone dummies?
  • What do I do with the HIV rate? There is a difference between the selected and non-selected zones. However, HIV rates do not appear to be correlated with any of my outcomes. I was thinking of using it as an instrument in IV regressions. Is it a justifiable instrument, especially when the variable is only available for 8 cluster zones? My preliminary diagnostic tests show it’s a valid instrument.
  • Can I also include the HIV rate variable as a cluster/geographic-level covariate in PSM? Or do I exclude it since it appears to be more of an instrument?

Some of the questions are probably best posed to someone working in your discipline. But let me respond to the general question about matching on multilevel data. From what I understand, your treated and control individuals are from different communities. Whether or not this is the case, there are different ways to do the matching with multilevel data:

One way is to first use community-level data to match communities that include treated people with communities that don’t. Then, within each matched pair or matched set, find individual-level matches for each treated individual, drawing controls only from the matched community. That is a two-stage matching approach, and it makes sense if you think that community-level factors are really important.
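Here is a minimal sketch of the two-stage idea in Python. Everything about it is hypothetical: the data frames comm (one row per community, indexed by community id, with a has_treated flag) and ind (one row per individual), the column names, and the greedy Euclidean nearest-neighbor matching, which stands in for whatever distance metric you prefer:

    from scipy.spatial.distance import cdist

    def two_stage_match(comm, ind, comm_vars, ind_vars):
        # Stage 1: pair each community containing treated units with its
        # nearest untreated community on the community-level covariates.
        tc, cc = comm[comm["has_treated"]], comm[~comm["has_treated"]]
        d = cdist(tc[comm_vars], cc[comm_vars])
        pairs = {t: cc.index[j] for t, j in zip(tc.index, d.argmin(axis=1))}

        # Stage 2: within each community pair, match every treated
        # individual to the nearest control drawn only from the paired
        # community.
        matches = []
        for t_comm, c_comm in pairs.items():
            tr = ind[(ind["community"] == t_comm) & (ind["treated"] == 1)]
            co = ind[(ind["community"] == c_comm) & (ind["treated"] == 0)]
            if tr.empty or co.empty:
                continue
            d2 = cdist(tr[ind_vars], co[ind_vars])
            matches += [(i, co.index[j])
                        for i, j in zip(tr.index, d2.argmin(axis=1))]
        return matches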

Another way is to simply load the community-level variables in along with the individual-level variables and match on everything at the same time. This makes sense if you think that community-level variables are no more or less important than individual-level variables.
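And here is a self-contained sketch of the pooled alternative, again with made-up data and variable names, using a single logit propensity score estimated on individual- and community-level covariates at once:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 400

    # Invented data: individuals with one individual-level covariate
    # (age) and one community-level covariate (HIV prevalence) already
    # merged onto each individual's row.
    df = pd.DataFrame({
        "treated": rng.integers(0, 2, n),
        "age": rng.normal(35.0, 10.0, n),
        "comm_prev": rng.choice([0.10, 0.15, 0.20, 0.25], n),
    })

    # One propensity score model over both levels at the same time.
    X = sm.add_constant(df[["age", "comm_prev"]])
    df["pscore"] = sm.Logit(df["treated"], X).fit(disp=0).predict(X)

    # Greedy nearest-neighbor matching on the score, with replacement.
    controls = df.loc[df["treated"] == 0, "pscore"]
    matches = {i: (controls - p).abs().idxmin()
               for i, p in df.loc[df["treated"] == 1, "pscore"].items()}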

In practice, the two approaches may generate nearly identical solutions. But that may not hold in your data, in which case you need to decide whether you think the community-level variables are of paramount importance or not.

The matching can be done with pscores, coarsened exact matching, nearest neighbor, genetic matching, or something else—whatever you like. There are benefits and downsides to each. I have used genetic matching because, in theory, it obtains balance at least as good as either pscore or Mahalanobis-distance nearest-neighbor matching can obtain. I have also used coarsened exact matching because of its transparency and ease of interpretation. Another alternative would be to use a generalized weighting algorithm, but I don’t think there is readily available software for it yet (although some of Jens Hainmueller’s current work seems to be promising).
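To show just how transparent coarsened exact matching is, here is a toy hand-rolled version in Python (a sketch only, not a substitute for the dedicated packages, and the column names are hypothetical):

    import pandas as pd

    def cem(df, covariates, bins=4, treated_col="treated"):
        # Coarsen each covariate into a few bins, then exact-match on
        # the joint bin signature (the "stratum").
        coarse = df[covariates].apply(lambda c: pd.cut(c, bins, labels=False))
        stratum = pd.Series(map(tuple, coarse.to_numpy()), index=df.index)
        # Keep only strata that contain both treated and control units.
        keep = df[treated_col].groupby(stratum).transform("nunique") == 2
        return df[keep]

    # e.g., matched = cem(df, ["age", "hh_size", "assets"], bins=4)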

On some of your other questions: the separation problem with the zone dummies seems to be due to the fact that some zones had no treated units or no control units. If there are some zones that have both, you might restrict your analysis to those zones. You might then do another analysis that adds in matched data created according to the first option above.
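Diagnosing this is quick. A sketch with invented data (zone A is all treated, zone B all control, so only zone C supports comparisons):

    import pandas as pd

    df = pd.DataFrame({
        "zone": ["A", "A", "B", "B", "C", "C"],
        "treated": [1, 1, 0, 0, 1, 0],
    })

    # Tabulate treated counts per zone; separation comes from zones that
    # are all-treated or all-control.
    counts = df.groupby("zone")["treated"].agg(n_treated="sum", n_total="count")
    has_both = (counts["n_treated"] > 0) & (counts["n_treated"] < counts["n_total"])
    analysis_df = df[df["zone"].map(has_both)]  # keeps only zone C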

Indeed, if you think something has the properties of an instrument, then you do not want to include it in the matching algorithm. Conditioning on an instrument does nothing to remove confounding from unobserved factors, and it can actually amplify whatever bias those factors induce.

Nyasha followed up,

Another question I have is this: four zones or communities in my data have treated units only. The other four have controls only. That is why I think I am encountering the separation problem when I add community dummies into the logit equation.

Most certainly that will create such problems.

I have a few other observed community characteristics, which I have included already, but how best do I then control for unobserved effects (especially endogenous program placement) at the community level if I cannot include the community dummies in the propensity score model?

Alas, you cannot. You need to make an assumption that the measured covariates capture all of the relevant differences between the communities, and then match using these measured covariates.

Should I also carry out further regression analysis on the matched sample, where I then include the community dummies (as fixed effects)?

You will not be able to do this because of perfect collinearity with the treatment indicator. There is really no way to account for unmeasured community-level factors. The best you can do is use the measured information to match, and you can also include these community-level covariates in whatever regressions you use. Then, perhaps, you can conduct a sensitivity analysis.

Or should I also look into using an IV with community fixed effects? Would this work for cross section data?

Again, you won’t be able to do it because the community fixed effects will be perfectly collinear with the treatment indicator, so a second stage regression with community fixed effects would not be identified.
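To see the collinearity concretely, here is a toy check in Python (made-up numbers: four communities of two units each, the first two all treated, the last two all control):

    import numpy as np

    community = np.repeat([0, 1, 2, 3], 2)
    treated = (community <= 1).astype(float)

    # Community fixed-effect dummies plus the treatment indicator.
    dummies = (community[:, None] == np.arange(4)).astype(float)
    X = np.column_stack([treated, dummies])

    # Treatment is exactly dummy0 + dummy1, so stacking it adds nothing:
    print(np.linalg.matrix_rank(dummies), np.linalg.matrix_rank(X))  # 4 4

Because the treatment column is an exact linear combination of the community dummies, no amount of data will separate the treatment effect from the community effects.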

Did I miss something here?


letter to Senators regarding USIP defunding

UPDATE: A petition is available online: link.

Dear Senator __________,

The House has voted to cut funding for the United States Institute of Peace. I write to urge you not to make the same error.

What USIP provides is a means by which the government ensures itself access to a diverse, independent, and up-to-date pool of expert knowledge on conflicts around the world. The current events in the Arab world should make it clear how important it is to have such a resource.

USIP supports independent researchers who help our government navigate a turbulent and complex world adeptly. Adept management of Americans’ foreign affairs, like clean air, is a classic “public good,” and thus it requires government support. It is a basic economic principle that private market forces will fail to provide these kinds of goods: no private market actor is capable of internalizing all of the relevant costs and benefits, and therefore no private market actor will find it in its interest to look after the national interest in an efficient and effective manner. USIP is an important means by which our government looks after Americans’ well-being.

The internal mechanisms for knowledge creation in the government are inadequate on their own and cannot substitute for the diverse network of researchers that USIP brings together.

The gap left by the potential removal of USIP, a nonpartisan agency that helps guide research in a direction serving our country’s goals overseas, will most certainly be filled by considerably less reliable partisan voices and business interests. This would handicap our government’s foreign policy.

USIP’s budget is a minuscule fraction of overall spending, but its impact is great. The cost-benefit calculation is clearly on the side of sustaining support for USIP, especially when considered relative to other items that draw on considerably more government funds with considerably less reward.

I hope you will make the wise decision and reject any move to defund USIP.

Sincerely,

Cyrus D. Samii
Fellow, MacMillan Center, Yale University
Assistant Professor (as of July 2011), Politics Department, New York University


(technical) yet more on clustering and standard errors: clustering in the regressors

A little technical note on how correlation in regressors, which can be measured, can sometimes provide guidance in choosing what kind of standard error to use: correlation_in_x110217

This is pretty much straight Moulton factor (why is there no Wikipedia entry on Moulton factor to link to?). I still need to reconcile this stuff with what I had shown a few months ago about how the Moulton factor leads to the wrong conclusion in the context of correlated potential outcomes, even if treatment assignment is at the unit level (yes, I said unit level, not cluster level). For a refresher on that, see here: link1 link2. A big difference in the document that is attached to this post is that we are looking at a vanilla “constant effects” set-up, whereas the potential outcomes stuff was agnostic on unit-by-unit differences in effect sizes.
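For reference, here is the textbook statement of the Moulton result as I understand it (equal-sized clusters of n units; rho_x and rho_e denote the intraclass correlations of the regressor and the errors; the notation is mine, not from the attached document):

    \[
    \frac{\operatorname{Var}_{\text{cluster}}(\hat{\beta})}
         {\operatorname{Var}_{\text{OLS}}(\hat{\beta})}
    \;\approx\; 1 + (n - 1)\,\rho_x \rho_e
    \]

Note that when the regressor is uncorrelated within clusters (rho_x = 0), the inflation factor is approximately 1, which is exactly why the measurable correlation in x can guide the choice of standard errors.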


because we feel like it

Sometimes we just “feel” like doing something. By my reading of recent neuroscience, these situations may arise because somewhere in our brain there are processes that have determined that this “something” is optimal and the signals from these processes have overwhelmed signals from others that may have come to a contrary conclusion.

Our thoughts and actions are the result of numerous parallel processes. These are sometimes combined in an apparently sensible way, giving us the illusion of an integrated self (link). But sometimes they do not come together in a sensible way, and so we cannot immediately intuit a reason. We just feel like it.

The ways in which external stimuli and those parallel processes can mix are vast, so our urges to do things may take into account many dimensions of which we are barely aware. So long as we let ourselves occasionally take actions because we “feel like it,” these processes reveal a preference ordering that we cannot access intentionally, and in acting on them we discover features of that inaccessible inner ordering. One implication is that we can misjudge ourselves just as much as we can misjudge others (link).
