matching with multilevel data, discussing some strategies

Nyasha, a PhD candidate from the Netherlands, writes,

I am evaluating a food aid program for HIV/AIDS afflicted families and individuals in Zambia. This is the data I have:
  1. 4 zones were selected in an urban area. This purposive selection was based on HIV prevalence data; these 4 zones appear to have slightly higher rates than others.
  2. Proxy means testing was then used to select households in these 4 zones. The selection criteria were based on a combination of household and individual characteristics.
  3. I have data from 200 “treated” households from the 4 selected zones and data from 200 similar “control” households from 4 zones that were not selected.
  4. Within these households, I am interested in assessing outcomes at individual level. I have personal medical data for HIV patients (300 observations), household consumption data (400 observations) and personal labour supply data for everyone in each household (1935 observations). The first chapter of my thesis looks at patient level data, the second looks at household consumption data, and the last looks at individual labour supply data.
A few questions:
  • How can I proceed with creating propensity scores? I would like to analyse the program’s impact on both household level outcomes and individual level outcomes. Do I use a propensity score model defined at household level when I am assessing individual outcomes? Is it justified to use a different propensity score model for each of the three levels of analysis?
  • How do I incorporate the geographical selection in the matching? When I include the zone dummies in the logit model (psmatch2), I have a separation problem, with 6 of the dummies being dropped from the model. Should I continue to use this logit model? Or should I completely drop the zone dummies?
  • What do I do with the HIV rate? It differs between selected and non-selected zones. However, HIV rates do not appear to be correlated with any of my outcomes. I was thinking of using it as an instrument in IV regressions. Is it a justifiable instrument, especially when the variable is only available for 8 cluster zones? My preliminary diagnostic tests show it’s a valid instrument.
  • Can I also include the HIV rate variable as a cluster/geographic level covariate in PSM? Or do I exclude it as it appears more to be an instrument?

Some of the questions are probably best posed to someone working in your discipline. But let me respond to the general question about matching on multilevel data. From what I understand, your treated and control individuals are from different communities. Whether or not this is the case, there are different ways to do the matching with multilevel data:

One way is to first use community level data to match communities that include treated people with communities that do not. Then, within each of the matched pairs or matched sets of communities, find individual level matches for each of the treated individuals by drawing controls only from the matched community. That would be a two-stage matching approach, and it makes sense if you think that community level factors are really important.
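A minimal sketch of the two-stage idea, assuming greedy nearest neighbor matching without replacement at both stages. The community and person records, the `hiv_rate` and `age` covariates, and the function names are all hypothetical; in practice you would match on several standardized covariates with whatever algorithm you prefer.

```python
from math import sqrt

def nearest(target, candidates, keys, used=()):
    """Return the unused candidate closest to target (Euclidean distance over keys)."""
    best, best_d = None, float("inf")
    for c in candidates:
        if c["id"] in used:
            continue
        d = sqrt(sum((target[k] - c[k]) ** 2 for k in keys))
        if d < best_d:
            best, best_d = c, d
    return best

def two_stage_match(communities, people, comm_keys, indiv_keys):
    treated_comms = [c for c in communities if c["treated"]]
    control_comms = [c for c in communities if not c["treated"]]

    # Stage 1: pair each treated community with its closest unused control community.
    comm_pairs, used = {}, set()
    for tc in treated_comms:
        m = nearest(tc, control_comms, comm_keys, used)
        if m is not None:
            used.add(m["id"])
            comm_pairs[tc["id"]] = m["id"]

    # Stage 2: match treated individuals only to people in the paired control community.
    person_matches = {}
    for t_id, c_id in comm_pairs.items():
        pool = [p for p in people if p["community"] == c_id]
        taken = set()
        for p in people:
            if p["community"] != t_id:
                continue
            m = nearest(p, pool, indiv_keys, taken)
            if m is not None:
                taken.add(m["id"])
                person_matches[p["id"]] = m["id"]
    return comm_pairs, person_matches

# Hypothetical toy data: two treated communities (A, B), two control communities (C, D).
communities = [
    {"id": "A", "treated": True, "hiv_rate": 0.20},
    {"id": "B", "treated": True, "hiv_rate": 0.15},
    {"id": "C", "treated": False, "hiv_rate": 0.19},
    {"id": "D", "treated": False, "hiv_rate": 0.14},
]
people = [
    {"id": "p1", "community": "A", "age": 30},
    {"id": "p2", "community": "A", "age": 50},
    {"id": "p3", "community": "C", "age": 31},
    {"id": "p4", "community": "C", "age": 48},
    {"id": "p5", "community": "B", "age": 40},
    {"id": "p6", "community": "D", "age": 41},
    {"id": "p7", "community": "D", "age": 25},
]
comm_pairs, person_matches = two_stage_match(communities, people, ["hiv_rate"], ["age"])
```

The point of the design is that the second stage never looks outside the paired community, so community level similarity takes strict priority over individual level similarity.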

Another way is to simply load in the community level variables along with the individual level variables and match on everything at the same time. It makes sense when you think that community level variables are no more or less important than individual level variables.
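As a rough illustration of matching on everything at once, the sketch here attaches the community level covariate to each individual record, rescales every covariate by its standard deviation, and then does nearest neighbor matching with replacement. The covariate names (`age`, `zone_hiv_rate`) and the toy records are hypothetical, and the scaled Euclidean distance is only one of many possible distance choices.

```python
from statistics import pstdev

def standardize(rows, keys):
    """Rescale each covariate by its pooled standard deviation."""
    scales = {k: pstdev([r[k] for r in rows]) or 1.0 for k in keys}
    return [{**r, **{k: r[k] / scales[k] for k in keys}} for r in rows]

def match_all_at_once(rows, keys):
    """Nearest neighbor matching (with replacement) on all covariates jointly."""
    z = standardize(rows, keys)
    treated = [r for r in z if r["treated"]]
    controls = [r for r in z if not r["treated"]]
    matches = {}
    for t in treated:
        best = min(controls, key=lambda c: sum((t[k] - c[k]) ** 2 for k in keys))
        matches[t["id"]] = best["id"]
    return matches

# Hypothetical records: an individual covariate plus a community covariate on each row.
rows = [
    {"id": "t1", "treated": True, "age": 30, "zone_hiv_rate": 0.20},
    {"id": "t2", "treated": True, "age": 50, "zone_hiv_rate": 0.15},
    {"id": "c1", "treated": False, "age": 31, "zone_hiv_rate": 0.14},
    {"id": "c2", "treated": False, "age": 49, "zone_hiv_rate": 0.19},
    {"id": "c3", "treated": False, "age": 40, "zone_hiv_rate": 0.20},
]
pooled_matches = match_all_at_once(rows, ["age", "zone_hiv_rate"])
```

Here the community covariate is just another column, so a close match on age can be traded off against a poor match on the zone HIV rate, and vice versa.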

In practice, the two approaches may generate nearly identical solutions. But such may not be the case for you, in which case you need to decide whether you think the community level variables are of paramount importance or not.

The matching can be done with propensity scores, coarsened exact matching, nearest neighbor, genetic matching, or something else; use whatever you like. There are benefits and downsides to each. I have used genetic matching because in theory it can achieve balance at least as good as either propensity score or Mahalanobis distance nearest neighbor matching can obtain. I have also used coarsened exact matching because of its transparency and ease of interpretation. Another alternative would be to use a generalized weighting algorithm, but I don’t think there is readily available software for it yet (although some of Jens Hainmueller’s current work seems to be promising).
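To show why coarsened exact matching is easy to interpret, here is a simplified sketch of its core step: bin each covariate, group units by their bin signature, and keep only the strata that contain both treated and control units. This is not the interface of any existing CEM package; the function names, cutpoints, and records are assumptions made for illustration.

```python
from collections import defaultdict

def coarsen(value, cutpoints):
    """Return the index of the bin that value falls into."""
    return sum(value >= c for c in cutpoints)

def cem_strata(rows, cutpoints):
    """Group units by coarsened covariates; keep strata with both treated and controls."""
    strata = defaultdict(list)
    for r in rows:
        signature = tuple(coarsen(r[k], cuts) for k, cuts in cutpoints.items())
        strata[signature].append(r)
    kept = {}
    for signature, members in strata.items():
        if {m["treated"] for m in members} == {True, False}:
            kept[signature] = [m["id"] for m in members]
    return kept

# Hypothetical households with two covariates and user-chosen cutpoints.
households = [
    {"id": 1, "treated": True, "age": 25, "hh_size": 3},
    {"id": 2, "treated": False, "age": 28, "hh_size": 4},
    {"id": 3, "treated": True, "age": 45, "hh_size": 6},
    {"id": 4, "treated": False, "age": 62, "hh_size": 2},
    {"id": 5, "treated": False, "age": 47, "hh_size": 7},
]
kept_strata = cem_strata(households, {"age": [30, 50], "hh_size": [5]})
```

Household 4 is dropped because no treated household shares its stratum, which makes explicit where the data offer no comparable units.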

On some of your other questions: the separation problem with the zone dummies seems to be due to the fact that some zones have no treated or no control households. If there are some zones that have both, you might restrict your analysis to those zones. You might then do a second analysis that adds in matched data created according to the first option above.

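Indeed, if you think something has the properties of an instrument, then you do not want to include it in the matching algorithm. An instrument predicts treatment but is unrelated to the outcome except through treatment, so matching on it can amplify bias from any unmeasured confounders.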

Nyasha followed up,

Another question I have is this: four zones or communities in my data have treated households only. The other four have controls only. That is why I think I am encountering the separation problem when I add community dummies into the logit equation.

Most certainly that will create such problems.
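A quick way to see the problem is to cross-tabulate zone by treatment status. The sketch assumes a hypothetical even split of 50 households per zone (the actual per-zone counts are not given); zones 1 to 4 are treated-only and zones 5 to 8 are control-only, as described.

```python
from collections import Counter

def zone_by_treatment(records):
    """Return {zone: (n_control, n_treated)} from (zone, treated) pairs."""
    counts = Counter(records)
    zones = sorted({z for z, _ in records})
    return {z: (counts[(z, 0)], counts[(z, 1)]) for z in zones}

def zones_with_overlap(table):
    """Zones that contain both treated and control households."""
    return [z for z, (n0, n1) in table.items() if n0 > 0 and n1 > 0]

# Hypothetical layout: 4 treated-only zones and 4 control-only zones, 50 households each.
records = [(z, 1) for z in range(1, 5) for _ in range(50)]
records += [(z, 0) for z in range(5, 9) for _ in range(50)]

table = zone_by_treatment(records)
overlap = zones_with_overlap(table)
```

Every zone has an empty cell, so each zone dummy predicts treatment perfectly and there is no zone in which a within-zone comparison could be made.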

I have a few other community observed characteristics, which I have already included, but how best do I then control for unobserved effects (especially endogenous program placement) at community level, if I cannot include the community dummies in the propensity score model?

Alas, you cannot. You need to make an assumption that the measured covariates capture all of the relevant differences between the communities, and then match using these measured covariates.

Should I also carry out further regression analysis on the matched sample, where I then include the community dummies (as fixed effects)?

You will not be able to do this because of perfect collinearity with the treatment indicator. There is really no way to account for unmeasured community level factors. The best you can do is use the measured information to match, and you can also include these community level covariates in whatever regressions you use. Then, perhaps you can conduct a sensitivity analysis.
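The collinearity can be checked directly: when treatment is assigned by zone, the treatment indicator is exactly the sum of the dummies for the treated zones, so adding it to a design matrix that already has a full set of zone dummies does not raise the rank. The sketch uses a small exact-arithmetic rank routine and a hypothetical layout of two households per zone.

```python
from fractions import Fraction

def matrix_rank(matrix):
    """Rank via Gauss-Jordan elimination in exact rational arithmetic."""
    rows = [[Fraction(x) for x in r] for r in matrix]
    rank, ncols = 0, len(rows[0])
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col] != 0:
                f = rows[i][col] / rows[rank][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# Hypothetical design: 8 zones, 2 households each; zones 1-4 treated, zones 5-8 control.
X_dummies, X_full = [], []
for zone in range(1, 9):
    for _ in range(2):
        dummies = [1 if zone == z else 0 for z in range(1, 9)]
        treated = 1 if zone <= 4 else 0
        X_dummies.append(dummies)
        X_full.append(dummies + [treated])

rank_dummies = matrix_rank(X_dummies)  # zone fixed effects only
rank_full = matrix_rank(X_full)        # zone fixed effects plus treatment
```

`X_full` has nine columns but the same rank as `X_dummies`, which is why the treatment coefficient cannot be separately estimated once community fixed effects are included.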

Or should I also look into using an IV with community fixed effects? Would this work for cross section data?

Again, you won’t be able to do it because the community fixed effects will be perfectly collinear with the treatment indicator, so a second stage regression with community fixed effects would not be identified.

Did I miss something here?

3 Replies to “matching with multilevel data, discussing some strategies”

  1. Great post, very informative. May I ask you a question on the same topic? Suppose I want to include only one or two random intercepts into my model (e.g., TSCS), instead of community-level variables as Nyasha did. Should I add the country/year indicators in the matching equation (as in fixed effect models) and again when running the multilevel analysis with the matched data? Am I missing something? Thanks a lot!

  2. If I understand, Danilo, you are proposing to incorporate country and year fixed effects into the analysis. In that case, assuming you have many micro-observations for each country-year combination, you could match on micro-level covariates within those country-year bins (so, exact matching on country-year indicators and then whatever other algorithm you want to use to match on micro covariates).
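A minimal sketch of that suggestion, assuming a single hypothetical micro-level covariate (`income`) and nearest neighbor matching with replacement inside each bin; the country codes, years, and records are made up for illustration.

```python
from collections import defaultdict

def match_within_bins(rows, bin_keys, covariate):
    """Exact match on bin_keys, then nearest neighbor on covariate within each bin."""
    bins = defaultdict(lambda: {"treated": [], "control": []})
    for r in rows:
        key = tuple(r[k] for k in bin_keys)
        bins[key]["treated" if r["treated"] else "control"].append(r)

    matches = {}
    for groups in bins.values():
        if not groups["control"]:
            continue  # no controls in this country-year bin: treated units stay unmatched
        for t in groups["treated"]:
            best = min(groups["control"], key=lambda c: abs(t[covariate] - c[covariate]))
            matches[t["id"]] = best["id"]
    return matches

# Hypothetical micro-observations nested in country-year bins.
micro = [
    {"id": 1, "country": "BR", "year": 2000, "treated": True, "income": 1.0},
    {"id": 2, "country": "BR", "year": 2000, "treated": False, "income": 1.2},
    {"id": 3, "country": "BR", "year": 2000, "treated": False, "income": 3.0},
    {"id": 4, "country": "BR", "year": 2001, "treated": True, "income": 2.0},
    {"id": 5, "country": "BR", "year": 2001, "treated": False, "income": 2.1},
    {"id": 6, "country": "AR", "year": 2000, "treated": True, "income": 1.0},
    {"id": 7, "country": "AR", "year": 2001, "treated": False, "income": 1.0},
]
bin_matches = match_within_bins(micro, ["country", "year"], "income")
```

Unit 6 stays unmatched because its country-year bin has no controls, which is the same overlap requirement that rules out zone dummies in the food aid example.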

  3. Thanks for your reply, Cyrus. Yes, that’s exactly what I meant, and your suggestion makes a lot of sense. I’ll keep it in mind! Thanks again!
