# Combining information from surveys that use multiple modes of contact

Over at the Statistical Modeling, Causal Inference, and Social Science blog (link), Andrew Gelman writes,

I’m involved (with Irv Garfinkel and others) in a planned survey of New York City residents. It’s hard to reach people in the city–not everyone will answer their mail or phone, and you can’t send an interviewer door-to-door in a locked apartment building. (I think it violates IRB to have a plan of pushing all the buzzers by the entrance and hoping someone will let you in.) So the plan is to use multiple modes, including phone, in person household, random street intercepts and mail.

The question then is how to combine these samples. My suggested approach is to divide the population into poststrata based on various factors (age, ethnicity, family type, housing type, etc), then to pool responses within each poststratum, then to runs some regressions including postratsta and also indicators for mode, to understand how respondents from different modes differ, after controlling for the demographic/geographic adjustments.

Maybe this has already been done and written up somewhere?

It’s interesting to consider this problem by combining a “finite population” perspective with some ideas about “principal strata” from the causal inference literature. Suppose a finite population U from which we draw a sample of N units. We have two modes of contact, A and B. Suppose for the moment that each unit can be characterized by one of the following response types (these are the “principal strata”):

Type Mode A response Mode B response
I 1 1
II 1 0
III 0 1
IV 0 0

Then, there are two cases to consider, depending on whether mode of contact affects response:

Mode of contact does not affect response

This might be a valid assumption if the questions of interest are not subject to social desirability biases, interviewer effects, etc. In this case, it is easy to define a target parameter as the average response in the population. You could proceed efficiently by first applying mode A to the sample, and then applying mode B to those who did not respond with mode A. At the end, you would have outcomes for types I, II, and III units, and you’d have an estimate of the rate of type IV units in the population. You could content yourself with an estimate for the average response on the type I, II, and III subpopulation. If you wanted to recover an estimate of the average response for the full population (including type IV’s), you would effectively have to impute values for type IV respondents. This could be done by using auxiliary information either to genuinely impute or (in a manner that is pretty much equivalent) to determine which type I, II, or III units resemble the missing type IV units, and up-weight. In any case, if the response of interest has finite support, one could also compute “worst case” (Manski-type) bounds on the average response by imputing maximum and minimum values to type IV units.

Mode of contact affects response

This might be relevant if, for example, the modes of contact are phone call versus face-to-face interview, and outcomes being measured vary depending on whether the respondent feels more or less exposed in the interview situation. This possibility makes things a lot trickier. In this case, each unit is characterized by a response under mode A and another under mode B (that is, two potential outcomes). One immediately faces a quandary of defining the target parameter. Is it the average of responses under the two modes of contact? Maybe it is some “latent” response that is imperfectly revealed under the two modes of contact? If so, how can we characterize this “imperfection”? Furthermore, only for type I individuals will you be able to obtain information on both potential responses. Does it make sense to restrict ourselves to this subpopulation? If not, then we would again face the need for imputation. A design that applied both mode A and mode B to the complete sample would mechanically reveal the proportion of type I units in the population, and by implication would identify the proportion of type II, III, and IV units. For type II units we could use mode A responses to improve imputations for mode B responses, and vice versa for type III respondents. Type IV respondents’ contributions to our estimate of the “average response” would be based purely on auxiliary information. Again, one could construct worst case bounds by imputing maximum and minimum response values for each of the missing response types.

One wrinkle that I ignored above was that the order of modes of contact may affect either response behavior or outcomes reported. This multiplies the number potential response behaviors and the number of potential outcome responses given that the unit is interviewed. You could get some way past these issues by randomizing the order of mode of contact—e.g. A then B for one half, and B then A for the other half. But you would have to impose some more assumptions to make use of this random assignment. E.g., you’d have to assume that A-then-B always-responders are exchangeable with B-then-A always responders in order to combine the information from the always-responders in each half-sample. Or, you could “shift the goal posts” by saying that all you are interested in is the average of responses from modes A and B under the A-then-B design.

Update:

The above analysis did not explore how other types of assumptions might help to identify the population average. Andy’s proposal to use post-stratification and regressions relies (according to my understanding) on the assumption potential outcomes are independent of mode of contact conditional on covariates. Formally, if the mode of contact is $latex M$ taking on values $latex A$ or $latex B$, potential outcomes under mode of contact $latex m$ is $latex y(m)$, $latex T$ is principal stratum, and $latex X$ is a covariate, then $latex \left[y(A),y(B)\right] \perp M | T, X$ implies that,

$latex E(y(m)|T,X) = E(y(m)|M=m, T,X) = E(y(m)|M \ne m, T,X)$.

As discussed above, the design that applies modes A and B to all units in the sample can determine principal stratum membership, and so these covariate- and principal-stratum specific imputations can be applied. Ordering effects will again complicate things, and so more assumptions would be needed. A worthwhile type of analysis would be to study evidence of mode-of-contact as well as ordering effects among the type I (always-responder) units.

Now, it may be that mode of contact affects response but units are contacted via either mode A or B. Then, a unit’s principal stratum membership is not identifiable, nor is the proportion of types I through IV identifiable (we would end up with two mixtures of responding and non-responding types, with no way to parse out relative proportions of the different types). If some kind of response “monotonicity” held, then that would help a little. Response monotonicity would mean that either type II or type III responders didn’t exist. Otherwise, we would have to impose more stringent assumptions. The common one would be that principal stratum membership is independent of potential responses conditional on covariates. This is a classic “ignorable non-response” assumption, and it suffers from having no testable implications.

# Michela Wrong on corruption, ethnicity, and development in Kenya

Within minutes of the announcement of Kibaki’s victory, the multi-ethnic settlements of Nairobi, Mombasa, Kisumu, Eldoret and Kakamega erupted. Luo and Luhya ODM supporters armed with metal bars, machetes and clubs vented their frustration and fury on local Kikuyu and members of the smaller, pro-PNU Meru, Embu and Kisii tribes, setting fire to homes and shops. The approach was brutally simplistic. Many Kikuyu, especially the young, urban poor, had actually voted ODM, regarding Raila, ‘the People’s President’, as far more sympathetic to their needs than the aloof Kibaki. But mobs don’t do nuance. Fury needs a precise shape and target if it is to find expression, and ethnicity provided that fulcrum.

[U]nder a system which decreed that all advancement was determined by tribe, such hostility was entirely rational. Had all Kenyans believed they enjoyed equal access to state resources, there would have been no explosion.

Nowhere was this dawning of ethnic self-awareness more sudden than in the slums, Kenya’s melting pots, where new frontiers coagulated like DNA strands, forming as suddenly on the ground as they had in people’s minds. The notion that urban youth would serve as midwives to the birth of a cosmopolitan, united nation looked like idealistic nonsense–the worst violence took place in places like Kibera and Mathare, and it was committed by youngsters.

In the space of only two months, Kenya had changed beyond recognition. Rolling back the migration trends of half a century, a process of self-segregation was under way.

‘The generation that harboured that kind of ethnic hatred was dying away,’ says John Kiriamiti. A former bank robber, he renounced crime to become a respectable newspaper publisher in Muranga, and now quails at the violence he once took in his stride. ‘Our children didn’t know about it. But they have understood it now, and it will take a long, long time to vanish.’

Quotes on the 2007-8 electoral crisis in Kenya from Michaela Wrong’s It’s Our Turn to Eat (link), which I just finished. Most of the book follows the saga of Kibaki’s former anti-corruption adviser and whistleblower, John Githongo. It makes for a gripping narrative through which Wrong provides some nice insights on how an ethnic winner-take-all mentality has undermined Kenyan democratic politics and created pressures that erupted in the 2007-8 electoral crisis. The question naturally arises: what institutional reforms might allow a society to transcend a perilous inter-ethnic dynamic such as this? Are quotas or integrative institutions useful or harmful? This is part of some new research that Elisabeth King and I are currently undertaking. Watch this space for updates.

# Advice from G. Imbens on design & analysis of RCTs

At the International Initiative for Impact Evaluation (3ie) conference in Cuernavaca this past week (link), Guido Imbens from Harvard University gave two lectures on standards for the design and analysis of randomized control trials (RCTs). I thought it was worth a rather lengthy post to describe what he covered. These are very important insights, and only a portion of what he discussed is posted to the web. The first lecture drew mostly from his book manuscript with Donald Rubin, while a paper associated with the second lecture is here (link updated link). The summary below is based on my understanding of his lectures.

Imbens’s first lecture focused on analyzing data from RCTs using randomization-based methods. The standard analysis that Imbens proposes includes (1) a Fisher-type permutation test of the sharp null hypothesis—what Imbens referred to as “testing”—along with a (2) Neyman-type point estimate of the sample average treatment effect and confidence interval—what Imbens referred to as “estimation.” (See these past posts for some discussion of related points: Fisher style testing and Neyman style estimation.) Interestingly, and in a way that runs contrary to Rosenbaum’s proposed method of analysis (link), Imbens claimed that testing and estimation are separate enterprises with separate goals and that the two should not be confused. I took it as a warning against proposals that use “inverted” tests in order to produce point estimates and confidence intervals. There is no reason that such confidence intervals will have accurate coverage except under rather dire assumptions, meaning that they are not “confidence intervals” in the way that we usually think of them. (This is a point that Peter Aronow and I have developed formally in a paper that hopefully will be published soon.) Thus, Imbens’s suggestion was that a rank-based test was a good choice for null hypothesis testing, owing to its insensitivity to outliers and relative power (i.e., its Pittman efficiency), but that estimation should be based on sample theoretic (Neyman-type) principles. In most practical cases, ordinary least squares (OLS) regression with robust standard errors produces estimates that are, in fact, justified on sample theoretic grounds (even if “randomization does not justify regression assumptions”, as Freedman famously noted).

Imbens devoted a good amount of time in the first lecture to special challenges in analyzing cluster randomized experiments. Cluster randomized trials present analysts with a choice between estimating a treatment effect defined at the level of individual units versus a treatment effect defined at the cluster level. Typically it is the former (unit-level) treatment effect that interests us. However, unbiased estimation is complicated by the fact that cluster-level assignment, combined with variation in cluster sizes, implies variation in unit-level treatment assignment propensities. (NB: What matters is the size of the cluster in the population, not the size of the sample from the cluster. Only under extremely artificial circumstances would cluster sizes ever be equal, in which case these problems are pretty much always relevant in cluster randomized trials.) One may use weights to account for these differences in assignment propensities, but this introduces other issues: simply weighting by relative cluster size introduces a scale invariance problem. Normalizing the weights removes this problem but introduces a ratio estimation problem that creates finite sample bias. (The ratio estimation problem arises because the normalization depends on the average cluster sizes in the treated and control groups, which is random.) However, the bias is typically small in moderate or large samples. The cluster level analysis has none of these problems, since aggregating to the cluster level results in a simple randomized experiment. Given these issues, Imbens’s recommendation was to do both a cluster-level analysis and unit-level analysis, with testing and power analysis focused on the cluster level analysis, and estimation carried out at both levels, using normalized weights for the unit-level analysis. (The issue has also received a nice, thorough treatment by Aronow and Middleton in a paper presented at MPSA last spring—link. They come up with other randomization-based strategies for analyzing cluster randomized trials.)

The second lecture was on design considerations, focusing primarily on (1) optimal design for cluster randomized trials and (2) approaches to “re-randomization” and generalized pre-treatment covariate balancing. On the design of cluster randomized trials, Imbens reviewed an ongoing debate in the statistical literature, in which (1) Snedecor/Cochran and Box/Hunter/Hunter have proposed that in cluster RCTs with small or moderate numbers of clusters, we might have good reason to avoid stratification or pair matching, while (2) Donner/Klar have proposed that some stratification is almost always useful but that pair matching creates too many analytic problems to be worth it, while most recently, (3) Imai/King/Nall have proposed that pair-matching should always be done when possible. Imbens ultimately comes out on the side of Donner/Klar. His take is that the very existence of the debate is based largely on the confusion between testing and estimation: stratification (with pair matching as the limit) always makes estimation more efficient, but under some situations may result in reduced power for testing. This may seem paradoxical, but remember that that power is a function of how a design affects the variability in point estimates (efficiency, or $latex V(\hat \beta)$) relative to the effects of the design on the variability of the estimates of the variance of point estimates (that is, $latex V[\hat V(\hat \beta)]$). This is apparent when you think of power in terms of the probability of rejecting the null using a t-statistic, for which the point estimate is in the numerator and square root of the variance estimate is in the denominator. Noise in both the numerator and denominator affect power. Even though stratification will always reduce the variability of the point estimate, in some cases the estimate of the variance of this point estimate can become unstable, undermining power.

That being the case, clear and tractable instances under which stratification leads to a loss in power arise only in very rigidly defined scenarios. Imbens demonstrated a case with constant treatment effects, homoskedasticity, all normal data, and an uninformative covariate in which a t-test loses power from stratification. But loosening any of these conditions led to ambiguous results, as did the replacement of the t-test with permutation tests. These pathological cases are not a reliable basis for selecting a design. Rather, a more reliable basis is the fact that stratification always improves efficiency. The upshot is that some pre-treatment stratification (i.e., blocking) is always recommended, and this is true whether or not the trial is cluster randomized or unit randomized.

The question then becomes, how much stratification? Here, Imbens disagreed with Imai/King/Nall by proposing that pair-matching introduces more analytical problems than its worth relative to lower-order stratification. The problem with pair matching is that the within-pair variance for the estimated treatment effect is unidentified (you need at least two treated and two control units to identify the estimated treatment effect variance). Thus, one must use the least upper bound identified by the data, which is driven by the between-pair variance. This leads to an overconservative estimator that tosses out efficiency gains from pairing. It also prevents examination of heterogeneity that may be of interest. Thus, Imbens’s recommendation was stratification up to at least two-treated and two-control units or clusters per stratum.

(I asked Imbens how he would propose to carry out the stratification, especially when data are of mixed types and sparseness was a potential problem. His recommendation was dimension reduction. Thus, one could use predicted baseline values from a rich covariate model, clustering algorithms constrained by baseline values, or slightly more flexible clustering. The goal is to reduce the stratification problem to one or a few manageable dimensions.)

The lecture on design also covered some new (and rather inchoate) ideas on “re-randomization” and generalized covariate balancing. The problem that he was addressing was the situation where you perform a randomization, but then note problematic imbalances in the resulting experiment. Should you rerandomize? If so, what should be the principles to guide you? Imbens’s take was that, yes, you should rerandomize, but that it should be done in a systematic manner. The process should (1) define, a priori, rejection rules for assignments, (2) construct the set of all possible randomizations (or an arbitrarily large sample of all possible randomizations), (3) apply the rejection rule, and then (4) perform the actual treatment assignment by randomizing within this set of acceptable randomizations. This type of “restricted randomization” typically uses covariate balance criteria to form the rejection rules; as such, it generalizes the covariate balancing objective that typically motivates stratification.

Some complications arise however. There is no clear basis on which one should define a minimum balance criterion. One could seek to maximize the balance that would obtain in an experiment. Doing so, one would effectively partition the sample into two groups that are maximally balanced, and randomly select one of the two partitions to be the treatment group. But in this case, the experiment would reduce to a cluster randomized experiment on only two clusters! There would be a strong external validity cost, and testing based on permutation would be degenerate. I asked Imbens about this, and his recommendation was that the rejection rules used to perform restricted randomization should maintain a “reasonable” degree of randomness in the assignment process. I think the right solution here is still an open question.

Imbens’s lectures are part of 3ie’s efforts toward standardization in randomized impact evaluations—standardization that greatly facilitates systematic reviews and meta-analyses of interventions, which is another of 3ie’s core activities (link). For those involved in RCTs and field experiments in development, I highly recommend that you engage 3ie’s work.