Reading: 7 Properties of Good Models (Gabaix & Laibson, 2008)

This short essay argues that the following criteria should be used to judge whether an analytical economic model is good or not:

  1. parsimony, viz., minimal assumptions and parameters, to reduce risk of overfitting. This would seem to be the essence of modeling, right?
  2. tractability.
  3. conceptual insightfulness, which in the authors’ characterization bears some resemblance to Lakatos’s axiom that a scientific theory should produce “novel facts”.
  4. generalizability.
  5. falsifiability.
  6. empirical consistency.
  7. predictive precision, which is a necessary complement to falsifiability and empirical consistency: a model that makes vague predictions may hold up against the data, but a more useful model might be one that makes sharp predictions that are only slightly off from the data.

The authors acknowledge that these criteria may conflict, forcing trade-offs. Special tensions would seem to arise between parsimony/tractability and falsifiability/empirical consistency/predictive precision.

In their discussion, the authors claim that economic models should not be judged on whether they satisfy optimization axioms. They wish to create space for models that allow a separation between the normative preferences of agents and the actions that they ultimately take—the separation may be due to non-voluntary errors, biases, or emotions. Abandoning optimization axioms means that behavior does not immediately reveal preferences, which complicates normative analysis. The authors accept this, claiming that instead, we should specify models that incorporate parameters capturing non-voluntary processes, and then use data to identify “latent” preferences after conditioning on estimates of these parameters.

Full reference: Gabaix, Xavier, and David I. Laibson. 2008. “The Seven Properties of Good Models.” In The Foundations of Positive and Normative Economics, ed. Andrew Caplin and Andrew Schotter, 292–99. New York: Oxford University Press.

Ungated link:


Nuanced study of local politics and deforestation in Indonesia

From a new working paper on “The Political Economy of Deforestation in the Tropics” by Robin Burgess, Matthew Hansen, Benjamin Olken, Peter Potapov, and Stefanie Sieber (link),

Logging of tropical forests accounts for almost one-fifth of greenhouse gas emissions worldwide, significantly degrades rural livelihoods and threatens some of the world’s most diverse ecosystems. This paper demonstrates that local-level political economy substantially affects the rate of tropical deforestation in Indonesia. Using a novel MODIS satellite-based dataset that tracks annual changes in forest cover over an 8-year period, we find three main results. First, we show that local governments engage in Cournot competition with one another in determining how much wood to extract from their forests, so that increasing numbers of political jurisdictions leads to increased logging. Second, we demonstrate the existence of “political logging cycles,” where illegal logging increases dramatically in the years leading up to local elections. Third, we show that, for local government officials, logging and other sources of rents are short-run substitutes, but that this effect disappears over time as the political equilibrium shifts. The results document substantial deviations from optimal logging practices and demonstrate how the economics of corruption can drive natural resource extraction.
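The Cournot logic behind the first result has a standard comparative static: as the number of symmetric players grows, total extraction rises toward the competitive level. A minimal sketch using a textbook linear-demand Cournot model (the demand and cost numbers are illustrative, not from the paper):

```python
# Symmetric Cournot: inverse demand P = a - b*Q, constant marginal cost c.
# Each of N jurisdictions extracts q* = (a - c) / (b * (N + 1)), so total
# extraction Q(N) = N * (a - c) / (b * (N + 1)) is increasing in N.
a, b, c = 100.0, 1.0, 20.0  # hypothetical demand/cost parameters

def total_extraction(n_jurisdictions: int) -> float:
    """Equilibrium total quantity logged with N symmetric jurisdictions."""
    return n_jurisdictions * (a - c) / (b * (n_jurisdictions + 1))

totals = [total_extraction(n) for n in (1, 2, 4, 8)]
# More jurisdictions -> more total logging, approaching (a - c)/b = 80.
assert all(q1 < q2 for q1, q2 in zip(totals, totals[1:]))
print(totals)
```

Splitting a district thus raises aggregate extraction even with no change in preferences, which is the paper's "increasing numbers of political jurisdictions leads to increased logging" channel.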

There’s lots to like about the paper, including a well-identified causal story. (They were lucky that others had already done most of the leg-work needed to demonstrate this.) It is also a timely contribution, as Indonesia is one of the pilot cases for the new global REDD initiative to deal with greenhouse gas build-up through forest protection “carbon credits” (link). This kind of “diagnostic” research can determine intervention points that should be targeted by future programs aiming to promote forest conservation. It’s already a long paper, but their case would be strengthened if they provided some narrative accounts that demonstrated the plausibility of their interpretation of the data.


Mechanisms to deal with grade inflation

New York Times covers measures recommended by a UNC committee, led by sociologist Andrew Perrin, to deal with grade inflation (link). The suggestions include issuing a statement on the appropriate proportion of students in each class that should receive A’s and also having students’ transcripts include information on a class’s grade distribution (e.g., the class median grade or the percentage of A’s) next to a student’s grade for that class.

This is an interesting design problem. For graduate school admissions, as grades become less informative signals of quality, standardized tests would presumably receive extra weight. This puts a lot of stress on standardized tests, and it’s not clear that, e.g., the GREs are up to the job, given that they are meant to screen for such a broad range of application types. Witness the amount of heaping that takes place at the upper end of the score range for the quantitative section of the GRE. Ultimately this introduces a lot of arbitrariness into the graduate admissions process.

The solution of adding extra information to transcripts is reasonable given the constraints. But it passes the buck to admissions committees (and other committees, such as scholarship committees), who have to expend the effort to make sense of it all. A question, though, is whether these kinds of transcripts cause students to change their behavior in ways that help restore some of the information content in grades. Lots of other interesting things to consider as part of the design problem, including how an optimal grading scheme should combine information on a student’s absolute versus relative (to other students in the class) performance.
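To make the transcript proposal concrete, here is a hypothetical sketch of the kind of entry being described, reporting the class median grade and the share of A’s next to the student’s own grade (the function name and output format are my own invention, not the committee’s):

```python
from statistics import median

# Map letter grades to grade points (standard 4.0 scale).
GPA = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def transcript_entry(student_grade: str, class_grades: list[str]) -> str:
    """Render one contextualized transcript line for a class."""
    points = [GPA[g] for g in class_grades]
    med = median(points)
    pct_a = 100 * sum(g == "A" for g in class_grades) / len(class_grades)
    return f"{student_grade} (class median {med:.1f}, {pct_a:.0f}% A's)"

# An A in a class where most students got A's carries less information:
print(transcript_entry("A", ["A", "A", "A", "B", "C"]))
# -> A (class median 4.0, 60% A's)
```

The point of the display is exactly what the committee argues: an A in a class where 60% of students received A’s reads differently from an A in a class where 10% did.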


Clustering, unit level randomization, and insights from multisite trials

Another update to the previous post (link) on clustering of potential outcomes even when randomization occurs at the unit level within clusters: researching the topic a bit more, I discovered that the literature on “multisite trials” addresses precisely these issues. E.g., this paper by Raudenbush and Liu (2000; link) examines the consequences of site-level heterogeneity in outcomes and treatment effects. They formalize a balanced multisite experiment with a hierarchical linear model, $latex Y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + r_{ij}$, where $latex r_{ij} \sim i.i.d. N(0,\sigma^2)$ and $latex X_{ij}$ is a centered treatment variable (-0.5 for control, 0.5 for treated). In this case, an unbiased estimator for the site-specific treatment effect, $latex \hat \beta_{1j}$, is given by the difference in means between treated and control units at site $latex j$, and the variance of this estimator over repeated experiments in different sites is $latex \tau_{11} + 4\sigma^2/n$, where $latex \tau_{11}$ is the variance of the $latex \beta_{1j}$’s across sites and $latex n$ is the (constant) number of units at each site. Then, an unbiased estimator for the average treatment effect over all sites, $latex 1,\ldots,J$, is simply the average of these site-specific estimates, with variance $latex \frac{\tau_{11} + 4\sigma^2/n}{J}$.

What distinguishes this model from the one that I examined in the previous post is that once the site-specific intercept is taken into account, there is no residual clustering (hence the i.i.d. $latex r_{ij}$’s). Also, heterogeneity in treatment effects is expressed as a simple random effect (implying constant within-group correlation conditional on treatment status). These assumptions are what deliver the clean and simple expression for the variance of the site-specific treatment effect estimator, which may understate the variance in the situations I examined, where residual clustering was present. It would be useful to study how well this expression approximates what happens under the more complicated data generating process that I set up.
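The variance expression is easy to check by simulation under the Raudenbush and Liu setup (the parameter values below are arbitrary): simulate many balanced multisite experiments, estimate the ATE as the average of site-level differences in means, and compare the empirical variance of the estimator to $latex \frac{\tau_{11} + 4\sigma^2/n}{J}$.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 20, 10                        # sites; units per site (half treated)
beta1, tau11, sigma2 = 2.0, 0.5, 1.0  # mean effect, effect variance, error variance
reps = 5000

ate_hat = np.empty(reps)
for r in range(reps):
    # Site-specific effects: beta_1j ~ N(beta1, tau11).
    b1j = rng.normal(beta1, np.sqrt(tau11), size=J)
    site_est = np.empty(J)
    for j in range(J):
        # X = +0.5 treated, -0.5 control; r_ij ~ N(0, sigma2).
        y_t = b1j[j] * 0.5 + rng.normal(0, np.sqrt(sigma2), n // 2)
        y_c = b1j[j] * -0.5 + rng.normal(0, np.sqrt(sigma2), n // 2)
        site_est[j] = y_t.mean() - y_c.mean()  # unbiased for beta_1j
    ate_hat[r] = site_est.mean()               # average of site estimates

theory = (tau11 + 4 * sigma2 / n) / J          # = 0.045 here
print(round(ate_hat.var(), 4), round(theory, 4))
```

With these settings the empirical variance of the ATE estimator across replications should sit right on top of the theoretical value of 0.045; adding residual clustering within sites (correlated $latex r_{ij}$’s) is the natural next step for checking how badly the formula understates the variance.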


Regression discontinuity designs and endogeneity

The Social Science Statistics blog posts a working paper by Daniel Carpenter, Justin Grimmer, Eitan Hersh, and Brian Feinstein on possible endogeneity problems with close electoral margins as a source of causal identification in regression discontinuity studies (link). In their abstract, they summarize their findings as follows:

In this paper we suggest that marginal elections may not be as random as RDD analysts suggest. We draw upon the simple intuition that elections that are expected to be close will attract greater campaign expenditures before the election and invite legal challenges and even fraud after the election. We present theoretical models that predict systematic divergences between winners and losers, even in elections with the thinnest victory margins. We test predictions of our models on a dataset of all House elections from 1946 to 1990. We demonstrate that candidates whose parties hold structural advantages in their district are systematically more likely to win close elections. Our findings call into question the use of close elections for causal inference and demonstrate that marginal elections mask structural advantages that are troubling normatively.

A recent working paper by Urquiola and Verhoogen draws similar conclusions about non-random sorting in studies that use RDDs to study the effects of class size on student performance (link).

The problem here is that the values of the forcing variable assigned to individuals are endogenous to complex processes that, very likely, are driven by the anticipated gains or losses associated with crossing the cut-off point that defines the discontinuity. Though such is not the case in the above examples, the value of the cut-off itself can also be endogenous. Causal identification requires that the processes determining the values of the forcing variable and the cut-off are not confounding. What these papers indicate is that RDD analysts need a compelling story for why this is the case. (In other words, they need to demonstrate positive identification [link].)

This can be subtle. As both Carpenter et al. and Urquiola and Verhoogen demonstrate, it’s useful to think of this as a mechanism design problem. Take a simple example drawing on the “original” application of RD: test scores used to determine eligibility for extra tutoring assistance. Suppose two students are told that they will take a diagnostic test at the beginning of the year and that the one with the lower score will receive extra assistance during the year, with a tie broken by a coin flip. At the end of the year they will both take a final exam that determines whether they win a scholarship for the following year. The mechanism induces a race to the bottom: both students have an incentive to flunk the diagnostic test, each scoring 0, in which case they have a 50-50 chance of getting the help that might increase their chances of landing a scholarship. Interestingly, this actually provides a nice identifying condition. But suppose only one of the students is quick enough to learn the optimal strategy and the other is a little slow. Then the slow student would put in sincere effort, score above 0, and guarantee that the quick-to-learn student got the tutoring assistance. Repeat this process many times, and you systematically have quick learners below the “cut-off” and slow learners above it, generating a biased estimate of the average effect of tutoring in the neighborhood of the cut-point. For the RD to produce what it purports to, you need a mechanism that induces sincere effort (and, as Urquiola and Verhoogen have discussed, a test that minimizes mean-reversion effects).
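The bias in this thought experiment can be made concrete with a toy simulation (all numbers hypothetical): if quick learners always game their way below the cut-off, a local comparison of tutored and untutored students attributes the quick learners' ability advantage to tutoring.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 50_000
tau = 5.0  # true tutoring effect on the final exam (hypothetical)

# In each pair, the "quick" student games the diagnostic (scores 0) and so
# always lands below the cut-off and receives tutoring; the "slow" student
# puts in sincere effort and scores above 0.
quick_ability = rng.normal(60, 10, n_pairs)  # quick learners are abler...
slow_ability = rng.normal(50, 10, n_pairs)   # ...by 10 points on average

final_tutored = quick_ability + tau  # tutored group = all quick learners
final_untutored = slow_ability       # untutored group = all slow learners

# Naive "RD" comparison around the cut-off:
rd_estimate = final_tutored.mean() - final_untutored.mean()
print(rd_estimate)  # ~15: the 10-point ability gap is attributed to tutoring
```

The naive comparison recovers roughly 15 rather than the true effect of 5, because the sorting mechanism has made learner type a confounder at the threshold, which is exactly the Carpenter et al. worry about structural advantages in close elections.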

UPDATE: A new working paper by Caughey and Sekhon (link) provides even more evidence about problems with close elections as a source of identification for RDD studies.  They provide some recommendations (shortened here; the full phrasing is available in the paper):

  • The burden is on the researcher to…identify and collect accurate data on the observable covariates most likely to reveal sorting at the cut-point. [A] good rule of thumb is to always check lagged values of the treatment and response variables.
  • Careful attention must be paid to the behavior of the data in the immediate neighborhood of the cut-point.  [Our analysis] reveals that the trend towards convergence evident in wider windows reverses close to the cut-point, a pattern that may occur whenever a…treatment is assigned via a competitive process with a known threshold.
  • Automated bandwidth- and specification-selection algorithms are no sure solution.  In our case, for example, the methods recommended in the literature select local linear regression bandwidths that are an order of magnitude larger than the window in which covariate imbalance is most obvious.
  • It is…incumbent upon the researcher to demonstrate the theoretical relevance of quasi-experimental causal estimates.