Reading: 7 Properties of Good Models (Gabaix & Laibson, 2008)

This short essay argues that the following criteria should be used to judge whether an analytical economic model is good or not:

  1. parsimony, viz., minimal assumptions and parameters, to reduce risk of overfitting. This would seem to be the essence of modeling, right?
  2. tractability.
  3. conceptual insightfulness, which in the authors’ characterization bears some resemblance to Lakatos’s axiom that a scientific theory should produce “novel facts”.
  4. generalizability.
  5. falsifiability.
  6. empirical consistency.
  7. predictive precision, which is a necessary complement to falsifiability and empirical consistency: a model that makes vague predictions may hold up against the data, but a more useful model might be one that makes sharp predictions that are only slightly off from the data.

The authors acknowledge that these criteria may conflict, forcing trade-offs. Special tensions would seem to arise between parsimony/tractability and falsifiability/empirical consistency/predictive precision.

In their discussion, the authors claim that economic models should not be judged on whether they satisfy optimization axioms. They wish to create space for models that allow a separation between the normative preferences of agents and the actions that they ultimately take—the separation may be due to non-voluntary errors, biases, or emotions. Abandoning optimization axioms means that behavior does not immediately reveal preferences, which complicates normative analysis. The authors accept this, claiming that instead, we should specify models that incorporate parameters capturing non-voluntary processes, and then use data to identify “latent” preferences after conditioning on estimates of these parameters.

Full reference: Gabaix, Xavier, and David I. Laibson. 2008. “The Seven Properties of Good Models.” In The Foundations of Positive and Normative Economics, ed. Andrew Caplin and Andrew Schotter, 292–99. New York: Oxford University Press.

Ungated link:


Nuanced study of local politics and deforestation in Indonesia

From a new working paper on “The Political Economy of Deforestation in the Tropics” by Robin Burgess, Matthew Hansen, Benjamin Olken, Peter Potapov, and Stefanie Sieber (link),

Logging of tropical forests accounts for almost one-fifth of greenhouse gas emissions worldwide, significantly degrades rural livelihoods and threatens some of the world’s most diverse ecosystems. This paper demonstrates that local-level political economy substantially affects the rate of tropical deforestation in Indonesia. Using a novel MODIS satellite-based dataset that tracks annual changes in forest cover over an 8-year period, we find three main results. First, we show that local governments engage in Cournot competition with one another in determining how much wood to extract from their forests, so that increasing numbers of political jurisdictions leads to increased logging. Second, we demonstrate the existence of “political logging cycles,” where illegal logging increases dramatically in the years leading up to local elections. Third, we show that, for local government officials, logging and other sources of rents are short-run substitutes, but that this effect disappears over time as the political equilibrium shifts. The results document substantial deviations from optimal logging practices and demonstrate how the economics of corruption can drive natural resource extraction.
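The Cournot logic behind the first result has a standard comparative static: as the number of symmetric players grows, total extraction rises toward the competitive level. A minimal sketch using a textbook linear-demand Cournot model (the demand and cost numbers are illustrative, not from the paper):

```python
# Symmetric Cournot: inverse demand P = a - b*Q, constant marginal cost c.
# Each of N jurisdictions extracts q* = (a - c) / (b * (N + 1)), so total
# extraction Q(N) = N * (a - c) / (b * (N + 1)) is increasing in N.
a, b, c = 100.0, 1.0, 20.0  # hypothetical demand/cost parameters

def total_extraction(n_jurisdictions: int) -> float:
    """Equilibrium total quantity logged with N symmetric jurisdictions."""
    return n_jurisdictions * (a - c) / (b * (n_jurisdictions + 1))

totals = [total_extraction(n) for n in (1, 2, 4, 8)]
# More jurisdictions -> more total logging, approaching (a - c)/b = 80.
assert all(q1 < q2 for q1, q2 in zip(totals, totals[1:]))
print(totals)
```

Splitting a district thus raises aggregate extraction even with no change in preferences, which is the paper's "increasing numbers of political jurisdictions leads to increased logging" channel.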

There’s lots to like about the paper, including a well-identified causal story. (They were lucky that others had already done most of the leg-work needed to demonstrate this.) It is also a timely contribution, as Indonesia is one of the pilot cases for the new global REDD initiative to deal with greenhouse gas build-up through forest protection “carbon credits” (link). This kind of “diagnostic” research can determine intervention points that should be targeted by future programs aiming to promote forest conservation. It’s already a long paper, but their case would be strengthened if they provided some narrative accounts that demonstrated the plausibility of their interpretation of the data.


Mechanisms to deal with grade inflation

New York Times covers measures recommended by a UNC committee, led by sociologist Andrew Perrin, to deal with grade inflation (link). The suggestions include issuing a statement on the appropriate proportion of students in each class that should receive A’s and also having students’ transcripts include information on a class’s grade distribution (e.g., the class median grade or the percentage of A’s) next to a student’s grade for that class.

This is an interesting design problem. For graduate school admissions, as grades become less informative signals of quality, standardized tests would presumably receive extra weight. This puts a lot of stress on standardized tests, and it’s not clear that, e.g., the GREs are up to the job, given that they are meant to screen for such a broad range of application types. Witness the amount of heaping that takes place at the upper end of the score range for the quantitative section of the GRE. Ultimately this introduces a lot of arbitrariness into the graduate admissions process.

The solution of adding extra information to transcripts is reasonable given the constraints. But it passes the buck to admissions committees (and other committees, such as scholarship committees), who have to expend the effort to make sense of it all. A question, though, is whether these kinds of transcripts cause students to change their behavior in ways that help restore some of the information content in grades. Lots of other interesting things to consider as part of the design problem, including how an optimal grading scheme should combine information on a student’s absolute versus relative (to other students in the class) performance.
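To make the transcript proposal concrete, here is a hypothetical sketch of the kind of entry being described, reporting the class median grade and the share of A’s next to the student’s own grade (the function name and output format are my own invention, not the committee’s):

```python
from statistics import median

# Map letter grades to grade points (standard 4.0 scale).
GPA = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def transcript_entry(student_grade: str, class_grades: list[str]) -> str:
    """Render one contextualized transcript line for a class."""
    points = [GPA[g] for g in class_grades]
    med = median(points)
    pct_a = 100 * sum(g == "A" for g in class_grades) / len(class_grades)
    return f"{student_grade} (class median {med:.1f}, {pct_a:.0f}% A's)"

# An A in a class where most students got A's carries less information:
print(transcript_entry("A", ["A", "A", "A", "B", "C"]))
# -> A (class median 4.0, 60% A's)
```

The point of the display is exactly what the committee argues: an A in a class where 60% of students received A’s reads differently from an A in a class where 10% did.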


Clustering, unit level randomization, and insights from multisite trials

Another update to the previous post (link) on clustering of potential outcomes even when randomization occurs at the unit level within clusters: researching the topic a bit more, I discovered that the literature on “multisite trials” addresses precisely these issues. E.g., this paper by Raudenbush and Liu (2000; link) examines the consequences of site-level heterogeneity in outcomes and treatment effects. They formalize a balanced multisite experiment with a hierarchical linear model, $latex Y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + r_{ij}$, where $latex r_{ij} \sim i.i.d. N(0,\sigma^2)$ and $latex X_{ij}$ is a centered treatment variable (-0.5 for control, 0.5 for treated). In this case, an unbiased estimator for the site-specific treatment effect, $latex \hat \beta_{1j}$, is given by the difference in means between treated and control units at site $latex j$, and the variance of this estimator over repeated experiments in different sites is $latex \tau_{11} + 4\sigma^2/n$, where $latex \tau_{11}$ is the variance of the $latex \beta_{1j}$’s across sites and $latex n$ is the (constant) number of units at each site. Then, an unbiased estimator for the average treatment effect over all sites, $latex 1,\ldots,J$, is simply the average of these site-specific estimates, with variance $latex \frac{\tau_{11} + 4\sigma^2/n}{J}$.

What distinguishes this model from the one that I examined in the previous post is that once the site-specific intercept is taken into account, there is no residual clustering (hence the i.i.d. $latex r_{ij}$’s). Also, heterogeneity in treatment effects is expressed as a simple random effect (implying constant within-group correlation conditional on treatment status). These assumptions are what deliver the clean and simple expression for the variance of the site-specific treatment effect estimator, which may understate the variance in the situations I examined, where residual clustering was present. It would be useful to study how well this expression approximates what happens under the more complicated data generating process that I set up.
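The variance expression is easy to check by simulation under the Raudenbush and Liu setup (the parameter values below are arbitrary): simulate many balanced multisite experiments, estimate the ATE as the average of site-level differences in means, and compare the empirical variance of the estimator to $latex \frac{\tau_{11} + 4\sigma^2/n}{J}$.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 20, 10                        # sites; units per site (half treated)
beta1, tau11, sigma2 = 2.0, 0.5, 1.0  # mean effect, effect variance, error variance
reps = 5000

ate_hat = np.empty(reps)
for r in range(reps):
    # Site-specific effects: beta_1j ~ N(beta1, tau11).
    b1j = rng.normal(beta1, np.sqrt(tau11), size=J)
    site_est = np.empty(J)
    for j in range(J):
        # X = +0.5 treated, -0.5 control; r_ij ~ N(0, sigma2).
        y_t = b1j[j] * 0.5 + rng.normal(0, np.sqrt(sigma2), n // 2)
        y_c = b1j[j] * -0.5 + rng.normal(0, np.sqrt(sigma2), n // 2)
        site_est[j] = y_t.mean() - y_c.mean()  # unbiased for beta_1j
    ate_hat[r] = site_est.mean()               # average of site estimates

theory = (tau11 + 4 * sigma2 / n) / J          # = 0.045 here
print(round(ate_hat.var(), 4), round(theory, 4))
```

With these settings the empirical variance of the ATE estimator across replications should sit right on top of the theoretical value of 0.045; adding residual clustering within sites (correlated $latex r_{ij}$’s) is the natural next step for checking how badly the formula understates the variance.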


Regression discontinuity designs and endogeneity

The Social Science Statistics blog posts a working paper by Daniel Carpenter, Justin Grimmer, Eitan Hersh, and Brian Feinstein on possible endogeneity problems with close electoral margins as a source of causal identification in regression discontinuity studies (link). In their abstract, they summarize their findings as follows:

In this paper we suggest that marginal elections may not be as random as RDD analysts suggest. We draw upon the simple intuition that elections that are expected to be close will attract greater campaign expenditures before the election and invite legal challenges and even fraud after the election. We present theoretical models that predict systematic divergences between winners and losers, even in elections with the thinnest victory margins. We test predictions of our models on a dataset of all House elections from 1946 to 1990. We demonstrate that candidates whose parties hold structural advantages in their district are systematically more likely to win close elections. Our findings call into question the use of close elections for causal inference and demonstrate that marginal elections mask structural advantages that are troubling normatively.

A recent working paper by Urquiola and Verhoogen draws similar conclusions about non-random sorting in studies that use RDDs to study the effects of class size on student performance (link).

The problem here is that the values of the forcing variable assigned to individuals are endogenous to complex processes that, very likely, are driven by the anticipated gains or losses associated with crossing the cut-off point that defines the discontinuity. Though such is not the case in the above examples, the value of the cut-off itself can also be endogenous. Causal identification requires that the processes determining the values of the forcing variable and the cut-off are not confounding. What these papers indicate is that RDD analysts need a compelling story for why this is the case. (In other words, they need to demonstrate positive identification [link].)

This can be subtle. As both Carpenter et al. and Urquiola and Verhoogen demonstrate, it’s useful to think of this as a mechanism design problem. Take a simple example drawing on the “original” application of RD: test scores used to determine eligibility for extra tutoring assistance. Suppose two students are told that they will take a diagnostic test at the beginning of the year and that the one with the lower score will receive extra assistance during the year, with a tie broken by a coin flip. At the end of the year they will both take a final exam that determines whether they win a scholarship for the following year. The mechanism induces a race to the bottom: both students have an incentive to flunk the diagnostic test, each scoring 0, in which case they have a 50-50 chance of getting the help that might increase their chances of landing a scholarship. Interestingly, this actually provides a nice identifying condition. But suppose only one of the students is quick enough to learn the optimal strategy and the other is a little slow. Then the slow student would put in sincere effort, score above 0, and guarantee that the quick-to-learn student got the tutoring assistance. Repeat this process many times, and you systematically have quick learners below the “cut-off” and slow learners above it, generating a biased estimate of the average effect of tutoring in the neighborhood of the cut-point. For the RD to produce what it purports to, you need a mechanism that induces sincere effort (and, as Urquiola and Verhoogen have discussed, a test that minimizes mean-reversion effects).
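The bias in this thought experiment can be made concrete with a toy simulation (all numbers hypothetical): if quick learners always game their way below the cut-off, a local comparison of tutored and untutored students attributes the quick learners' ability advantage to tutoring.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 50_000
tau = 5.0  # true tutoring effect on the final exam (hypothetical)

# In each pair, the "quick" student games the diagnostic (scores 0) and so
# always lands below the cut-off and receives tutoring; the "slow" student
# puts in sincere effort and scores above 0.
quick_ability = rng.normal(60, 10, n_pairs)  # quick learners are abler...
slow_ability = rng.normal(50, 10, n_pairs)   # ...by 10 points on average

final_tutored = quick_ability + tau  # tutored group = all quick learners
final_untutored = slow_ability       # untutored group = all slow learners

# Naive "RD" comparison around the cut-off:
rd_estimate = final_tutored.mean() - final_untutored.mean()
print(rd_estimate)  # ~15: the 10-point ability gap is attributed to tutoring
```

The naive comparison recovers roughly 15 rather than the true effect of 5, because the sorting mechanism has made learner type a confounder at the threshold, which is exactly the Carpenter et al. worry about structural advantages in close elections.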

UPDATE: A new working paper by Caughey and Sekhon (link) provides even more evidence about problems with close elections as a source of identification for RDD studies.  They provide some recommendations (shortened here; the full phrasing is available in the paper):

  • The burden is on the researcher to…identify and collect accurate data on the observable covariates most likely to reveal sorting at the cut-point. [A] good rule of thumb is to always check lagged values of the treatment and response variables.
  • Careful attention must be paid to the behavior of the data in the immediate neighborhood of the cut-point.  [Our analysis] reveals that the trend towards convergence evident in wider windows reverses close to the cut-point, a pattern that may occur whenever a…treatment is assigned via a competitive process with a known threshold.
  • Automated bandwidth- and specification-selection algorithms are no sure solution.  In our case, for example, the methods recommended in the literature select local linear regression bandwidths that are an order of magnitude larger than the window in which covariate imbalance is most obvious.
  • It is…incumbent upon the researcher to demonstrate the theoretical relevance of quasi-experimental causal estimates.