Mechanisms to deal with grade inflation

The New York Times covers measures recommended by a UNC committee, led by sociologist Andrew Perrin, to deal with grade inflation (link). The suggestions include issuing a statement on the appropriate proportion of students in each class who should receive A’s, and having students’ transcripts include information on a class’s grade distribution (e.g., the class median grade or the percentage of A’s) next to the student’s grade for that class.

This is an interesting design problem. For graduate school admissions, as grades become less informative signals of quality, the likely result is that standardized tests receive extra weight. That puts a lot of stress on standardized tests, and it’s not clear that, e.g., the GRE is up to the job, given that it is meant to screen for such a broad range of applicant types. Witness the amount of heaping at the upper end of the score range on the quantitative section of the GRE. Ultimately this introduces a lot of arbitrariness into the graduate admissions process.

The solution of adding extra information to transcripts is reasonable given the constraints. But it passes the buck to admissions committees (and other bodies, such as scholarship committees), which have to expend the effort to make sense of it all. A question, though, is whether these kinds of transcripts would lead students to change their behavior in ways that restore some of the information content of grades. There are lots of other interesting things to consider as part of the design problem, including how an optimal grading scheme should combine information on a student’s absolute performance with information on performance relative to other students in the class.
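
As a minimal sketch of what a contextualized transcript entry might compute (and of how absolute and relative standing could be reported side by side), here is a toy example; the grades, the 4.0 scale, and the rule counting 3.7 and above as an A are all assumptions made for illustration:

    import numpy as np

    # A minimal sketch of the contextualized-transcript idea (hypothetical grades
    # on a 4.0 scale; counting 3.7+ as an "A" is an assumption for the example).
    class_grades = np.array([4.0, 4.0, 3.7, 3.7, 3.3, 3.3, 3.0, 2.7])
    student_grade = 3.7

    class_median = np.median(class_grades)                     # absolute context
    pct_a = 100 * np.mean(class_grades >= 3.7)                 # share of A's in the class
    percentile = 100 * np.mean(class_grades <= student_grade)  # relative standing

    print(f"grade: {student_grade}  class median: {class_median}  "
          f"A's: {pct_a:.0f}%  class percentile: {percentile:.0f}")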


Clustering, unit-level randomization, and insights from multisite trials

Another update to the previous post (link) on clustering of potential outcomes even when randomization occurs at the unit level within clusters: researching the topic a bit more, I discovered that the literature on “multisite trials” addresses precisely these issues. For example, this paper by Raudenbush and Liu (2000; link) examines the consequences of site-level heterogeneity in outcomes and treatment effects.

They formalize a balanced multisite experiment with a hierarchical linear model, $latex Y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + r_{ij}$, where $latex r_{ij}$ is an i.i.d. $latex N(0,\sigma^2)$ error and $latex X_{ij}$ is a centered treatment variable (-0.5 for control, 0.5 for treated). In this case, an unbiased estimator of the site-specific treatment effect, $latex \hat \beta_{1j}$, is given by the difference in means between treated and control units at site $latex j$, and the variance of this estimator over repeated experiments in different sites is $latex \tau_{11} + 4\sigma^2/n$, where $latex \tau_{11}$ is the variance of the $latex \beta_{1j}$’s over sites and $latex n$ is the (constant) number of units at each site. An unbiased estimator of the average treatment effect over all sites, $latex 1,\ldots,J$, is then simply the average of these site-specific estimates, with variance $latex \frac{\tau_{11} + 4\sigma^2/n}{J}$.

What distinguishes this model from the one that I examined in the previous post is that, once the site-specific intercept is taken into account, there is no residual clustering (hence the i.i.d. $latex r_{ij}$’s). Also, heterogeneity in treatment effects is expressed as a simple random effect (implying constant within-group correlation conditional on treatment status). These assumptions are what deliver the clean and simple expression for the variance of the site-specific treatment effect estimator, an expression that may understate the variance in the situations I examined, where residual clustering was present. It would be useful to study how well it approximates what happens under the more complicated data generating process that I set up.
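
As a quick sanity check on these expressions, here is a small Monte Carlo sketch (not from the paper; all parameter values are arbitrary choices for illustration) that simulates the balanced multisite model above and compares the empirical variances of the site-specific and averaged estimators to $latex \tau_{11} + 4\sigma^2/n$ and $latex \frac{\tau_{11} + 4\sigma^2/n}{J}$:

    import numpy as np

    rng = np.random.default_rng(0)

    # Monte Carlo check of the variance formulas for the balanced multisite model.
    # All parameter values below are arbitrary choices for illustration.
    J, n = 50, 20               # sites, units per site (n/2 treated, n/2 control)
    gamma0, gamma1 = 1.0, 0.5   # means of the site intercepts and site effects
    tau00, tau11 = 0.30, 0.20   # between-site variances of beta_0j and beta_1j
    sigma2 = 1.0                # unit-level residual variance

    def one_experiment():
        """Simulate one balanced multisite trial; return the J site-specific estimates."""
        beta0 = gamma0 + rng.normal(0.0, np.sqrt(tau00), J)
        beta1 = gamma1 + rng.normal(0.0, np.sqrt(tau11), J)
        x = np.tile(np.r_[np.full(n // 2, -0.5), np.full(n // 2, 0.5)], (J, 1))
        y = beta0[:, None] + beta1[:, None] * x + rng.normal(0.0, np.sqrt(sigma2), (J, n))
        # Difference in means between treated and control units at each site
        return y[:, n // 2:].mean(axis=1) - y[:, :n // 2].mean(axis=1)

    reps = 5000
    est = np.array([one_experiment() for _ in range(reps)])   # shape (reps, J)

    print("site-specific estimator:", est[:, 0].var(),
          "theory:", tau11 + 4 * sigma2 / n)
    print("averaged estimator:     ", est.mean(axis=1).var(),
          "theory:", (tau11 + 4 * sigma2 / n) / J)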


Regression discontinuity designs and endogeneity

The Social Science Statistics blog posts a working paper by Daniel Carpenter, Justin Grimmer, Eitan Hersh, and Brian Feinstein on possible endogeneity problems with close electoral margins as a source of causal identification in regression discontinuity studies (link). In their abstract, they summarize their findings as follows:

In this paper we suggest that marginal elections may not be as random as RDD analysts suggest. We draw upon the simple intuition that elections that are expected to be close will attract greater campaign expenditures before the election and invite legal challenges and even fraud after the election. We present theoretical models that predict systematic divergences between winners and losers, even in elections with the thinnest victory margins. We test predictions of our models on a dataset of all House elections from 1946 to 1990. We demonstrate that candidates whose parties hold structural advantages in their district are systematically more likely to win close elections. Our findings call into question the use of close elections for causal inference and demonstrate that marginal elections mask structural advantages that are troubling normatively.

A recent working paper by Urquiola and Verhoogen draws similar conclusions about non-random sorting in studies that use RDDs to study the effects of class size on student performance (link).

The problem here is that the values of the forcing variable assigned to individuals are endogenous to complex processes that, very likely, respond to the anticipated gains or losses associated with crossing the cut-off that defines the discontinuity. Though that is not the issue in the examples above, the value of the cut-off itself can also be endogenous. Causal identification requires that the processes determining the forcing variable and the cut-off not be confounding. What these papers indicate is that RDD analysts need a compelling story for why this is the case (in other words, they need to demonstrate positive identification [link]).

This can be subtle. As both Carpenter et al. and Urquiola and Verhoogen demonstrate, it’s useful to think of this as a mechanism design problem. Take a simple example drawing on the “original” application of RD: test scores used to determine eligibility for extra tutoring assistance. Suppose two students are told that they will take a diagnostic test at the beginning of the year, that the one with the lower score will receive extra assistance during the year (with a tie broken by a coin flip), and that at the end of the year both will take a final exam that determines whether they win a scholarship for the following year. The mechanism induces a race to the bottom: both students have an incentive to flunk the diagnostic test, each scoring 0, in which case each has a 50-50 chance of getting the help that might improve his or her chances of landing a scholarship. Interestingly, this actually provides a nice identifying condition.

But suppose only one of the students is quick enough to work out the optimal strategy and the other is a little slow. Then the slow student puts in sincere effort, scores above 0, and guarantees that the quick-to-learn student gets the tutoring assistance. Repeat this process many times and you systematically have quick learners below the “cut-off” and slow learners above it, generating a biased estimate of the average effect of tutoring in the neighborhood of the cut-point. What you need for the RD to deliver what it purports to deliver is a mechanism that induces sincere effort (and, as Urquiola and Verhoogen have discussed, a test that minimizes mean-reversion effects).
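
To make the selection problem concrete, here is a toy simulation of the asymmetric case, one strategic student and one sincere student; the ability distributions and the size of the tutoring effect are hypothetical numbers chosen only to illustrate the bias:

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy simulation of the two-student example: the strategic ("quick") student
    # games the diagnostic test, the sincere ("slow") student does not.  Ability
    # distributions and the tutoring effect are hypothetical numbers for illustration.
    reps = 100_000
    true_effect = 5.0                            # assumed effect of tutoring on the final exam

    quick_ability = rng.normal(60.0, 10.0, reps) # quick learners are also stronger students
    slow_ability = rng.normal(50.0, 10.0, reps)

    quick_diag = np.zeros(reps)                  # strategic play: flunk the diagnostic
    slow_diag = np.clip(slow_ability + rng.normal(0.0, 5.0, reps), 0.0, None)  # sincere effort

    quick_tutored = quick_diag < slow_diag       # lower diagnostic score gets the tutoring

    quick_final = quick_ability + true_effect * quick_tutored
    slow_final = slow_ability + true_effect * ~quick_tutored

    tutored = np.where(quick_tutored, quick_final, slow_final)
    untutored = np.where(quick_tutored, slow_final, quick_final)

    # Everyone is "near" the cut-off here, yet the comparison is badly confounded
    print("estimated tutoring effect:", tutored.mean() - untutored.mean(),
          "   true effect:", true_effect)

Because the strategic student almost always lands below the cut-off, the tutored group is made up of systematically stronger students, and the naive comparison at the threshold overstates the tutoring effect.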

UPDATE: A new working paper by Caughey and Sekhon (link) provides further evidence of problems with close elections as a source of identification for RDD studies. They offer some recommendations (shortened here; the full phrasing is available in the paper), and a toy sketch of the first one follows the list:

  • The burden is on the researcher to…identify and collect accurate data on the observable covariates most likely to reveal sorting at the cut-point. [A] good rule of thumb is to always check lagged values of the treatment and response variables.
  • Careful attention must be paid to the behavior of the data in the immediate neighborhood of the cut-point.  [Our analysis] reveals that the trend towards convergence evident in wider windows reverses close to the cut-point, a pattern that may occur whenever a…treatment is assigned via a competitive process with a known threshold.
  • Automated bandwidth- and specification-selection algorithms are no sure solution.  In our case, for example, the methods recommended in the literature select local linear regression bandwidths that are an order of magnitude larger than the window in which covariate imbalance is most obvious.
  • It is…incumbent upon the researcher to demonstrate the theoretical relevance of quasi-experimental causal estimates.
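
As a toy illustration of the first recommendation, the sketch below checks a lagged covariate for balance in shrinking windows around the cut-point; the data are simulated and the variable names hypothetical:

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy illustration: check a lagged covariate (here, the previous election's
    # margin) for balance in shrinking windows around the cut-point.  Everything
    # is simulated and hypothetical; with real House data, the question is whether
    # the gap vanishes as the window narrows (Caughey and Sekhon report it does not).
    n = 10_000
    lagged_margin = rng.normal(0.0, 0.15, n)                  # previous-race margin
    margin = 0.5 * lagged_margin + rng.normal(0.0, 0.10, n)   # current-race margin

    for h in (0.10, 0.05, 0.01):                              # shrinking bandwidths
        in_window = np.abs(margin) < h
        winners = in_window & (margin > 0)
        losers = in_window & (margin <= 0)
        gap = lagged_margin[winners].mean() - lagged_margin[losers].mean()
        print(f"bandwidth {h:.2f}: winner-minus-loser gap in lagged margin = {gap:+.4f}")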

“Psi” as a teachable moment on methods and data analysis

In case you haven’t followed the chatter about Daryl Bem’s forthcoming paper on evidence of “precognition” and “premonition” (a.k.a. “psi” effects, or more colloquially, psychic intelligence), you can read a synopsis at the Freakonomics blog (link). The comments on the blog page are quite amusing. More interesting is how Wagenmakers et al. have leapt on this as a “teachable moment” for discussing perils and pitfalls in common modes of contemporary data analysis.
