Survey experiments and the credibility revolution

Carolina Torreblanca, William Dinneen, Guy Grossman, and Yiqing Xu have an impressive new study on trends in quantitative research design in political science—specifically, on whether the field has truly undergone a “credibility revolution.”
Read the working draft here.

One finding jumps off the page: the sheer rise of survey experiments. By their count, survey experiments now make up nearly half of all design-based explanatory quantitative research—the category they treat as the hallmark of credibility-revolution work.

That conclusion deserves more scrutiny. I do not think survey experiments, in general, are emblematic of the credibility revolution. Some are already treating them as such, and that risks muddling two distinct intellectual streams. If we want to track the credibility revolution, we need at least to separate survey experiments that estimate the effects of real-world interventions from those that are essentially measurement tools or light-touch priming exercises, although even then I am skeptical of the connection.

To be clear: I use survey experiments, RCTs, observational-causal methods, and descriptive analysis in my own work. This is not about elevating or denigrating any one method. It is about clarifying what belongs to which intellectual lineage. For a broader statement of this view, see my “problem solving” essay:
read the preprint here.

Intellectual origins matter

Small priming experiments have a long history in psychology and behavioral science. They are the direct ancestors of today’s survey experiments. Political science has always been an eclectic discipline, drawing from psychology, sociology, economics, and more. Even without the identification-strategy turn, political science would still have survey experiments—perhaps fewer, but they would be here. Their growth is better understood as a side effect of the credibility revolution, not a direct indication of its spread.

This is why Angrist and Pischke never talk about survey experiments. Their work is about how to study big, messy policy processes with disciplined empirical strategies. As Adi Dasgupta put it on BlueSky (@adasgupta.bsky.social): the “credibility revolution was about empirical strategies to overcome real-world endogeneity.” Survey experiments are simply not what the credibility revolution was proposing.

The credibility revolution pushes against “only experiments are causal”

Many who do survey experiments convey an old-school psychology view: only tightly controlled experiments identify causal effects; observational work does not, and field experiments are too messy. The credibility revolution directly challenges that mindset. It showed that careful observational designs and real-world RCTs can credibly identify, albeit with well understood limitations, the effects of major interventions and policies—problems once assumed to be too uncontrolled to study rigorously.

Some psychologists today are trying to bring this insight into their own field: as I understand it, they are trying to usher in a credibility revolution in psychology by moving colleagues beyond the “experiments or nothing” framework and teaching the logic of identification strategies.

Two categories of survey experiments

Most survey experiments fall into one of two buckets:

  1. Measurement-oriented experiments
    These include conjoint designs, list experiments, endorsement experiments, randomized-response techniques, and many priming interventions. They are invaluable tools. But their goal is not to estimate the effects of real-world interventions or policies.
  2. Survey-platform RCTs that mimic real interventions
    These are experiments targeting causal effects relevant to policy design. They belong on a different track and have a stronger relationship to the credibility revolution, but even then they do not necessarily reflect the willingness to study change in the real world.

These distinctions matter. Lumping everything together both inflates the apparent reach of the credibility revolution and obscures its actual advance: empirical strategies that credibly estimate causal effects in complex real-world settings.

Reporting the treated group mean along with DID estimates of the ATT

A student and I were discussing what descriptive statistics to report in a regression table—alongside effect estimates from a DID study—to help readers better interpret the findings. As an analogy, in a randomized experiment we often report the mean of the control group to contextualize the treatment effect estimate (e.g., to interpret it as a percentage change or in terms of outcome levels).

We noted that reporting either the pre- or post-treatment *control* group mean does not make much sense for a standard DID targeting the ATT, since the levels of the control group outcomes have no particular relevance to the magnitude of the treatment effect. In the toy example below, for instance, the magnitude of the estimated treatment effect (the ATT) is larger than both the pre- and post-treatment control group means, even though the outcomes are restricted to be positive. But this is not, in itself, a problem.

[Figure: difference-in-differences toy example]

For a standard DID study targeting an ATT, it seems most appropriate to report the *treated group’s post-treatment* mean. What you would be saying is: “Here is what we observe for the treated group. The ATT tells us how the observed value compares to what would have happened, counterfactually, had the treatment not been applied.”

In the toy sketch, if we were reporting event-study estimates, then for period 3 we would report the treated group mean of zero alongside the ATT estimate of −8. This communicates that we observe a level of 0 for the treated group, and that this is 8 units lower than what we would have observed absent treatment. If instead we were pooling the post-treatment periods, we would report the treated group mean of 2/3 alongside the ATT estimate of -6.67 (the average of the period-specific ATTs for periods 1, 2, and 3). For a staggered DID, if the estimator averages over cohorts, one could similarly report the average post-treatment treated-group mean across cohorts.

Of course, one could also report the implied counterfactual mean (i.e., what would have happened in the absence of treatment) along with the ATT. This gives an interpretation closer to what we are used to seeing from randomized experiments. I prefer reporting the observed treated-group mean with the ATT, however, because it presents an actually observed quantity (subject to sampling variability, of course) alongside a model-based estimate. This also has a practical benefit: in robustness checks with different model specifications, the treated-group mean remains fixed while the estimated ATTs vary, providing a stable reference point. The counterfactual mean, by contrast, would vary with each specification.
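As a minimal sketch of what this reporting looks like in the simplest two-group, two-period case (the numbers below are made up for illustration and are not the toy example in the figure):

```python
# Hypothetical group-by-period means for a two-group, two-period DID (made-up numbers).
treated_pre, treated_post = 8.0, 2.0
control_pre, control_post = 6.0, 7.0

# Standard two-by-two DID estimate of the ATT.
att = (treated_post - treated_pre) - (control_post - control_pre)  # -7.0

# Quantities one might report alongside the ATT:
observed_treated_post_mean = treated_post          # observed; stays fixed across specifications
implied_counterfactual_mean = treated_post - att   # model-based; varies with the specification

print(f"ATT estimate: {att}")
print(f"Observed treated post-treatment mean: {observed_treated_post_mean}")
print(f"Implied counterfactual (untreated) mean: {implied_counterfactual_mean}")
```

The last two print statements correspond to the two reporting options discussed above; my preference is the observed treated post-treatment mean, since it does not move when the specification changes.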

This is all fairly simple, but I don’t think I have often seen people report the observed post-treatment treated-group mean in DID regression tables. If you know of examples, I’d be happy to see them.

Re-reading “Science, the endless frontier” at a time when our national research institutions are under attack

For those reflecting on the attack on federal support for science, it is worth going back to Vannevar Bush’s original arguments from 75 years ago that created this very system of federal support. You can get the text from the NSF website (for now at least…): PDF

Bush reminds readers that the Allies barely won WWII, and only did so thanks to the contributions of émigré scientists who came up in Europe’s research networks.

For any American who questions the value of government support for research, ask them how the Allies won WWII. Everyone saw Oppenheimer, but ask them if they know of, e.g., Abraham Wald or Jerzy Neyman. There are many, many more names one could bring up.

The US system, as it was configured at the time of WWII, was not able to reproduce such talent. Bush’s project was to think through how the US, as a society, could do so as a strategic imperative, and he determined that expanded government support was crucial.

Bush continues by discussing wartime advances in addressing infection and disease, stimulated by concerted government investment. He saw in this the potential for government both to spark rapid innovation and to support the steady work that allows such sparks to grow.

He famously singled out “basic science,” the pursuit of knowledge for knowledge’s sake, as creating the raw materials for applied innovation as a byproduct.

Business incentives and the modest scale of private foundations meant that basic research was likely to be underprovided without government support.

Moreover, talent was being wasted because too many people lacked the opportunity or means to pursue their talents and gain advanced training. Government support could unlock such talent, to the enormous benefit of society.

He proposed a system of support for applied research run out of departmental agencies, and a separate National Science Institute (which would become the National Science Foundation) to support basic research. Universities would be the core partners because their mission prioritized truth-seeking.

The arguments remain convincing to this day. Re-reading Bush’s words does make one long for the sentiment they convey of broad-based cooperation toward building a safer and healthier society.

Can we quantify the value of a proposed experiment ex ante?

Suppose we propose an experiment to estimate the effect of a treatment X on an outcome Y, where Y is a binary 0/1 outcome. Call that effect b. If the base rate (the untreated mean of Y) is p, then, because the treated mean must lie between 0 and 1, the effect b is bounded between -p and 1-p.

Our decision problem is whether the treatment should be adopted. If we adopt it, the reward is b; if not, the reward is 0.

Suppose that if we run the experiment, we learn b, in which case we can make the optimal choice, adopting the treatment only when b>0. Thus if we learn b we can ensure a payoff of max(0,b).

Suppose we collect priors on the value of b, giving rise to a prior distribution F(b). Does this prior distribution tell us anything about the value of running the proposed experiment to estimate b?

For example, it could be that F(b) is a spike distribution at b=0, implying ex ante consensus that X has no effect on Y. Suppose we act on the prior and do not implement the treatment. If the prior is exactly right, we obtain the optimal payoff of 0. If the prior was wrong in the other direction, such that b is actually negative, then again we secure the optimal payoff of 0. But if the prior was unduly pessimistic, acting on it could mean missing out on some gain b>0, which, in the absence of any other information, could be as large as 1-p. The maximum potential regret from not having run the experiment is therefore 1-p. (Regret in this worst case is the optimal 1-p minus the realized 0.)

Now suppose another treatment with effect c on the same outcome, for which the base rate is still p. As before, if we ran the experiment for this treatment we could secure the optimal payoff of max(0,c). Suppose, however, that for this second treatment the priors we obtain give rise to a distribution G(c) that is uniform over the interval from 0 to 1-p. The prior expectation is (1-p)/2 > 0. If we acted on the basis of our prior, we would adopt the treatment, securing payoff c. If c>0, we are better off for having done so. If c<0, we would have made a mistake acting on the prior and could do as badly as -p. The maximum potential regret from not having run the experiment is p. (Regret in this worst case is the optimal 0 minus the realized -p.) Under all of the assumptions above, including the different priors, the comparative value of experimenting on the first treatment versus the second depends on p.
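A minimal sketch of this worst-case-regret comparison, just restating the bounds above in code (the example base rates are arbitrary):

```python
def max_regret_spike_prior(p):
    """Spike prior at b = 0: acting on the prior means not adopting (payoff 0).
    Worst case: the true effect is as large as possible, b = 1 - p,
    so the regret from not having run the experiment is (1 - p) - 0."""
    return 1 - p

def max_regret_uniform_prior(p):
    """Uniform prior on [0, 1 - p]: the prior mean (1 - p) / 2 > 0, so we adopt.
    Worst case: the true effect is as negative as possible, c = -p,
    so the regret from not having run the experiment is 0 - (-p)."""
    return p

for p in (0.2, 0.5, 0.8):  # arbitrary base rates for illustration
    print(p, max_regret_spike_prior(p), max_regret_uniform_prior(p))
```

As the printout shows, the first experiment has the larger worst-case regret when p is below 1/2 and the second when p is above 1/2, which is the sense in which the comparison depends on p.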

These thoughts came to mind when listening to presentations by Brian Nosek and by Stefano DellaVigna at the BITSS conference this past week. Nosek presented a pilot project called the Lifecycle Journal, which proposes to use various existing specialized research evaluation services to rate the quality of different facets of a study. You can obtain quality ratings for statistical power, fidelity to a pre-analysis plan, and other features. He raised the intriguing possibility of obtaining a quality rating for the experiments that you are proposing to run by eliciting priors about potential effects using the Social Science Prediction Platform (SSPP). The quality ratings could be compiled from these separate services and then serve to establish the credibility of a study in a way that sidesteps the need for peer review.

Later at the meeting, DellaVigna presented some results from the data that the SSPP has collected to date on both priors and experimental outcomes. I asked DellaVigna whether they had considered using the priors to construct ex ante metrics of the value of experiments, and he replied that they had thought about it but hadn’t yet settled on a decision framework for doing so.

These ideas and the analysis above raise the intriguing possibility of designing a methodology for quantitatively measuring the ex ante (that is, prior to knowing any results) value of a proposed experiment. The inputs could be crowdsourced or expert-determined information about payoffs (the treatment adoption payoffs), base rate information from existing data, and crowdsourced or expert-provided priors about effect sizes (like what the SSPP already collects). This could put ratings of the quality of proposed experiments on firmer ground, more like ratings of statistical power than the taste-based judgments that come from peer reviewers’ subjective assessments.

P.S. Sandy Gordon and I also presented at BITSS on a new tool called Data-NoMAD that we developed (with Patrick Su), which lets researchers create a “digital fingerprint” of their data so that third parties can authenticate that the data have not been manipulated. Working paper here at arXiv.

Effect heterogeneity restrictions in studying mechanisms and in mediation analysis

At the APSA meeting this past week I discussed a very nice working paper by Blackwell, Ma, and Opacic on testing potential causal mechanisms. Their paper enumerates the assumptions needed to justify the “intermediate outcome test”—that is, the common practice of estimating treatment effects on a mediator variable (an intermediate outcome) to test whether a proposed causal mechanism is plausible: [arxiv].

Consider the case of a binary mediator. Without an assumption that the treatment effect on the intermediate outcome is monotonic, this test is not necessarily informative. You could estimate a zero average treatment effect on the mediator and yet the mechanism could still be active. A zero average treatment effect on the mediator could mean either that the treatment does not affect the mediator for anyone, or that the treatment has a positive effect for some and a negative effect for others, and these effects cancel out. The issue is apparent in the following expression from Blackwell et al.’s paper for the average natural indirect effect (ANIE, also known as the average causal mediation effect):

![Example Image](https://cyrussamii.com/wp-content/uploads/2024/09/anie.jpg)

where delta(a) is the ANIE, M(a) is the potential mediator value when treatment A=a, Y(a,s) is the potential outcome when treatment A=a and mediator M=s, and rho_10 and rho_01 are the probabilities that the effect of the treatment (A) on the mediator (M) is positive or negative, respectively. This expression is nonparametric and a simple consequence of the law of total probability. The average treatment effect on the mediator is equal to rho_10 – rho_01. You could have rho_10 = rho_01 ≠ 0, in which case the average treatment effect on the mediator would be zero, but the ANIE could still be non-zero.
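Written out, my reading of that expression, in the notation just described (a sketch; see the paper for the exact statement), is:

```latex
\delta(a) = \underbrace{\mathbb{E}\left[ Y(a,1) - Y(a,0) \mid M(1)=1,\, M(0)=0 \right]}_{\text{effect among those whose mediator is moved up}} \rho_{10}
          - \underbrace{\mathbb{E}\left[ Y(a,1) - Y(a,0) \mid M(1)=0,\, M(0)=1 \right]}_{\text{effect among those whose mediator is moved down}} \rho_{01}
```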

Now, the intermediate outcome test is motivated by the intuition that the mediation effect can sometimes be written as the *product* of the effect of the treatment on the mediator and the effect of the mediator on the outcome. From the expression above, you can see that the ANIE can be written as such a product when the conditional effects are equal, i.e., E[Y(a,1) – Y(a,0) | M(1) = 1, M(0) = 0] = E[Y(a,1) – Y(a,0) | M(1) = 0, M(0) = 1] = B, in which case the expression reduces to B*(rho_10 – rho_01). This assumption of effect homogeneity (across groups for which the effect on the mediator is positive or negative) seems pretty strong though, right?
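To make the cancellation concrete, here is a tiny numeric sketch with made-up values (my own, not from the paper):

```python
# Made-up quantities for the decomposition above.
rho_10, rho_01 = 0.3, 0.3   # P(treatment moves mediator 0 -> 1), P(moves it 1 -> 0)
effect_up = 5.0             # E[Y(a,1) - Y(a,0) | M(1)=1, M(0)=0]
effect_down = 2.0           # E[Y(a,1) - Y(a,0) | M(1)=0, M(0)=1]

ate_on_mediator = rho_10 - rho_01                  # 0.0: the intermediate outcome test "fails"
anie = effect_up * rho_10 - effect_down * rho_01   # 0.9: yet the mechanism is active

# Under effect homogeneity (both conditional effects equal to B), the ANIE collapses
# to B * (rho_10 - rho_01), so a zero effect on the mediator does imply a zero ANIE.
B = 5.0
anie_homogeneous = B * (rho_10 - rho_01)           # 0.0

print(ate_on_mediator, anie, anie_homogeneous)
```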

Indeed, it is strong, and at least a conditional-on-covariates version of it is an implication of sequential ignorability as stated in, e.g., Imai et al. (2010):
![Sequential ignorability](https://cyrussamii.com/wp-content/uploads/2024/09/seqig.jpg)
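In case the image does not render, my reading of the assumption, rewritten in the A/M notation above (a sketch; see Imai et al. for the exact statement), is that for all treatment values a and a', mediator values m, and covariate values x:

```latex
\{\, Y(a', m),\, M(a) \,\} \perp\!\!\!\perp A \mid X = x        % (4): treatment is ignorable given covariates
Y(a', m) \perp\!\!\!\perp M(a) \mid A = a,\, X = x              % (5): mediator is ignorable given treatment and covariates
```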
Personally, I had not given much thought to how the second part of the assumption (expression 5), by imposing restrictions across outcome and mediator potential outcomes, implies effect homogeneity across types defined in terms of how the treatment affects mediator values. If we consider the analogy to instrumental variables, this would be like restricting causal effects to be homogeneous across compliers and defiers.

Once one appreciates this implication of sequential ignorability, other things can follow, such as using estimates of conditional effects to identify mediation effects, as in this paper by Fu: [arxiv].

I guess the question is whether we are willing to accept such restrictions on effect heterogeneity in the first place, and if not, whether we are willing to accept other restrictions on effects, such as monotonic effects of the treatment on the mediator. The answer depends on the application, but in any case these papers are important for clarifying what kinds of assumptions you need to defend.