The embedded video links to an Edge talk with Sendhil Mullainathan on the implications of big data for social science. His thoughts come out of research he is doing with computer scientist Jon Kleinberg [website] applying methods for big data to questions in behavioral economics.

Mullainathan focuses on how inference is affected when datasets increase widthwise in the number of features measured, that is, increasing “K” (or “P” for you ML types). The length of the dataset (“N”) is, essentially, just a constraint on how effectively we can work with K. From this vantage point, the big data “revolution” is the fact that we can very cheaply construct datasets that are very wide in K. He proposes that with really big K, such that we have data on “everything,” we can switch to more “inductive” forms of hypothesis testing. That is, we can dump all those features into a machine learning algorithm to produce a rich predictive model for the outcome of interest. Then, we can test an hypothesis about the importance of some variable by examining the extent to which the model relies on that variable for generating predictions.

I see three problems with this approach. First, just like traditional null hypothesis testing, it is geared toward up or down judgments about “significance” rather than parameter (or “effect size”) estimation. That leaves the inductive approach just as vulnerable to fishing, p-hacking, and related problems that occur with current null hypothesis testing.* It also greatly limits what we really learn from an analysis (statistical significance is not substantive significance, and so on). Second, scientific testing is typically some form of causal inference, and yet the inductive-predictive approach that Mullainathan described in his talk is oddly blind to questions of causal identification. (To be fair, it is a point that Mullainathan admits in his talk.) The possibilities of post-treatment bias and bias amplification are two reasons that including more features does not always yield better results when doing causal inference (although bias amplification problems would typically diminish as one approaches having data on “everything”). Thus, without careful attention to post-treatment bias, for example, the addition of features in an analysis can lead you to conclude mistakenly that a variable of interest has no causal effect when in fact it does. The third reason goes along with a point that Daniel Kahneman makes toward the end of the video: the predictive strength of a variable relative to other variables is not an appropriate criterion for testing an hypothesized cause-effect relationship. But the inductive approach that Mullainathan describes would be based, essentially, on measuring relative predictive strength.
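To make the post-treatment bias point concrete, here is a minimal simulation sketch. The data-generating process is entirely hypothetical (it is not from the talk): a randomized treatment affects the outcome only through a mediator, and the names `treatment`, `mediator`, `outcome`, and `ols_coefs` are assumptions made for illustration. Ordinary least squares stands in for whatever model one might actually use.

```python
import numpy as np


def ols_coefs(X, y):
    """Least-squares coefficients for y ~ 1 + X, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (assumed for illustration only):
# a randomized treatment moves the outcome solely through a mediator.
treatment = rng.integers(0, 2, size=n).astype(float)
mediator = 2.0 * treatment + rng.normal(size=n)
outcome = 1.5 * mediator + rng.normal(size=n)  # true total effect: 2.0 * 1.5 = 3.0

# Outcome on treatment alone: recovers the total effect.
coef_without = ols_coefs(treatment.reshape(-1, 1), outcome)[1]

# Adding the post-treatment mediator as a "feature" absorbs the effect.
coef_with = ols_coefs(np.column_stack([treatment, mediator]), outcome)[1]

print(f"treatment coefficient, mediator excluded: {coef_without:.2f}")
print(f"treatment coefficient, mediator included: {coef_with:.2f}")
```

In this setup the treatment coefficient should land near 3 when the mediator is excluded and near 0 when it is included, even though the treatment really does cause the outcome. A model fed “everything” would include the mediator by default.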

Nonetheless, the talk is thought provoking and well worth watching. I also found the comments by Nicholas Christakis toward the end of the talk to be very thoughtful.

*Zach raises a good question about this in the comments below. My reply basically agrees with him.

Zach Jones: This was an interesting talk. I definitely agree with your points about the possibilities of bias. I was a bit surprised by the use to which he put ML, which seemed a bit like causal search (which, admittedly, I know little about). I see some applications of ML as a useful aid to theory building (essentially EDA; I’ve written a paper on this). Clearly it isn’t a substitute for causal inference, which, to me, is the at least implicit goal of most social science.

I don’t understand the connection with NHST, p-hacking, etc. that you mention. Is the issue that the feature selection algorithm could have multiple-comparisons problems? Or is this related to how a variable is judged to be “important” as he talks about it? I mentioned the unbiased variable selection algorithms on twitter (which I am sure you are aware of). There are a variety of ways to calculate effect sizes using various sorts of ML too though, such as permutation importance with random forests.

If you weren’t referring to either of those issues (and I am just being obtuse) then I am just confused.
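For readers unfamiliar with permutation importance, mentioned in the comment above, here is a rough sketch of the idea. It is not the random forest version: to stay dependency-free, a plain least-squares model stands in for the forest, and the simulated data, the `permutation_importance` helper, and its parameters are all assumptions made for illustration. The importance of a feature is measured as the average increase in held-out mean squared error after that feature's column is shuffled.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the outcome.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Hypothetical data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
rng = np.random.default_rng(1)
n = 4_000
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Fit on the first half, evaluate importance on the held-out second half.
X_train, X_test = X[: n // 2], X[n // 2 :]
y_train, y_test = y[: n // 2], y[n // 2 :]
design = np.column_stack([np.ones(len(X_train)), X_train])
coefs, *_ = np.linalg.lstsq(design, y_train, rcond=None)


def predict(X_new):
    """Predictions from the fitted linear model (stand-in for a forest)."""
    return coefs[0] + X_new @ coefs[1:]


importances = permutation_importance(predict, X_test, y_test)
print(importances)
```

The resulting numbers rank features by how much the model leans on them for prediction. As the post argues, that is a statement about predictive reliance, not about causal effect.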

Cyrus (post author): Zach, I think I get what you are saying. Yes, we can use machines to compute effect sizes in the big K setting. (In fact, I do so in work forthcoming here [link].) I was commenting on the hypothesis testing procedure. But then again, I think your question is still relevant. Let’s think of this in general terms. (As a technical aside, I agree with you regarding feature selection algorithms, especially those with good convergence properties like adaptive lasso.) If we use the machine learning tools honestly, feeding in all our features (“everything”) and then letting the machine work its magic, then this removes a lot of the discretion that is the source of p-hacking or fishing. In this way a machine learning approach makes us better scientists by removing some of our discretion. If you watched the video you might find that this perspective is at odds with the comments by Pizarro and Dennett. But I am speaking from a “second best world” perspective (with respect to integrity of scientists), while they are thinking from a first best world. At a meta level, scientists could still hack the whole operation to generate pleasing p-values, but they could also do that to compute pleasing effect sizes.

Johan Ugander: Thanks, Cyrus, for hosting a forum for discussion around this talk. I found it to be a very interesting talk as well, and I have been thinking a lot about it since I watched it a few weeks ago. Recently I had a thought that I wanted to discuss, an issue I had with the inductive/ML approach that Sendhil outlines. In short, I’m not sure his experience studying the disposition effect deals with some of the deeper challenges of truly “big” inductive work.

He points out that when building an ML model to predict profits, based on “everything”, when he hides the disposition effect from the model, the predictive accuracy doesn’t get worse. So he concludes that the disposition effect appears to be a proxy for something else. He also says: “I even hide from it anything to do with the purchase price, so it can’t possibly know anything about the disposition effect”, and it still does no worse. From this he makes claims about the disposition effect.

The big challenge that I think is being swept under the rug here is that, when building models based on “everything”, and seeking to perform a “leave X out” comparison of predictive accuracy, it’s rarely clear how to isolate the family of variables that “know anything” about X. In massive-scale kitchen-sink big data analyses, the full scope of “everything” is enormous. Lots of great machine learning research has shown ways to build models that can appropriately handle “everything” for the purposes of building a maximally predictive model. But with “everything” in play, you typically end up having lots of co-linear variables all over the place: sometimes very strong correlates, sometimes one variable that is really just a slightly different coding of another, or a suite of variables naturally constrained to sum to some constant. Even in my young career, I’ve reviewed too many papers that study co-linear or functionally coupled variables and read the tea-leaves of relative predictive model accuracy as being interpretable as an important relationship.

Sometimes, as in Sendhil’s example, the dependence can perhaps be scoped (“anything to do with purchase price”). But with “everything” data, identifying this scope can be really challenging, if seeking an interpretable understanding of some variable. Most ML approaches don’t really concern themselves with interpretability (or when they do, there is still much work needed), and I think understanding interpretability is a really important step for adapting ML for “big” inductive science. I’d be very curious to hear your thoughts (Cyrus) or the thoughts of any other readers about this issue or known work on it.
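The “leave X out” problem described in this comment can be sketched with a small simulation. Everything here is hypothetical: the variable names (`x`, `x_recoded`, `z`), the coefficients, and the `holdout_r2` and `r2_using` helpers are assumptions for illustration, and a linear model stands in for a richer ML method. The outcome truly depends on `x`, but the data also contain a slightly noisy recoding of `x`.

```python
import numpy as np


def holdout_r2(X_train, y_train, X_test, y_test):
    """Fit least squares on the training half, return R^2 on the test half."""
    def add_const(X):
        return np.column_stack([np.ones(len(X)), X])

    coefs, *_ = np.linalg.lstsq(add_const(X_train), y_train, rcond=None)
    resid = y_test - add_const(X_test) @ coefs
    return 1.0 - np.mean(resid**2) / np.mean((y_test - y_test.mean()) ** 2)


rng = np.random.default_rng(2)
n = 4_000
x = rng.normal(size=n)
features = {
    "x": x,
    "x_recoded": x + 0.1 * rng.normal(size=n),  # near-duplicate coding of x
    "z": rng.normal(size=n),
}
y = 2.0 * features["x"] + features["z"] + rng.normal(size=n)


def r2_using(names):
    """Held-out R^2 for a model restricted to the named features."""
    X = np.column_stack([features[name] for name in names])
    half = n // 2
    return holdout_r2(X[:half], y[:half], X[half:], y[half:])


r2_full = r2_using(["x", "x_recoded", "z"])
r2_drop_x = r2_using(["x_recoded", "z"])  # hide x only
r2_drop_family = r2_using(["z"])          # hide everything that "knows" about x

print(f"full model:            {r2_full:.3f}")
print(f"x hidden:              {r2_drop_x:.3f}")
print(f"x and recoding hidden: {r2_drop_family:.3f}")
```

Hiding `x` alone should barely move held-out accuracy, which a naive reading would take as evidence that `x` does not matter. Only when the whole family is hidden does accuracy fall, and in real “everything” data that family is rarely this easy to enumerate.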

With regard to the bigger picture, I think that the way forward for science here is, as it has always been, a tandem approach between inductive and deductive work. Sendhil’s talk does a really great job of mapping out what the future of that tandem approach may look like and what is changing as data gets “big”.