The embedded video links to an Edge talk with Sendhil Mullainathan on the implications of big data for social science. His thoughts come out of research he is doing with computer scientist Jon Kleinberg [website] applying methods for big data to questions in behavioral economics.

Mullainathan focuses on how inference is affected when datasets increase widthwise in the number of features measured—that is, increasing “K” (or “P” for you ML types). The length of the dataset (“N”) is, essentially, just a constraint on how effectively we can work with K. From this vantage point, the big data “revolution” is the fact that we can very cheaply construct datasets that are very deep in K. He proposes that with really big K, such that we have data on “everything,” we can switch to more “inductive” forms of hypothesis testing. That is, we can dump all those features into a machine learning algorithm to produce a rich predictive model for the outcome of interest. Then, we can test an hypothesis about the importance of some variable by examining the extent to which the model relies on that variable for generating predictions.

I see three problems with this approach. First, just like traditional null hypothesis testing it is geared toward up or down judgments about “significance” rather than parameter (or “effect size”) estimation. That leaves the inductive approach just as vulnerable to fishing, p-hacking, and related problems that occur with current null hypothesis testing.* It is also greatly limits what we really learn from an analysis (statistical significance is not substantive significance, and so on). Second, scientific testing is typically some form of causal inference, and yet the inductive-predictive approach that Mullainathan described in his talk is oddly blind to questions of causal identification. (To be fair, it is a point that Mullainathan admits in his talk.) The possibilities of post-treatment bias and bias amplification are two reasons that including more features does not always yield better results when doing causal inference (although bias amplification problems would typically diminish as one approaches having data on “everything”). Thus, without careful attention to post-treatment bias for example, the addition of features in an analysis can lead you to conclude mistakenly that a variable of interest has no causal effect when in fact it does. The third reason goes along with a point that Daniel Kahneman makes toward the end of the video: the predictive strength of a variable relative to other variables is not an appropriate criterion for testing an hypothesized cause-effect relationship. But, the inductive approach that Mullainathan describes would be based, essentially, on measuring relative predictive strength.

Nonetheless, the talk is thought provoking and well worth watching. I also found the comments by Nicholas Christakis toward the end of the talk to be very thoughtful.

*Zach raises a good question about this in the comments below. My reply basically agrees with him.