Does Correlation Trump Causation?


A new book called Big Data: A Revolution That Will Transform How We Live, Work and Think, written by Viktor Mayer-Schönberger of Oxford and Kenneth Cukier of The Economist, raises some intriguing and provocative issues for data publishers. Among them is this one:

 “…society will need to shed some of its obsession for causality in exchange for simple correlation: not knowing why but only what.”

The underlying thinking, as I understand it, is that Big Data, because it can analyze and yield insight from millions or even billions of data points, is both incredibly powerful and uncannily accurate, in large part because of the massive sample sizes involved.

But are all Big Data insights created equal?

Without a doubt, some insights from Big Data analytics yield useful and low-risk results. If Big Data, for example, were to determine that, from a price perspective, the best time to purchase an airline ticket is 11 days prior to departure, I have both useful information and not a care in the world about causation. Ironically, in this example, Big Data would be used to outsmart airline Big Data analytics that are trying to optimize revenues through variable pricing.

But riding solely on correlation often creates situations where heavy-handed or even ridiculous steps would be necessary to act on Big Data insights. Consider a vexing issue such as alcoholism. What if we learned through Big Data analytics that left-handed males who played tennis and drove red cars had an unusually high propensity to become alcoholics? Correlation identifies the problem, but it doesn’t provide much of a solution. Do we ban alcohol for this entire group? Do we tell left-handed males that they can either play tennis or drive a red car, but not both? Does breaking the correlative pattern actually work to prevent the correlated result? Things can get strange and confusing very quickly when you rely entirely on correlation.
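
To make that last question concrete, here is a toy simulation, offered only as a sketch. Everything in it is an assumption invented for illustration: the hidden factor, the probabilities, and the function names `simulate` and `rate` come from no real study. It imagines a world where an unobserved factor makes people both more likely to drive a red car and more likely to develop the problem, while the car itself has no effect at all.

```python
import random

def simulate(n, ban_red_cars=False, seed=0):
    """Return a list of (drives_red_car, has_problem) pairs for n people.

    All probabilities below are made up for illustration.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        # Hypothetical unobserved factor that drives BOTH trait and outcome.
        hidden = rng.random() < 0.3
        red_car = rng.random() < (0.6 if hidden else 0.1)
        if ban_red_cars:
            # "Break the pattern": nobody is allowed a red car.
            red_car = False
        # The outcome depends only on the hidden factor, never on the car.
        has_problem = rng.random() < (0.4 if hidden else 0.05)
        people.append((red_car, has_problem))
    return people

def rate(people, red_car=None):
    """Share of people with the problem, optionally within one car group."""
    group = [p for r, p in people if red_car is None or r == red_car]
    return sum(group) / len(group)

observed = simulate(100_000)
print(f"red-car drivers:  {rate(observed, red_car=True):.1%}")
print(f"everyone else:    {rate(observed, red_car=False):.1%}")

banned = simulate(100_000, ban_red_cars=True)
print(f"overall, before:  {rate(observed):.1%}")
print(f"overall, banned:  {rate(banned):.1%}")
```

In this made-up world the red-car group shows a much higher rate than everyone else, yet banning red cars leaves the overall rate untouched, because the car was never the cause. Real data might behave differently, and that is exactly what correlation alone cannot tell you.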

Am I calling into question the value of Big Data analytics? Not at all. The ability to powerfully analyze massive data sets will be beneficial to all of us, in many different ways. But to suggest that Big Data correlations can largely supplant causation research plays into the Big Data hype by suggesting it is a pat, “plug and play” solution to all problems. Big Data can very usefully shape and define causal research, but there are numerous situations where it can’t simply replace it.

The lesson here is that while you should embrace Big Data and its big potential, you should also remain objective and ask tough questions to separate Big Data from Big Hype, because lately the two have been tightly correlated.
