Big Data's "theory-free" analysis is statistical malpractice


One of the premises of Big Data is that it can be "theory free": rather than starting with a hypothesis ("men at buffets eat more when women are present," "more people will click this button if I move it here," etc.) and then gathering data to test your guess, you just gather a ton of data and look for patterns in it.


The thing is, patterns emerge in every large dataset, without necessarily being representative of a wider statistical truth. Think of the celebrated rise and fall of Google Flu Trends: researchers examined the 45 search terms that were most prevalent where the flu had spread and concluded that these were predictors of flu, but the predictive power turned out to be an illusion. Every place has 45 top search terms, all the time, and some of them will coincide with flu outbreaks. Without a causal theory that you can test, all you know for sure is that you've found an instance of correlation, and you have no way to know whether that correlation is a coincidence or a newly discovered iron law.
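To make the point concrete, here is a minimal sketch in Python. It is not the Google Flu Trends method and uses no real data: the "flu" series and every "search term" are pure random noise, and the names (`pearson`, `mine_noise`, `n_terms`, `top_k`) are made up for illustration. It mines the first year of noise for the 45 terms that track "flu" most closely, then checks how those same terms do in a second year.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_noise(n_terms=2000, n_weeks=52, top_k=45, seed=0):
    """Pick the top_k noise series most correlated with a noise target.

    Returns the mean absolute correlation of the chosen series in the
    year they were mined from, and in a fresh second year.
    """
    rng = random.Random(seed)
    total = 2 * n_weeks
    # Hypothetical "flu" series: random noise, no real signal at all.
    flu = [rng.gauss(0, 1) for _ in range(total)]
    # Hypothetical "search term" series: also random noise.
    terms = [[rng.gauss(0, 1) for _ in range(total)] for _ in range(n_terms)]

    flu_year1, flu_year2 = flu[:n_weeks], flu[n_weeks:]

    # Theory-free mining: rank terms by how well they track year one.
    ranked = sorted(
        terms,
        key=lambda t: abs(pearson(t[:n_weeks], flu_year1)),
        reverse=True,
    )
    chosen = ranked[:top_k]

    in_sample = [abs(pearson(t[:n_weeks], flu_year1)) for t in chosen]
    out_sample = [abs(pearson(t[n_weeks:], flu_year2)) for t in chosen]
    return sum(in_sample) / top_k, sum(out_sample) / top_k


if __name__ == "__main__":
    mined, fresh = mine_noise()
    print(f"mean |r| in the mined year: {mined:.2f}")
    print(f"mean |r| in a fresh year:   {fresh:.2f}")
```

Because the selection step keeps only the luckiest series, the correlations in the mined year look impressive even though nothing is related to anything, and they shrink back toward chance levels on data the search never saw.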


Writing in Wired, Pomona College economist Gary Smith -- author of books on statistical malpractice like Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics and The AI Delusion -- runs down several examples of how theory-free data-mining got its practitioners into trouble (including a celebrated Cornell professor who was forced to resign after telling his grad student to "Work hard, squeeze some blood out of this rock" by looking for patterns in a data-set about buffet eaters).


Big Data is still a useful tool for statisticians, and can be explored to gain intuition that leads to new hypotheses -- but those hypotheses then need to be investigated with statistical rigor.



Good research begins with a clear idea of what one is looking for and expects to find. Data mining just looks for patterns and inevitably finds some.


The problem has become endemic nowadays because powerful computers are so good at plundering Big Data. Data miners have found correlations between Twitter words or Google search queries and criminal activity, heart attacks, stock prices, election outcomes, Bitcoin prices, and soccer matches. You might think I am making these examples up. I am not.




The Exaggerated Promise of So-Called Unbiased Data Mining [Gary Smith/Wired]



(Image: Big Data: water wordscape, Marius B, CC-BY)