“Big Data” is a fashionable buzzword these days. It refers to the practice at many companies (especially Internet companies) of collecting insanely massive data sets by permanently storing pretty much everything. Google, for example, stores nearly everything anyone ever does on any Google website and any site which uses Google advertising or analytics. That’s a lot of data.
Companies do this not to be creepy (though it certainly is that), but because they believe they can use this massive data set to tease out patterns of user behavior. More data equals more insights, right?
Nassim Taleb published an editorial in Wired a few days ago called “Beware the Big Errors of Big Data.” There are a few problems with the “let’s throw more data at it” approach to analysis:
- First, no data set is perfect. Even Google’s online panopticon is rife with missing data and errors, because it can’t reliably link every recorded action back to the individual who took it. A recent study showed that the great-granddaddies of Big Data, credit bureaus, have significant mistakes (i.e. bad enough to change someone’s credit score) on 20% of records. Any large statistical analysis has to guard against the possibility that its “insights” reflect patterns of systematic errors in the underlying data rather than real patterns of human behavior. This can be subtle and difficult to detect.
- Then there’s the data mining problem. The beauty of statistical analysis of very large data sets is that it lets us test vast numbers of hypotheses to see which of them show a relationship. The problem is that the more relationships you test, the more false positives you get from pure statistical flukes (the sketch below shows how quickly this happens).
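To make the data-mining point concrete, here’s a minimal sketch in Python (using NumPy and SciPy; the sample sizes, the 0.05 threshold, and the variable names are my own illustrative choices, not anything from Taleb’s piece). Every variable is pure random noise, yet a handful of the tested relationships still come back “significant”:

```python
# A minimal sketch of the multiple-comparisons problem: the data are pure
# noise, yet some hypothesis tests look "significant" anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_people = 1_000     # observations per variable
n_variables = 200    # candidate "behaviors" we mine for relationships
alpha = 0.05         # conventional significance threshold

# Every column is independent random noise -- there is no real pattern here.
behaviors = rng.standard_normal((n_people, n_variables))
outcome = rng.standard_normal(n_people)

false_positives = 0
for i in range(n_variables):
    r, p_value = stats.pearsonr(behaviors[:, i], outcome)
    if p_value < alpha:
        false_positives += 1

# With alpha = 0.05, roughly 5% of 200 tests (~10) will "succeed" by chance.
print(f"{false_positives} of {n_variables} tests look significant")
```

With 200 tests you expect about ten flukes; mine a Google-scale data set for millions of candidate relationships and the pile of spurious “discoveries” grows accordingly.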
That’s not to say that Big Data isn’t useful, just that it has its limits. By themselves, large data sets only let us establish patterns of correlation between things: “If A happens, B is also likely to happen.”
Correlation is the weakest possible relationship between things. It doesn’t tell us whether A causes B, whether B causes A, whether A and B are both caused by some other underlying factor C, or whether it’s just a coincidence. Establishing that A causes B requires a different kind of data, not just more of the same: perhaps a randomized trial, or (better yet) a randomized trial with a theory for the underlying mechanism.
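As a quick illustration of that point, here’s a small sketch (Python again; the scenario and coefficients are invented purely for illustration) in which A and B never influence each other, yet they correlate strongly because both are driven by a hidden common factor C:

```python
# A minimal sketch of correlation without causation: A and B never influence
# each other, but both depend on a hidden common cause C, so they correlate.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

c = rng.standard_normal(n)             # hidden common cause
a = 2.0 * c + rng.standard_normal(n)   # A is driven by C, plus noise
b = 1.5 * c + rng.standard_normal(n)   # B is driven by C, plus noise

# Strong positive correlation, even though neither A nor B causes the other.
print(f"corr(A, B) = {np.corrcoef(a, b)[0, 1]:.2f}")
```

Looking at A and B alone, no amount of extra observational data can tell this apart from “A causes B”; an experiment that intervenes on A, such as a randomized trial, can.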
So while Big Data is good, it can only go so far. Be aware of its limits.