Caution: Big Data Ahead

“Big Data” is a fashionable buzzword these days. It refers to the practice at many companies (especially Internet companies) of collecting insanely massive data sets by permanently storing pretty much everything. Google, for example, stores nearly everything anyone ever does on any Google website and on any site which uses Google advertising or analytics. That’s a lot of data.

Companies do this not to be creepy (though it certainly is that), but because they believe they can use this massive data set to tease out patterns of user behavior. More data equals more insights, right?

Nassim Taleb published an editorial in Wired a few days ago called “Beware the Big Errors of Big Data.” There are a few problems with the “let’s throw more data at it” approach to analysis:

  • First, no data set is perfect. Even Google’s online panopticon is rife with missing data and errors, because it can’t perfectly connect actions to the individual who performed them. A recent study showed that the great-granddaddies of Big Data, credit bureaus, have significant mistakes (i.e., mistakes bad enough to change someone’s credit score) on 20% of records. Any large statistical analysis has to be wary that its insights reflect real patterns of human behavior, and not patterns of systematic errors in the underlying data. This can be subtle and difficult to detect.
  • Then there’s the data mining problem. The beauty of statistical analysis of very large data sets is that it lets us test vast quantities of hypotheses to see whether there’s a relationship. The problem is that the more relationships you test, the more false positives you get because of statistical flukes.
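The data mining problem above is easy to quantify. As a rough sketch (assuming independent tests, which real data-mining runs usually aren't): if each test of a nonexistent relationship has a 5% chance of a false positive, then testing many hypotheses makes at least one fluke nearly certain.

```python
# Sketch: probability of at least one false positive when testing
# n true null hypotheses independently at significance level alpha.
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that >=1 of n_tests independent null tests comes up 'significant'."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 100, 1000):
    print(n, round(prob_at_least_one_false_positive(n), 3))
# 1 test:    0.05
# 10 tests:  0.401
# 100 tests: 0.994
```

Test a thousand relationships and a spurious "discovery" is essentially guaranteed, which is why Big Data analyses need corrections for multiple comparisons or out-of-sample validation.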

That’s not to say that Big Data isn’t useful, just that it has its limits. By themselves, large data sets only let us establish patterns of correlation between things: “If A happens, B is also likely to happen.”

Correlation is the weakest possible relationship between things. It doesn’t tell us whether A causes B, whether B causes A, whether A and B are both caused by some other underlying factor C, or whether it’s just a coincidence. Establishing that A causes B requires a different kind of data and not just more of the same data: perhaps a randomized trial, or (better yet) a randomized trial with a theory for the underlying mechanism.

So while Big Data is good, it can only go so far. Be aware of its limits.

Republished with author's permission from original post.

Peter Leppik
Peter U. Leppik is president and CEO of Vocalabs. He founded Vocal Laboratories Inc. in 2001 to apply scientific principles of data collection and analysis to the problem of improving customer service. Leppik has led efforts to measure, compare and publish customer service quality through third party, independent research. At Vocalabs, Leppik has assembled a team of professionals with deep expertise in survey methodology, data communications and data visualization to provide clients with best-in-class tools for improving customer service through real-time customer feedback.

