“Big Data” is a fashionable buzzword these days. It refers to the practice at many companies (especially Internet companies) of collecting insanely massive data sets by permanently storing pretty much everything. Google, for example, stores nearly everything anyone ever does on any Google website and any site which uses Google advertising or analytics. That’s a lot of data.
Companies do this not to be creepy (though it certainly is that), but because they believe they can use this massive data set to tease out patterns of user behavior. More data equals more insights, right?
Nassim Taleb published an editorial in Wired a few days ago called “Beware the Big Errors of Big Data.” There are a few problems with the “let’s throw more data at it” approach to analysis:
- First, no data set is perfect. Even Google’s online panopticon is rife with missing data and errors, because it can’t reliably link every recorded action back to the individual who took it. A recent study showed that the great-granddaddies of Big Data, credit bureaus, have significant mistakes (i.e. bad enough to change someone’s credit score) on 20% of records. Any large statistical analysis has to guard against the possibility that its “insights” reflect patterns of systematic errors in the underlying data rather than real patterns of human behavior. This can be subtle and difficult to detect.
- Then there’s the data mining problem. The beauty of statistical analysis of very large data sets is that it lets us test vast numbers of hypotheses to see which of them show a relationship. The problem is that the more relationships you test, the more false positives you get from pure statistical flukes (the sketch below shows how quickly this happens).
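To make the data-mining point concrete, here’s a minimal sketch in Python (using NumPy and SciPy; the sample sizes, the 0.05 threshold, and the variable names are my own illustrative choices, not anything from Taleb’s piece). Every variable is pure random noise, yet a handful of the tested relationships still come back “significant”:

```python
# A minimal sketch of the multiple-comparisons problem: the data are pure
# noise, yet some hypothesis tests look "significant" anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_people = 1_000     # observations per variable
n_variables = 200    # candidate "behaviors" we mine for relationships
alpha = 0.05         # conventional significance threshold

# Every column is independent random noise -- there is no real pattern here.
behaviors = rng.standard_normal((n_people, n_variables))
outcome = rng.standard_normal(n_people)

false_positives = 0
for i in range(n_variables):
    r, p_value = stats.pearsonr(behaviors[:, i], outcome)
    if p_value < alpha:
        false_positives += 1

# With alpha = 0.05, roughly 5% of 200 tests (~10) will "succeed" by chance.
print(f"{false_positives} of {n_variables} tests look significant")
```

With 200 tests you expect about ten flukes; mine a Google-scale data set for millions of candidate relationships and the pile of spurious “discoveries” grows accordingly.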
That’s not to say that Big Data isn’t useful, just that it has its limits. By themselves, large data sets only let us establish patterns of correlation between things: “If A happens, B is also likely to happen.”
Correlation is the weakest possible relationship between things. It doesn’t tell us whether A causes B, whether B causes A, whether A and B are both caused by some other underlying factor C, or whether it’s just a coincidence. Establishing that A causes B requires a different kind of data, not just more of the same: perhaps a randomized trial, or (better yet) a randomized trial with a theory for the underlying mechanism.
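As a quick illustration of that point, here’s a small sketch (Python again; the scenario and coefficients are invented purely for illustration) in which A and B never influence each other, yet they correlate strongly because both are driven by a hidden common factor C:

```python
# A minimal sketch of correlation without causation: A and B never influence
# each other, but both depend on a hidden common cause C, so they correlate.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

c = rng.standard_normal(n)             # hidden common cause
a = 2.0 * c + rng.standard_normal(n)   # A is driven by C, plus noise
b = 1.5 * c + rng.standard_normal(n)   # B is driven by C, plus noise

# Strong positive correlation, even though neither A nor B causes the other.
print(f"corr(A, B) = {np.corrcoef(a, b)[0, 1]:.2f}")
```

Looking at A and B alone, no amount of extra observational data can tell this apart from “A causes B”; an experiment that intervenes on A, such as a randomized trial, can.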
So while Big Data is good, it can only go so far. Be aware of its limits.