This article will make you feel better. And you do need to feel better if you are one of the many who practice analytics (or who must consume and rely on analytics) and find yourself carrying tension in your shoulders or sometimes losing sleep.
The fear stems from a well-known warning of tragic mishap: “If you torture the data long enough, it will confess,” as stated by University of Chicago economics professor Ronald Coase. There is a general sense that math could be wrong and that analytics is an art.
As John Elder of Elder Research put it, “It’s always possible to get lucky (or unlucky). When you mine data and find something, is it real, or chance?” How can we confidently trust what a computer claims to have learned?
The in-depth technical paper, “Are Orange Cars Really not Lemons?” by Ben Bullard and John Elder, of Elder Research, Inc., shares this case:
A recent article in The Seattle Times, reported that “an orange used car is least likely to be a lemon.” This discovery surfaced in a competition hosted by Kaggle to predict bad buys among used cars using a labeled dataset. Of the 72,983 used cars, 8,976 were bad buys (12.3%). Yet, of the 415 orange cars in the dataset, only 34 were bad (8.2%)…
[But] what we see is that statistical hypothesis tests only work when the hypothesis comes first, and the analysis second. One cannot use the data to inform the hypothesis and then test that hypothesis on the same data. That leads to overfit and over-confidence in your results, which leads to the model underperforming (or failing entirely) on new data, where it is most needed.
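To see why the order of hypothesis and analysis matters, here is a rough sketch using only the figures quoted above. It first computes a naive p-value as if "orange" had been singled out before anyone looked at the data, and then asks how likely it is that at least one color would look that good by luck when several colors are examined. The number of colors (`NUM_COLORS = 16`) is a hypothetical value chosen for illustration, the adjustment treats the per-color tests as independent (a simplification), and this is not the method used in the Bullard and Elder paper.

```python
import math

TOTAL_CARS, TOTAL_BAD = 72_983, 8_976   # figures quoted above
ORANGE_CARS, ORANGE_BAD = 415, 34       # figures quoted above
NUM_COLORS = 16  # ASSUMPTION: hypothetical number of colors examined


def binomial_lower_tail(n, k, p):
    """P(X <= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def chance_of_any_hit(p_single, num_tests):
    """Chance that at least one of num_tests independent tests reaches
    p_single by luck alone (a Sidak-style adjustment)."""
    return 1 - (1 - p_single) ** num_tests


base_rate = TOTAL_BAD / TOTAL_CARS
# Valid only if "orange" was the hypothesis before looking at the data.
naive_p = binomial_lower_tail(ORANGE_CARS, ORANGE_BAD, base_rate)
# Closer to the real situation: orange was picked after scanning all colors.
searched_p = chance_of_any_hit(naive_p, NUM_COLORS)

print(f"base rate of bad buys:  {base_rate:.1%}")
print(f"orange bad-buy rate:    {ORANGE_BAD / ORANGE_CARS:.1%}")
print(f"naive p-value:          {naive_p:.4f}")
print(f"after searching {NUM_COLORS} colors: {searched_p:.4f}")
```

The second number is always larger than the first, and it grows with every additional color (or any other variable) that was examined along the way.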
How do we avert the dire declension, “Lies, damned lies, and statistics”? There is a simple, elegant solution from Elder—but first, let me further magnify your fear: Even the very simplest predictive model risks utter failure. Mistaken, misleading conclusions are in fact terribly easy to come by.
A conclusion drawn about one single variable—even without the use of a common multivariate model (such as log-linear regression)—can go awry. In fact, one of the more famous such analytical insights, “an orange used car is least likely to be a lemon,” has recently been debunked by Elder and his colleague Ben Bullard at Elder Research, Inc.
Big data, with all its pomp and circumstance, can actually mean big risk. More data can present more opportunities to inadvertently discover untrue patterns that appear misleadingly strong within your dataset but do not, in fact, hold true in general. To be more specific, “bigger” data could mean longer data (a longer list of examples, which generally helps avert spurious conclusions), but it could also mean wider data (more columns, that is, more variables or factors per example). With wider data, even if you are only considering one variable at a time, such as the color of each car, you are more likely to come across one that just happens to look predictive in your data by sheer chance alone. John Elder has dubbed this peril, which arises when searching across many variables, “vast search.”
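A small simulation can make the wide-data risk concrete. In the sketch below, every column is pure noise with no relationship to the target, yet the strongest correlation found keeps creeping upward as more columns are searched. The row count, the 12 percent base rate, and the column counts are hypothetical values chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / (var_x * var_y) ** 0.5


def strongest_noise_correlation(n_rows, widths, seed=0):
    """Search ever-wider pure-noise data; return {width: best |r| so far}."""
    rng = random.Random(seed)
    # Binary target, e.g. 1 = bad buy (hypothetical 12% base rate).
    target = [1 if rng.random() < 0.12 else 0 for _ in range(n_rows)]
    best, results = 0.0, {}
    for col in range(1, max(widths) + 1):
        noise = [rng.random() for _ in range(n_rows)]  # unrelated to target
        best = max(best, abs(pearson(noise, target)))
        if col in widths:
            results[col] = best
    return results


results = strongest_noise_correlation(n_rows=500, widths=(1, 10, 100))
for width, best in results.items():
    print(f"{width:>4} noise columns searched: strongest |r| = {best:.3f}")
```

Because the wider searches include the narrower ones, the best score can only stay the same or rise as columns are added, which is exactly why the single best result of a vast search deserves extra suspicion.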
Dr. Elder puts it this way: “Modern predictive analytic algorithms are hypothesis-generating machines, capable of testing millions of ‘ideas.’ The best result stumbled upon in its vast search has a much greater chance of being spurious… The problem is so widespread that it is the chief reason for a crisis in experimental science, where most journal results have been discovered to resist replication; that is, to be wrong!”
A few years ago, Berkeley Professor David Leinweber made waves with his discovery that the annual closing price of the S&P 500 stock market index could have been predicted from 1983 to 1993 by the rate of butter production in Bangladesh. Bangladesh’s butter production mathematically explains 75 percent of the index’s variation over that time. Urgent calls were placed to the Credibility Police, since it certainly cannot be believed that Bangladesh’s butter is closely tied to the U.S. stock market. If its butter production boomed or went bust in any given year, how could it be reasonable to assume that U.S. stocks would follow suit? This stirred up the greatest fears of predictive analytics (PA) skeptics, and vindicated nonbelievers. Eyebrows were raised so vigorously, they catapulted Professor Leinweber onto national television.
Crackpot or legitimate educator? It turns out Leinweber had contrived this analysis as a playful publicity stunt, within a chapter entitled “Stupid Data Miner Tricks” in his book Nerds on Wall Street. His analysis was designed to highlight a common misstep by exaggerating it. It’s dangerously easy to find ridiculous correlations, especially when you’re “predicting” only 11 data points (annual index closings for 1983 to 1993). Search through a large number of financial indicators across many countries, and something or other will show similar trends, just by chance. Keep searching and you will eventually unearth cockamamie relationships. For example, shiver me timbers, a related study showed buried treasure discoveries in England and Wales predicted the Dow Jones Industrial Average a full year ahead from 1992 to 2002.
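The same trap is easy to reproduce. The sketch below stands in for Leinweber's stunt with made-up data: an "index" of 11 annual values and a few thousand unrelated candidate "indicators," all generated as random walks. Using random walks is an assumption made here because financial series tend to drift, and the candidate count of 2,000 is hypothetical. None of this is Leinweber's actual data or procedure.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def random_walk(n_points, rng):
    """A drifting series: each value is the previous one plus random noise."""
    level, path = 0.0, []
    for _ in range(n_points):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def spurious_r2(n_points=11, n_candidates=2000, seed=1):
    """Return (R^2 of the first candidate, best R^2 over all candidates)."""
    rng = random.Random(seed)
    index = random_walk(n_points, rng)  # stand-in for annual index closings
    r2_values = [
        pearson(random_walk(n_points, rng), index) ** 2
        for _ in range(n_candidates)
    ]
    return r2_values[0], max(r2_values)


first_r2, best_r2 = spurious_r2()
print(f"R^2 of one indicator chosen in advance: {first_r2:.2f}")
print(f"best R^2 after searching 2,000 of them: {best_r2:.2f}")
```

With so few data points and so many candidates, the best "explanation" of the index comes from a series that has nothing to do with it.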
Leinweber attracted the attention he sought, but his lesson didn’t seem to sink in. “I got calls for years asking me what the current butter business in Bangladesh was looking like and I kept saying, ‘Ya know, it was a joke, it was a joke!’ It’s scary how few people actually get that.” As Black Swan author Nassim Taleb put it in his suitably titled book, Fooled by Randomness, “Nowhere is the problem of induction more relevant than in the world of trading—and nowhere has it been as ignored!” Thus the occasional overzealous yet earnest public claim of economic prediction based on factors like women’s hemlines, men’s necktie width, Super Bowl results, and Christmas day snowfall in Boston.
The culprit that kills machine learning is overlearning (aka overfitting). Overlearning is the pitfall of mistaking noise for information, assuming too much about what has been shown within data. You’ve overlearned if you’ve read too much into the numbers, led astray from discovering the underlying truth.
While many analytics practitioners consider overlearning a risk with predictive models that combine multiple variables, the truth is that even well-publicized single-variable results are at risk. A dire need for a new paradigm has emerged.
But is it really that hard? Why would analysts now assert that standard tests of statistical significance break down when vast search is in play?
And what can be done to validate (i.e., test for significance) even after vast search has claimed to have made a discovery?
Dr. Elder will be covering the topic with his presentation, “The Peril of Vast Search (and How Target Shuffling Can Save Science)” at Predictive Analytics World events this year in San Francisco, Toronto, Chicago, Washington DC, Boston and London.
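Target shuffling, the technique named in that talk title, is commonly described as a form of permutation test: shuffle the target column so that any real relationship is broken, rerun the entire search, record the best result, and repeat many times. The share of shuffles whose best result matches or beats the real one estimates how often luck alone could have produced the discovery. What follows is a minimal sketch under that description, not Dr. Elder's implementation. The data, the helper names, and the choice of absolute correlation as the score are all hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_score(features, target):
    """The 'search': strongest absolute correlation over every feature."""
    return max(abs(pearson(col, target)) for col in features)


def target_shuffling_p_value(features, target, n_shuffles=100, seed=0):
    """Share of shuffled-target searches that do at least as well as the real one."""
    rng = random.Random(seed)
    real_best = best_score(features, target)
    shuffled = list(target)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any true link between features and target
        if best_score(features, shuffled) >= real_best:
            at_least_as_good += 1
    # The +1 terms count the real result itself, so the estimate is never zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)


# Hypothetical data: 20 pure-noise features, plus one genuinely related feature.
rng = random.Random(42)
n_rows = 200
target = [rng.gauss(0, 1) for _ in range(n_rows)]
noise_features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(20)]
real_feature = [t + rng.gauss(0, 0.5) for t in target]

p_noise_only = target_shuffling_p_value(noise_features, target)
p_with_signal = target_shuffling_p_value(noise_features + [real_feature], target)
print(f"best of 20 noise features:  p = {p_noise_only:.3f}")
print(f"with a real feature added:  p = {p_with_signal:.3f}")
```

The key design choice is that each shuffle repeats the whole search, not just a test of the winning variable, so the resulting estimate already accounts for how many ideas were tried.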
Portions adapted with permission of the publisher, Wiley, from Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (February 2013) by Eric Siegel, PhD, the founder of Predictive Analytics World.