Nailing 5 Theses to the Door of the Big Data Church



Last Friday I took part in the DAA’s latest version of Ask Anything where DAA Members can send in questions and the designated responder (me for the day) does his or her best to say something sensible in return. At the end of the day, Jim Sterne sent in a final question on the role of the analytics warehouse and machine learning that I simply couldn’t resist expanding into full essay form. If you’re confused about the role of machine learning in the warehouse or suspicious of the claims of technology vendors touting the benefits of massive correlation to discover the keys to your business, read on!

Since technically I’m off-duty for Ask Anything (Friday is so yesterday. I’m parked at a Starbucks early on Saturday morning while my youngest daughter takes some bizarre High-School entrance exam called the SSAT – just guessing, but does the first S stand for Scam?) I thought I’d take a little more time and expand the answer to this last question into a full-on blog and post outside the forum as well. I hope that doesn’t violate any DAA contracts or anything – I’d hate to have some data assassin tracking me by my digital exhaust and switching all my 1’s to 0’s…

For those not consuming the entire DAA Ask Anything thread, here’s Jim’s question:

Have we reached a point / have you seen anybody be successful / have you seen anybody really try to build a data lake of customer information and use machine learning to derive correlations from it? All I’m looking for is a tool and a data set that will say:

It turns out people buy more stuff online when it snows. Is this correlation

[X]  Worthless? [  ]  Curious? [  ]  Interesting? [  ]  Fascinating? [  ]  Actionable ?

The data suggests that sending email #256b after customers of type 83h have viewed product 87254 results in a 298.4% increase in sales. Is this correlation

[  ]  Worthless? [  ]  Curious? [  ]  Interesting? [  ]  Fascinating? [X]  Actionable ?

Are we getting any closer to turning that corner??

It’s a great ask and gets to what is perhaps the central issue around analytics and analytics technology in the enterprise; namely, what’s the role of the analytics warehouse and how seriously should we take the claims of big data machine learning advocates?

I’ll start with a short answer to Jim’s first and most direct question and then I’m going to expand on his two sub-questions to cast some light on the nature of that answer.

Have I seen clients build a data lake and get value out of it? Unequivocally yes. Our best clients – the ones really getting value out of digital analytics – are nearly all using some form of advanced analytics warehouse / data lake and doing so very successfully.

Does that analytic value come from machine learning? Rarely. The vast majority of analytic value has come from traditional statistical techniques or from straight algorithmic selections (programmatic or SQL access to the detail data). I actually believe there is much value to be had from non-traditional techniques, but that’s more my theory than proven field fact, and I am absolutely not a big fan of the massive correlation approach that I see most commonly advocated by machine learning folks (though I don’t want to paint with too broad a brush here – there are lots of different flavors of machine learning and the term itself is a bit ambiguous).

So let’s tackle your examples in more detail:

It turns out people buy more stuff online when it snows. Is this correlation

[X]  Worthless? [  ]  Curious? [  ]  Interesting? [  ]  Fascinating? [  ]  Actionable ?

I’m going to disagree with Jim’s selection here (at least partly – though later I’m going to agree with what I take to be his broader point).

Weather, as it happens, is often a highly predictive and essential variable when it comes to retail models. In Mrs. Fields’s store baking models, for example, weather was the single biggest variable factor. It has a huge impact on whether people will buy a warm cookie or not. It also has a big impact on whether people will shop in store and, as it happens, online.

And weather impacts aren’t limited to retail where people buy online because they can’t get to the store and are going stir-crazy.

I remember from personal experience an interesting case where we were analyzing the PPC campaigns for an Internet site focused on real-time traffic. When we did the analysis, we found (big surprise) that giant storms in the Northeast drove massive increases in site traffic. That may seem obvious. No, that is obvious. But here’s the thing: they weren’t regulating their PPC buys that way. They had a simple fixed daily budget. That meant that on beautiful summer days they were spending the same amount as on blizzard days. Their budgets for PPC and their daily caps were keeping them from expanding their buys in December, so they were simply losing out on the opportunity to capture more (and, by our measurement, more engaged and valuable) customers. We found other important, local effects (like closures) and, by shifting their buying model to something more local, were able to dramatically improve their overall performance in PPC. Actionable.
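
To make the contrast concrete, here is a minimal sketch of a fixed daily cap next to a weather-scaled one. Everything in it is an illustrative assumption rather than the model we actually built: the `DayForecast` fields, the 10%-per-centimeter rule and the 3x ceiling are all hypothetical, and a real version would be fit to your own traffic data.

```python
from dataclasses import dataclass


@dataclass
class DayForecast:
    """Hypothetical forecast record for one region on one day."""
    region: str
    snowfall_cm: float


def fixed_budget(base_daily_budget: float, forecast: DayForecast) -> float:
    """The fixed-cap approach: the same spend whatever the weather."""
    return base_daily_budget


def weather_adjusted_budget(
    base_daily_budget: float,
    forecast: DayForecast,
    uplift_per_cm: float = 0.10,   # assumed: +10% of base per cm of snow
    max_multiplier: float = 3.0,   # assumed: never spend more than 3x base
) -> float:
    """Scale the daily cap up with forecast snowfall, within a ceiling."""
    multiplier = min(1.0 + uplift_per_cm * forecast.snowfall_cm, max_multiplier)
    return round(base_daily_budget * multiplier, 2)


if __name__ == "__main__":
    for day in (DayForecast("Boston", 0.0), DayForecast("Boston", 15.0)):
        print(day, fixed_budget(1000.0, day), weather_adjusted_budget(1000.0, day))
```

Keying the forecast by region is the part that matters: the lift came from local effects, so a single national multiplier would miss most of it.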

We often find retailers’ PPC vendors ignoring weather – and that’s almost always a BAD idea.

Weather matters in all sorts of places and around all sorts of use cases. I’d make a small wager that folks are less likely to shop for life insurance or 529’s on beautiful Saturday afternoons than rainy dreary ones – and that may matter when it comes to thinking about when I drop (and pay for) a display ad.

Understanding when people do something has real meaning in digital (and, even more, outside digital), and weather is an important part of when they do something.

So I’m going to disagree with Jim’s immediate answer (Worthless). Finding out that people buy more stuff online when it snows (they do!) is fascinating and actionable – not least in allowing you to amp up your PPC buys to capture mostly offline customers (driven to online by being snow-bound and open to capture by new brands) of your competitors who aren’t thinking about weather.

On the other hand, I’m on board with what I take Jim’s deeper point to be (and sorry for hijacking the point into a bunch of “weather matters” examples). When Mrs. Fields built their model, when we did our PPC analysis or built our utilities model, we didn’t use machine learning techniques (narrowly defined as massive, undirected analysis of variables to discover important relationships) to happen on weather as an important variable. We knew it was at least likely to be significant and we modeled it the old-fashioned way.

The real question is how many important variables are there that analysts don’t know about and is it worth randomly assembling data to find them? I’m very, very, very skeptical about this. It’s true that analysts don’t always understand the business they’re modeling very well. To solve this problem, you can:

  1. Hire analysts who understand your business
  2. Train your analysts in the business via temporary immersion
  3. Track down every conceivable exogenous data source and use machine learning

If you picked 3, you’re probably a salesperson for a technology vendor or a data science consultancy. Can unexpected correlations and important variables sometimes be discovered? Of course. But most businesses actually have a pretty decent understanding of the key factors driving performance even if they can’t describe exactly how those key factors relate or interact. When that’s the case, massive correlation is just a big, truly massive, very impressive waste of time.

Here are some thoughts that seem to me so basic and obvious that I’m almost embarrassed to write them down except that I often meet people who don’t seem to grasp them:

  1. There is an infinite (truly) set of possible external data points – so machine learning is always guided – it’s just slightly less guided than when we were limited to 50 variables instead of 5000.
  2. Most of the work in analysis isn’t in doing correlations. It’s in assembling, cleaning and lining up the data and then in making sense of what possible correlations mean. Discovering correlations in the data via massive analysis of variables doesn’t really save that much time.
  3. Data points often don’t line up in ways that make it easy or even practical to auto-correlate them. It can take a truly massive amount of work to align certain kinds of operational data with marketing data in an intelligible and potentially interesting manner. Machine learning does NOT help with this and is dependent on this exercise to be successful with these data types.
  4. In the vast majority of cases, there just aren’t hidden variables that no one ever thought of that are somehow the secret key to your business. This kind of “grail” thinking is pervasive in all walks of life and it’s amusing but disheartening to see that analysts are as susceptible as everyone else to this type of delusion. There are plenty of consultancies who will gladly waste your money chasing that dream, but you would be much better off investing in lottery tickets or hiring an astute court astrologer.
  5. It’s much more likely that there ARE important variables that everybody knows exist but nobody has access to or can measure. Figuring these out is far, far more important than hunting for unlikely correlations in the data.
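
If the worry about massive correlation sounds abstract, a small simulation makes it tangible. The sketch below is my own illustration and not anyone's production method: it correlates a pure-noise "sales" series against 50 and then 5000 equally meaningless variables and counts how many clear a threshold by chance alone. The function names, the 50 observations and the 0.3 cutoff are all assumptions picked for the demonstration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def count_chance_correlations(n_variables, n_observations, threshold, seed=0):
    """Count noise variables whose |r| with a noise target meets the threshold.

    Every series is independent Gaussian noise, so each 'hit' is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    hits = 0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_observations)]
        if abs(pearson(candidate, target)) >= threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    for n_variables in (50, 5000):
        hits = count_chance_correlations(n_variables, n_observations=50, threshold=0.3)
        print(f"{n_variables} noise variables -> {hits} 'discoveries'")
```

None of those "discoveries" means anything, and someone still has to chase each one down. That is the cost the correlation step never shows you.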

I feel a bit like Martin Luther nailing these five (90 short of Martin) theses to the door of the big data church. To me they seem so obvious that it’s hard to understand how they could possibly be controversial.

Which brings me to Jim’s second sub-question and one that I think we can now handle quite quickly because we are in complete agreement:

The data suggests that sending email #256b after customers of type 83h have viewed product 87254 results in a 298.4% increase in sales. Is this correlation

[  ]  Worthless? [  ]  Curious? [  ]  Interesting? [  ]  Fascinating? [X]  Actionable ?

Yes, clearly right. And by putting these two examples forward, I assume Jim means to sneakily suggest that most of the value in analytics comes from very unsurprising places. We are always charmed to hear stories of sudden analytic insight, swift brilliance and amusing and unexpected correlations. But real business analytics adds value mostly by delving in patient and disciplined detail into what we think is probably true.

It’s the difference between the real, day-to-day practice of science and the “genius” models that dominate public imagination. I’m not a big believer in the genius models, even for true genius. Mostly, I suspect it’s a lot more work than people like to think.

But if I’m not confident how genius works, I am sure that genius is not a strategy.

If you want to build an effective analytics team, the right strategy is to focus your attention on the problems you know matter and the data you think is probably important. Know your business? Absolutely always. Get the data you think you need? Definitely. Massive correlations of stuff that you doubt makes a difference? Occasionally…maybe.


Here’s the link (you must – and should – be a member) to the DAA thread

Republished with author's permission from original post.

Gary Angel
Gary is the CEO of Digital Mortar. DM is the leading platform for in-store customer journey analytics. It provides near real-time reporting and analysis of how stores performed including full in-store funnel analysis, segmented customer journey analysis, staff evaluation and optimization, and compliance reporting. Prior to founding Digital Mortar, Gary led Ernst & Young's Digital Analytics practice. His previous company, Semphonic, was acquired by EY in 2013.

