The Limits of Machine Analysis



Since I’m about to find fault with one small piece of it, I want to emphasize how much I enjoyed the presentation by Ryan Caplan, President of ColdLight Solutions. Not only did I appreciate the relaxed style, but I found the content unusually thought-provoking. What I at first thought might be another reprise of “what makes a good analyst” turned out to be a subtle approach to the question of the role of the analyst in the world of big data, machine learning and massive integration.

There are many different paths to analysis, but probably the classic path is to begin with a question, pick a method of analysis, and then determine which variables to use/assess within that method. It’s a perfectly good procedure and it works just fine. As Ryan pointed out, however, it has problems of scale. It turns out that our big-data isn’t just long, it’s wide too. As we integrate more data points and more sources, the number of variables available for analysis grows and can quickly exceed the ability of any analyst or team to use in this traditional iterative fashion.

Fortunately, high-performance systems serve double duty for us here as well. They can work on lots of data (length) and they can process lots of columns (breadth) and figure out which ones matter. There is, in fact, a set of data analysis techniques specifically designed to look at the complex inter-relationships of many variables in an efficient manner. With these techniques, you can analyze hundreds or even thousands of independent variables to develop a model. An analyst can’t do this iteratively.

So the advent of high-performance analysis systems offers a significant variation on the traditional role of analyst. With standardized analysis methods and automated model-building across very large numbers of variables, the analyst is left with only “question generation” as a key task. Even that function may be partially subsumed in an automated, data-driven analysis. By finding unexpected relationships and influence patterns between variables, this type of analysis can actually generate questions (why does variable X correlate to variable Y?) – changing the analyst’s role from someone who starts with a question and works toward an answer to someone who starts with an answer and works toward an explanation.

I’m a believer in the technologies and methods involved here. We’ve been doing neural network analysis and segmentation at Semphonic since our inception. There are, however, some important limitations on machine-learning and automated analysis that I think can be missed or poorly understood and I wanted to talk about that in light of one of the examples that Ryan gave.

I wasn’t taking notes during the presentations, so I’m going to do my best to reconstruct this from memory. If I get it wrong, I’m thinking Ryan will probably correct me!

Comcast was looking to understand the performance of movies in Video on Demand (VOD). In particular, they were testing a hypothesis that box office performance would be the key correlate to VOD performance. Traditionally, an analyst would have tested this hypothesis directly. However, by running the analysis using these techniques for handling large numbers of variables, it was discovered that the first letter of the movie’s name was quite significant – and that starting with “A” was fairly predictive of VOD success.

It should take this audience only a moment to realize why that relationship exists: the VOD is a guided system organized alphabetically. In effect, you get a significant nudge since “A” is the first category of movies and the first set of movies you’ll see. It’s a common and well-understood UI phenomenon of the sort I described in this post on the intellectual foundation of Web analytics.

On the other hand, it’s the type of relationship that an analyst might easily miss and an automated data analysis system manage to call out.

Or is it? What bothered me about the example is the somewhat implicit representation that these types of analysis systems will identify ANY pattern in the data – even the first letter of a movie name. That isn’t really the case and it’s important to understand why it isn’t the case if you’re really going to think about the role of the analyst when working with this type of technology.

Because here’s the thing: it’s impossible for any system to generate and check every possible pattern in the data.

Don’t believe me? Then let’s consider just the variable “Movie Name”. Let’s assume it’s a sixty character string data field. Now consider that a pattern inside that field might involve the first character, but it might also involve the first two characters. Perhaps movies that begin with AZ perform better than movies that begin with AB. Or perhaps the pattern involves the first three characters. Or the first and third character. And so on. It should quickly be apparent that finding all the possible relationships for a single field Movie Name would involve something like 6 vigintillion (look it up) different combinations. Even God does not have time for this.

And, of course, this represents only a small fraction (!) of the possible patterns within a single field. It might be that vowels do better than consonants and the movies with many vowels in their title do better than otherwise expected. Or perhaps personal pronouns or soft phonemes or movies with colors in their title do better. The meta-data possibilities are, quite literally, infinite. Exhaustive examination of the potential patterns is impossible. Not difficult. Impossible.
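To make the scale concrete, here is a rough sketch of the counting argument. The exact figure depends on how you define a "pattern", so this is not the calculation behind the number quoted above, just one way to count under stated assumptions: a 60-character field, a 26-letter alphabet (an assumption; real titles also contain digits, spaces and punctuation), and patterns that fix a specific letter at some chosen set of positions.

```python
from math import comb

FIELD_LENGTH = 60   # characters in the Movie Name field, as in the text
ALPHABET = 26       # assumption: letters only, ignoring digits and punctuation

# Ways to choose WHICH positions a pattern looks at
# (first character, first and third, and so on): any non-empty subset.
position_subsets = 2 ** FIELD_LENGTH - 1

# Patterns that also fix a specific letter at each chosen position:
# sum over k of C(60, k) * 26**k, which equals 27**60 - 1.
letter_patterns = sum(
    comb(FIELD_LENGTH, k) * ALPHABET ** k
    for k in range(1, FIELD_LENGTH + 1)
)

print(f"position subsets: {position_subsets:.3e}")
print(f"letter patterns:  {letter_patterns:.3e}")
```

Even the first number, which ignores the letters entirely, is more than a billion billion. Whichever convention you prefer, the count is far beyond anything that can be enumerated and tested.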

So here’s the thing: somebody created something that created a variable (perhaps an order of presentation or just a row order that happened to be in the data) that happened to call out the significance of the first character of the Movie Name. No analytics system that has ever existed or will ever exist discovered it by brute force.
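In practice, "creating a variable" looks something like the sketch below: a person decides which properties of the title might matter and writes them down as explicit columns. The function name, the feature names and the list of color words are all hypothetical, chosen only to mirror the examples mentioned above (first letter, vowels, colors in the title); the sample title is made up.

```python
# Illustrative list only; a real analysis would need a deliberate choice here.
COLOR_WORDS = {"red", "blue", "green", "black", "white", "gold"}


def title_features(title: str) -> dict:
    """Turn a raw title string into a few analyst-chosen variables."""
    lowered = title.lower()
    letters = [c for c in lowered if c.isalpha()]
    vowels = sum(c in "aeiou" for c in letters)
    words = [w.strip(".,:!?'\"") for w in lowered.split()]
    return {
        # The variable from the VOD example: first character of the name.
        "first_letter": title.strip()[:1].upper(),
        # Share of letters that are vowels (0.0 for a title with no letters).
        "vowel_share": vowels / len(letters) if letters else 0.0,
        # Whether any word in the title is one of the listed colors.
        "has_color_word": any(w in COLOR_WORDS for w in words),
    }


print(title_features("A Blue Example"))
```

Each key in that dictionary is a decision somebody made. A model can weigh these three columns against box office performance, but it can only weigh the ones that were handed to it.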

Nor is this a trivial example or a calling out of arcana. While systems capable of massive variable analysis change the role of the analyst when it comes to identifying variables for inclusion, they don’t really remove that function and, in some respects, they make it harder. In the old days, we would have just tested the obvious variable, Box Office Performance, and been done. Now, we have to comb through our vast reams of data to figure out what should be included. As my example above shows, it’s impossible to pick everything. Sure, you can pick every single field you have available to you, but that’s nothing like EVERYTHING you have available to you.

Think I’m stretching the point?

Here’s a real world example of my own.

A few years back we built a customer segmentation for an online travel aggregator.

The data points we got from them were, by customer, search and transaction detail data. We had the date and time of every search and transaction, the type of search and transaction, the DMA location of the searcher and the search itself, the dollar amounts involved, and the dates of any stay or trip leg.

As we built the segmentation, however, we found that nearly every variable that was interesting and predictive turned out to be a form of meta-data or transformation on the existing fields. The date and time of a search was largely meaningless until it was subtracted from the date and time of the first trip leg to yield a variable called “Days till Search.” That variable didn’t exist in the initial set. The location of the searcher wasn’t too interesting until we paired it with the location of the search, geo-located both, and derived a distance between the two for a variable called “Distance of Trip.” The Date and Time of each Trip leg weren’t very interesting until, you guessed it, we subtracted the first from the last to get a “Trip Duration.” The location of the search itself wasn’t too interesting till we categorized destinations by “Business” or “Leisure”. The dates of the trip selected weren’t too interesting till we grouped searches by Visit and categorized searchers that had “Multiple Days” for a single Destination and “Multiple destinations” for a single day. That last set of categorizations turned out to be the single most interesting variable in the whole analysis.

Almost every interesting variable turned out to require this type of analyst identification. They didn’t exist natively in the data and they would not likely have been discovered by turning a massive processing system loose on the underlying variables.
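For readers who want to see what these transformations amount to, here is a minimal sketch of three of them. It is not the code we used on the project; the record layout, field names and sample values are assumptions made up for illustration, and the great-circle (haversine) formula stands in for whatever geo-distance calculation you prefer. The first derived value corresponds to what I called “Days till Search” above.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass
class Search:
    """One search record; the layout is hypothetical."""
    searched_at: datetime
    first_leg: datetime
    last_leg: datetime
    searcher_lat: float
    searcher_lon: float
    dest_lat: float
    dest_lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def derive(s: Search) -> dict:
    """Build the analyst-defined variables from the raw fields."""
    return {
        # Lead time between the search and the start of the trip.
        "days_until_trip": (s.first_leg - s.searched_at).days,
        # Last trip leg minus first trip leg.
        "trip_duration_days": (s.last_leg - s.first_leg).days,
        # Distance between where the searcher is and where they want to go.
        "distance_km": haversine_km(
            s.searcher_lat, s.searcher_lon, s.dest_lat, s.dest_lon
        ),
    }


# Made-up sample record, for illustration only.
example = Search(
    searched_at=datetime(2012, 3, 1, 9, 0),
    first_leg=datetime(2012, 3, 15, 12, 0),
    last_leg=datetime(2012, 3, 18, 12, 0),
    searcher_lat=0.0, searcher_lon=0.0,
    dest_lat=0.0, dest_lon=90.0,
)
print(derive(example))
```

None of these three columns is hard to compute. The hard part is knowing that they are the ones worth computing, and nothing in the raw fields tells you that.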

The role of the analyst in identifying important variables is and remains critical to the process. Perhaps someday machine learning systems will have evolved to the point where they can intelligently identify a wide range of potentially interesting meta-data points. This would be quite close to thinking. Such systems do not exist today. Yes, some types of algorithms might recover by brute force some of the meta-data relationships described above (the classification of destinations, for example), and might even improve on them in some respects.

This, however, is by no means certain.

It is far more likely that they will be drowned out by the impact of variables that are more concisely aggregated. A machine learning system might begin to work out a pattern of significance between origin and destination (for example) that implicitly captured the distance factor. On the other hand, it might not, or it might only capture it in a few very popular cases. By creating the meta-data distance variable, we greatly magnify the ability of the analysis to model the actual factor.

The exploration of relevant variables has been and remains one of the core functions of an analyst. It is a step in which great art resides and it is the key to good analysis. The advent of systems that can analyze very large numbers of variables might, at first glance, appear to greatly diminish the importance of this step. That they do not is a tribute to the vast complexity of the world and the impossibility of exhaustive search or unguided exploration. If machine-learning systems do ever subsume this function it will not be by brute force, but by the creation of processes for data categorization that mimic the intuition of the analyst.

Until then, it is our art which must prevail.

Republished with author's permission from original post.

Gary Angel
Gary is the CEO of Digital Mortar. DM is the leading platform for in-store customer journey analytics. It provides near real-time reporting and analysis of how stores performed including full in-store funnel analysis, segmented customer journey analysis, staff evaluation and optimization, and compliance reporting. Prior to founding Digital Mortar, Gary led Ernst & Young's Digital Analytics practice. His previous company, Semphonic, was acquired by EY in 2013.

