Statistical ETL and Big Data



I began this series with a short essay on the meaning of big data and why digital data is paradigmatic of big data. In digital, the unit of meaning lies above the level of the individual records we collect and cannot be represented as a simple aggregation of those records. A page view, taken in isolation, is largely without meaning, and no simple count, sum or average of page views will capture the meaning inherent in the behavioral stream. This lack of meaning at the detail level is a huge challenge.

In my last post, I described a re-formulation of the event-level digital data to make it more compact, better structured and more robust in supporting queries around sequence and time. That’s a huge step forward in supporting digital analysis, since sequence and time are often critical to uncovering the meaning inherent in the data stream.

Even with these improvements, however, the lack of meaning at the individual record level and the challenges in understanding what the data in the stream means make querying the data extremely difficult.

To understand why this is, I’m going to circle back to one of the fundamental concepts we’ve brought to digital analytics. The idea is simple: to build effective digital analysis you need to have a basic idea of who somebody is and what they were trying to do online. We call this Two-Tiered segmentation (the first tier is a traditional segmentation and the second tier is the visit intent) and it’s at the heart of our approach. That second part, understanding visit intent, is the key. You won’t know – can’t know – whether your Website or Mobile App was successful in a given visit unless you understand why the visitor was there.

Far too many enterprise Websites make the simplifying assumption that every visitor is there to buy, book, or generate a lead. That’s usually not the case – and often it’s so far from the truth as to make aggregated metrics that rely on that mono-purpose assumption completely useless. What’s worse, even when ALL your visitors are on your Website for a single purpose, there are often significant differences in the details of their intent. On an ecommerce site, a visitor may have the intent to buy a specific product, the intent to buy something in a particular category, or the intent to buy something with no particular category in mind (a gift giver, for example). Those are three fundamentally different use-cases.

This matters for almost any significant use of your digital data. Do you want to understand the quality of traffic sourced by your campaigns? You need to understand the visit intent of the sourced visitors.

Do you want to measure content effectiveness? Believe me; it’s relative to the visit intent of your visitors.

Do you want to test & personalize? That, too, is directly related to visit intent.

We tell all our clients that a best-in-class digital dashboard will have NO SITEWIDE metrics. Everything will be segmented by audience and intent.

Well guess what, the same story is pretty much true for your analysts! Four times out of five, they don’t need to see all the detail inside a session. What they need to understand is what the session is about and the extent to which it was successful.

But this brings us back to the problem I first attacked – creating visit-level aggregations. Because those aggregations are, fundamentally, an answer to what a visit was about. And here’s where things get sticky. If you just leave all your data in the raw event form, then every time an analyst wants to use the data, they’ll have to create some way of getting meaning out of that detail. You can bet they’ll all find different – and probably very simple and unsophisticated – approaches to doing this.

Let me put the problem in a different context.

Suppose you’ve got a database of fitness readings from one of the many wearable devices now extant for that purpose. At the detail level, those readings are not dissimilar to digital data. There’s a timestamp, a geo-location and perhaps a set of physiological readings. In this form, the data is very hard to use. You have to use logic to infer what activity the wearer was engaged in. Was the person walking, riding an elevator, flying in an airplane, stuck in traffic on a freeway, or cooking in the kitchen? Each of these types of activities will be defined by some pattern of motion and physiological readings.

You could leave all the activity in its raw form and then every time an analyst queried the data they could choose a different pattern recognition scheme to delineate some set of patterns.

You could do that. But it would be really, really stupid.

The right approach, of course, is to create a rich analysis that matches behavioral signatures to real world activities and then tags the data so that every analysis uses a shared understanding of what each pattern means. Indeed, you probably wouldn’t just tag the raw data, you’d create what amount to session-level aggregations. Those aggregations, in turn, would be the raw detail data that an analyst would use to create patterns that classify who somebody is and how they exercise.
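
To make the wearable example concrete, here is a minimal sketch of tagging readings once and then rolling them up into activity segments. Everything in it is an assumption made for illustration: the `Reading` fields, the speed thresholds in `tag_activity`, and the sample data. A real system would replace the hand-written rules with a properly fitted pattern model.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Reading:
    ts: int            # seconds since the start of the recording
    speed_mps: float   # speed derived from successive geo-locations
    heart_rate: int    # beats per minute


def tag_activity(r: Reading) -> str:
    """Assign one shared activity label per reading.

    The thresholds are hypothetical placeholders, not calibrated values.
    """
    if r.speed_mps > 60:
        return "flying"
    if r.speed_mps > 8:
        return "driving"
    if r.speed_mps > 0.5:
        return "walking"
    return "stationary"


def to_segments(readings):
    """Collapse consecutive readings with the same tag into one segment row."""
    segments = []
    for tag, group in groupby(readings, key=tag_activity):
        rows = list(group)
        segments.append({
            "activity": tag,
            "start": rows[0].ts,
            "end": rows[-1].ts,
            "avg_heart_rate": sum(r.heart_rate for r in rows) / len(rows),
        })
    return segments


readings = [
    Reading(0, 1.2, 95), Reading(60, 1.4, 101),
    Reading(120, 0.0, 80), Reading(180, 0.1, 78),
    Reading(240, 15.0, 72),
]
segments = to_segments(readings)
for s in segments:
    print(s)
```

The point of the design is that `tag_activity` lives in one place: every analyst queries the segment rows rather than re-inventing the labels from raw readings.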

That’s exactly the way it works in digital data too.

The session-level aggregations I suggested in my previous post are far from ideal. They used content categorization and basic counting and summing techniques to create, in my view, very lumpy and imprecise descriptions of visit intent.

When we create a real two-tiered segmentation, we generally build something much more sophisticated and data driven. First, we take a large sample of visitors, survey them, investigate visit intent thoroughly, and tie that to their visit behavior. We then use this to create a statistical model of the behavioral signatures for each type of visit we want to track. These behavioral signatures are then used to score every single web visit and assign a visit intent.
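
The survey-then-score workflow can be sketched in a few lines. This is only an illustration under stated assumptions: the feature vectors, the intent labels and the sample rows are hypothetical, and a nearest-centroid classifier stands in for whatever statistical model would actually be fitted in practice.

```python
from collections import defaultdict
import math

# Hypothetical surveyed sample: (behavioral features, intent stated in the survey).
# Features: share of page views in product, support, and careers content.
surveyed = [
    ([0.8, 0.1, 0.1], "shop"),
    ([0.9, 0.1, 0.0], "shop"),
    ([0.1, 0.8, 0.1], "support"),
    ([0.2, 0.8, 0.0], "support"),
    ([0.0, 0.1, 0.9], "job_seeker"),
]


def fit_signatures(sample):
    """Average the feature vectors per intent to get one signature each."""
    grouped = defaultdict(list)
    for features, intent in sample:
        grouped[intent].append(features)
    return {
        intent: [sum(col) / len(rows) for col in zip(*rows)]
        for intent, rows in grouped.items()
    }


def score_visit(features, signatures):
    """Assign the intent whose signature is closest to this visit."""
    return min(signatures, key=lambda intent: math.dist(features, signatures[intent]))


signatures = fit_signatures(surveyed)

# Unsurveyed visits get an intent from behavior alone.
visits = {"v1": [0.7, 0.2, 0.1], "v2": [0.1, 0.2, 0.7], "v3": [0.3, 0.6, 0.1]}
intents = {vid: score_visit(f, signatures) for vid, f in visits.items()}
print(intents)
```

Only the small surveyed sample needs stated intent; once the signatures exist, every visit in the warehouse can be scored from its behavior.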

Built off the detail data, this segmentation model will include content categorization (similar to what I described in my data model post but not constrained by the necessity to restrict the columns), but it will take advantage of a much wider variety of classifications. We’d typically use EVERY available taxonomy and classification to at least test to see if they are predictive. What’s more, we’d typically include consumption information (how long someone spent inside every classification) and, at least, some basic level sequencing information about when things happened in the visit.
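
As a rough sketch of what those model inputs might look like, the snippet below derives consumption and basic sequencing features for a single visit. The event layout, the category names and the feature names are all hypothetical; a real build would run this across every available taxonomy and then test which features are actually predictive.

```python
from collections import defaultdict

# Hypothetical event rows for one visit:
# (seconds into the visit, content category, seconds spent on the page)
events = [
    (0, "home", 10),
    (10, "product", 45),
    (55, "reviews", 30),
    (85, "product", 20),
    (105, "cart", 15),
]


def visit_features(events):
    """Build consumption and sequencing features for one visit."""
    time_in = defaultdict(int)   # total seconds per category
    first_seen = {}              # position of the first event in each category
    for position, (_, category, seconds) in enumerate(events):
        time_in[category] += seconds
        first_seen.setdefault(category, position)

    total = sum(time_in.values())
    features = {}
    for category, seconds in time_in.items():
        features[f"share_{category}"] = seconds / total        # consumption
        features[f"first_pos_{category}"] = first_seen[category]  # sequencing
    features["entry_category"] = events[0][1]
    features["exit_category"] = events[-1][1]
    return features


features = visit_features(events)
for name, value in features.items():
    print(name, value)
```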

Applying this type of model is far more complex than traditional ETL counting and summing. However, it’s an essential part of building a really good visit-level detail file. Having a single “best-guess” model of visit intent will make a huge difference in the usability of the data. Not only will it make interpretation of the data vastly easier, it will standardize and dramatically improve the overall quality of that interpretation. This is an analysis you can redo, refine and improve constantly – and all that work will pay off across countless queries and uses of the data. With this form of statistical ETL to create a visit intent description as part of your visit-level detail file, you can put a tremendous amount of effort and sophistication into that core intent model and then you can leverage that effort in every single query you run on the subsequent session-level file.

As I hope I’ve made clear, similar Statistical ETL processes are likely to be essential in ANY true big data application. The very nature of big data is that the detail data needs to be patterned before it can be used effectively. In the vast majority of cases, the patterning should be a foundation for subsequent analysis, not a part of every query you make on the data.

So when I say digital is a paradigm case of big data, I really mean it!

Republished with author's permission from original post.

Gary Angel
Gary is the CEO of Digital Mortar. DM is the leading platform for in-store customer journey analytics. It provides near real-time reporting and analysis of how stores performed including full in-store funnel analysis, segmented customer journey analysis, staff evaluation and optimization, and compliance reporting. Prior to founding Digital Mortar, Gary led Ernst & Young's Digital Analytics practice. His previous company, Semphonic, was acquired by EY in 2013.

