Aggregation and Detail in the Big Data World



In last week’s post, I took another crack at defining what makes “big data” real and not just more of the same with an extra helping of hype. Creating a sound definition of big data may not have a huge amount of practical significance – but it isn’t without importance. Knowing what you’re dealing with, and whether and why it really differs from what you’ve done before, has a host of implications for resourcing, process and technology.

But finding a good definition of big data isn’t what I thought most compelling or important in the discussions that triggered these posts. It was the discussion around the nature and role of detail-level data in the big data world. If the theme of last week’s post was that big data skeptics are somewhat right to question the hype but very much wrong to question the reality of big data, today’s post addresses the most common piece of bad advice you’ll get from big data experts.

The Many Advantages of Detail Data

If there’s one thing that virtually every big data vendor and consultancy will tell you, it’s that the key to big data systems is leaving your data at the detail level. Indeed, it’s precisely the ability to do this that makes big data technologies unique. They process enormous amounts of detail data (whether structured or unstructured) fast enough to allow for the data to be kept in its native form. What do you gain from this?

  1. Independence from structure – removing layers of modeling and indexing
  2. Ability to re-define the data model for every analytic problem – critical since most analysis projects require unique aggregations
  3. The ability to join multiple data sources at the detail level as necessary

These are all legitimately important things – and are all pretty much true. That being said, there’s no bigger mistake you can make in a big-data system than to believe that your data must always live at the lowest level of detail.

Taking a Rule One Step Too Far

The problem, as I see it, is that just as the big data vendors have largely failed to understand what makes big data problems unique, they’ve often failed to grasp what’s involved in the solution to those problems. They understand the technology, but not the analytics.

Just as with my last post on what makes big data truly different, it’s the stream nature of the data that drives my thesis – and though I’m going to use digital data as my example, what I’m going to argue applies equally well to sensor data, meter data, and a host of other important big data use-cases.

The problem starts with the lack of meaning at the event level. In digital, the analysis of individual page views isn’t interesting. The page is not a significant entity in marketing. Rather, the analysis is all about sequences of page views – those sequences being a visit or a related part of the visit – in which the visitor is trying to accomplish something. If I’m building a model of the customer journey, I don’t want to capture every page. I want to capture what the sequence of page views was all about and how successful it was.
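To make the idea of a sequence concrete, here is a minimal sketch of sessionization in Python: grouping raw page-view events into visits. The 30-minute inactivity cutoff, the event tuple layout, and the function name `sessionize` are assumptions chosen for illustration, not something prescribed by any particular tool.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Assumption: a visit ends after 30 minutes of inactivity.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events):
    """Group (visitor_id, timestamp, page) tuples into visits.

    Returns a list of dicts, one per visit, holding the visitor id,
    the visit start time, and the ordered list of pages viewed.
    """
    visits = []
    # Order by visitor, then by time, so each visitor's events are contiguous.
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for visitor_id, group in groupby(ordered, key=lambda e: e[0]):
        current = None
        last_seen = None
        for _, ts, page in group:
            # Start a new visit on the first event or after a long gap.
            if current is None or ts - last_seen > SESSION_GAP:
                current = {"visitor_id": visitor_id, "start": ts, "pages": []}
                visits.append(current)
            current["pages"].append(page)
            last_seen = ts
    return visits


# Hypothetical sample events.
events = [
    ("v1", datetime(2014, 5, 1, 9, 0), "/home"),
    ("v1", datetime(2014, 5, 1, 9, 2), "/products/widget"),
    ("v1", datetime(2014, 5, 1, 9, 5), "/cart"),
    ("v1", datetime(2014, 5, 1, 14, 0), "/support/returns"),
    ("v2", datetime(2014, 5, 1, 10, 0), "/home"),
]
visits = sessionize(events)
```

Each element of `visits` is now a unit you can ask meaningful questions about: what was this visitor trying to do, and did it work?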

One of our main practice focus areas here at E&Y is how to use segmentation techniques to identify visit intent – what a sequence of pages tells us about the customer’s intent and interests. Over the past few years (as Semphonic), we’ve developed an entire analytics methodology around this. But whether you use those methods or not, the key point is that almost any analysis you do is going to have to create some level of meaning around the sequence of touches that constitute a visit. If you take the time and trouble to build a complex decision-tree or cluster analysis of visits, that’s an incredibly valuable foundation for nearly every subsequent analysis.

But if you listen to your big data vendor and insist on having nothing but the lowest level of detail data on your box, you’ll have to re-create that full segmentation EVERY single time you want to use it. That’s preposterous.

Think about this. It’s the key point in this post. If you are always constrained to leave all of your analytics data at its lowest level, you’re forced to re-create EVERY analytics step EVERY single time you want to re-use it. This might be reasonable when it comes to fairly simple techniques like sessionization, but it’s madness when it comes to complex steps like segmentation. What’s more, leaving the data in its native detail state dramatically limits the number of analysts who can productively use the data. Big data environments are difficult enough without adding silly rules to make them harder.

Creating new and permanent levels of data that include not cubes but data aggregated into levels of meaning above the detail level (like a single row per visit based on segmentation) doesn’t violate any aspect of the big data paradigm – it’s essential to it.
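As a rough sketch of what such a level might look like, the Python below turns each visit into a single flat row carrying an intent segment and a success flag, then writes the rows out once so later analyses can reuse them. Everything named here is hypothetical: `classify_intent` is a toy rule-based stand-in for a real decision-tree or cluster model, and `GOAL_PAGES`, the URL patterns, and the column names are invented for the example.

```python
import csv
import io

# Hypothetical goal page per intent; reaching it counts as success.
GOAL_PAGES = {"purchase": "/checkout/complete", "support": "/support/solved"}


def classify_intent(pages):
    """Toy rule-based stand-in for a real segmentation model."""
    if any(p.startswith("/support") for p in pages):
        return "support"
    if any(p.startswith(("/cart", "/checkout")) for p in pages):
        return "purchase"
    if any(p.startswith("/products") for p in pages):
        return "research"
    return "browse"


def to_visit_row(visit):
    """Collapse one visit's page sequence into a single flat row."""
    pages = visit["pages"]
    intent = classify_intent(pages)
    goal = GOAL_PAGES.get(intent)
    return {
        "visit_id": visit["visit_id"],
        "visitor_id": visit["visitor_id"],
        "page_views": len(pages),
        "intent": intent,
        "success": goal is not None and goal in pages,
    }


# Hypothetical sessionized input.
visits = [
    {"visit_id": 1, "visitor_id": "v1",
     "pages": ["/home", "/products/widget", "/cart", "/checkout/complete"]},
    {"visit_id": 2, "visitor_id": "v1", "pages": ["/support/returns"]},
    {"visit_id": 3, "visitor_id": "v2", "pages": ["/home", "/products/widget"]},
]
rows = [to_visit_row(v) for v in visits]

# Persist the visit-level table once (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
visit_table = buffer.getvalue()
```

The result is still flat, one row per visit, and the expensive classification step runs once rather than in every downstream query. The detail events stay where they are for the cases that need them.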

This doesn’t mean you don’t need the detail. You do. Not every analytics problem will be captured within or will take advantage of a visit or task-based segmentation. You still have to have the ability to start over, and you’ll use that ability all the time. But a large number of subsequent analytics tasks (including customer journey models) will benefit from having that visit-level segmentation-based aggregation and will be nearly impossible without it.

It’s just a case of people taking something true (don’t create complex data models, cubes, or fixed structures on your big data system) and pushing it to a point where it no longer makes sense – never have anything but the lowest level of detail on your big data system.

Summing Up

When you have visit-level, cluster coded data, you still have something that is, for all practical purposes, flat. It’s no different than the call-center, call-level data your big data consultant will load onto your system without blinking an eye. But that call-center data isn’t the lowest level of detail possible (unless it comes from call digitalization). It’s data aggregated by the call-center system. If you’ll load aggregations onto your system, why shouldn’t you create them on your system too?

Whether the aggregation happens on your big data box or elsewhere is completely irrelevant. This visit-level segmentation coded digital data is just the kind of data your big data systems will chew up and digest wonderfully.

It’s particularly useful as an integration layer between various types of customer touchpoints.

When you need to join call-center, Web, mobile, and bricks-and-mortar touchpoints, you simply can’t do it at the detailed stream level. Data at that level is too disjoint – completely different for every type of stream. By using segmentation techniques to aggregate streams up to visit intents and success measures, you create a level of data perfect for customer journey modeling.
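Here is a small Python sketch of that integration layer, under stated assumptions: Web visits and call-center calls arrive with different native fields, and each is mapped into one shared touchpoint shape (channel, intent, success) before being ordered into a per-customer journey. The field names, the `CALL_REASON_TO_INTENT` mapping, and the sample records are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical visit-level Web data, already coded with intent and success.
web_visits = [
    {"customer_id": "c1", "start": datetime(2014, 5, 1, 9, 0),
     "intent": "purchase", "success": False},
    {"customer_id": "c2", "start": datetime(2014, 5, 1, 11, 0),
     "intent": "research", "success": False},
]

# Hypothetical call-level data as a call-center system might deliver it.
calls = [
    {"customer_id": "c1", "start": datetime(2014, 5, 1, 9, 30),
     "reason_code": "ORDER_HELP", "resolved": True},
]

# Assumed mapping from call reason codes to the same intent vocabulary.
CALL_REASON_TO_INTENT = {"ORDER_HELP": "purchase", "RETURN": "support"}


def from_web(visit):
    """Map a Web visit row to the shared touchpoint shape."""
    return {"customer_id": visit["customer_id"], "start": visit["start"],
            "channel": "web", "intent": visit["intent"],
            "success": visit["success"]}


def from_call(call):
    """Map a call row to the shared touchpoint shape."""
    return {"customer_id": call["customer_id"], "start": call["start"],
            "channel": "call",
            "intent": CALL_REASON_TO_INTENT.get(call["reason_code"], "other"),
            "success": call["resolved"]}


def build_journeys(touchpoints):
    """Order touchpoints in time and group them by customer."""
    journeys = defaultdict(list)
    for tp in sorted(touchpoints, key=lambda t: t["start"]):
        journeys[tp["customer_id"]].append(
            (tp["channel"], tp["intent"], tp["success"]))
    return dict(journeys)


touchpoints = [from_web(v) for v in web_visits] + [from_call(c) for c in calls]
journeys = build_journeys(touchpoints)
```

In this toy data, customer `c1` starts a purchase on the Web, fails, and completes it by phone half an hour later. That pattern is easy to read at the touchpoint level and would be very hard to see by joining page views to call records directly.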

It’s not detail level data, but don’t let that worry you. That it happens to be easily used, meaningful, and valuable should not be taken as three strikes against it!

If you think your big data strategy needs a re-think, drop me a line. And if you’re in Las Vegas this week for VoC Fusion, stop by my presentation on Thursday to chat!

Republished with author's permission from original post.

Gary Angel
Gary is the CEO of Digital Mortar. DM is the leading platform for in-store customer journey analytics. It provides near real-time reporting and analysis of how stores performed including full in-store funnel analysis, segmented customer journey analysis, staff evaluation and optimization, and compliance reporting. Prior to founding Digital Mortar, Gary led Ernst & Young's Digital Analytics practice. His previous company, Semphonic, was acquired by EY in 2013.

