Creating a Customer Data Model for Digital (Web Analytics) Data: Handling Product dimensions


Share on LinkedIn

A Two-Tiered Segmentation with RFM metrics is an excellent framework for modeling digital data in the Customer Data Warehouse. With a Two-Tiered Segmentation, you aggregate large amounts of event data into a small number of fields that are truly meaningful from a marketing perspective. The RFM fields capture how often a customer does each type of visit, how successful they are in that visit type, and how recently those visits occurred. The beauty of this approach is that it captures a great deal of what is actually interesting in the event-level data in a very compact form; a form that can be tied to a customer as a one-to-one relationship. This one-to-one relationship makes it easier to understand the data, easier to do selections for targeted marketing, and easier to do customer-level analytics.

With all the virtues of this framework, however, it’s missing one very important dimension. Because the basic level of aggregation is a "visit-type", the data model doesn’t support queries that need to go beneath that level. In the retail world, for example, this model can’t answer questions like "What product is a customer most interested in?" or "What category of product did a customer view most recently?" For a media site, it can’t answer questions like "What interest area is most viewed by a customer?" Not every industry has this issue. Mono-focused sites (many Pharma sites, some Financial Services sites, etc.) don’t have this deeper dimension. That said, when this deeper dimension exists, these are vitally important questions.

The problem is that dimensions like product have a very high cardinality (lots of distinct values) and, in most cases, a many-to-one relationship to the customer. A typical customer buys more than one product and views many products; how can that be represented in a single customer record? Remember, we don’t want an aggregation like "Customer X viewed Y Products and bought Z". We need to know the actual Products!

As I said in my last post, this isn’t an easy problem to solve. I don’t have a magic wand to wave here, but I do have a couple of approaches that are worth considering.

Let’s look at a brute force approach first – an approach that will sacrifice the one-to-one relationship between web data and the customer record.

Let’s suppose I create a table like this:

Product Interactions Table


Product ID



What I’ve done here is a create a single row for every product interaction. The row is keyed by Customer ID and Product ID. The TimeStamp captures when the action happens. The Action field captures the type of Product Interaction (View, Cart, Purchase, Cancel, etc.). This is about as compact a representation of product history for a given customer as can be managed. In theory, I could ditch the TimeStamp and use Action as a representation of the furthest state the customer has achieved in a product interaction (Cart supersedes View, Purchase supersedes Cart, etc.). I think that lacks clarity and sacrifices interesting information for a minor gain, making it a poor tradeoff. If I’m going to go to a many-to-one relationship of Product interactions, I don’t see much point in making it very lossy.

If this is as compact as a Product Interaction representation can be, is it too compact? You’ve probably noticed that I’ve left off a visit identifier. I’ve also left off additional fields that might color that type of interaction. I could, for instance, create a product interaction record that looks like this:

Product Interactions Table (Extended)


Product ID



Visit ID

# of Product Pages Viewed

Time Spend on Product


In and of itself, Visit ID doesn’t add much to our knowledge. However, if I created a record that captured deeper visit Information then having a visit key in the Product Interaction Table might add real value:

Visit Record


Visit ID

Visit Type Code

Visit Type Success




For digital, cost is typically a visit characteristic. By creating a Visit record with cost, I can now answer questions like "How much does it cost to get a view on Product X?" which is certainly worth knowing.

I’m not going to make a definitive judgment about the need for additional fields in the Product Interaction table. They clearly add value in at least some cases but I suspect that the appropriate fields need to be discussed on a case by case basis for a specific business.

For now, it’s enough to understand that a Product Interaction Table can be quite compact, but sacrifices the one-to-one relationship with Customer that we are striving for. Because it is relatively non-lossy, however, it can (particularly in Extended Versions) answer nearly any question about Product interactions.

This structure could certainly support questions like:

"What product did Customer X view most recently?"

"What product has the highest conversion rate for Customers of Segment Y?"

"Which customers are most interested in Category X Products (by View)?"

and so on. It’s rich and flexible. But building the queries to answer those questions isn’t as easy or as efficient as it would be if we could aggregate the data into a one-to-one relationship with Customer.

Here’s an approach that uses our nascent Visit Record to capture aggregated Product History.:

Visit Record (Extended)


Visit ID

Visit Type Code





Purchased, Carted and Viewed are all meant to be lists of SKUs. Of course, using text strings is a bit of cheating and certainly makes for its own kind of clumsiness. I could just bite the bullet and normalize this as:

Product Visit Table

Visit ID


Product ID

All I’ve really done here (compared to my original Product Interactions Table) is change Customer ID to Visit ID and remove the TimeStamp. I should be able to get rid of the Timestamp because I have the Visit TimeStamp to rely on. Dropping TimeStamp is a bit lossy (I can no longer answer intra-visit questions about Product viewing order), but probably acceptable. On the other hand it introduces another join into the equation every time I make a query. I have to join Visitor to Visit and Visit to Product Visit.

This is all just traditional data modeling. I’d like to be cleverer and find a real aggregation of the sort represented by the Two-Tiered segmentation. The Two-Tiered Segmentation REPLACES the original data with a different, more meaningful construct. Everything I’ve been doing here just changes the access path to the same data.

One technique that might work is customer segmentation via clustering. I’ve written in the past about Behavioral Segmentation, and one of its unique properties is that it can represent many complex dimensions of customer behavior in a single scored or coded variable.

Suppose you built a product interest segmentation. The goal of the segmentation would be to consolidate customer interest into a single code or score. Here, for example, is a Travel Industry Behavioral Segmentation that I’ve shown before.

Sample Segmentation

In this case, a visitor will have a single Segment code that represents a roll-up of their entire product behavior on the site. In theory, segmentations of this sort can be extended to provide deeper coverage. An online retailer, for example, might have a set of segmentations for each major category. In effect, the Customer Data Record would look like this:

Customer Record


Visit-Type RFMS

Electronics Segment

Appliance Segment

Apparel Segment


Not only does this structure allow for some very fine-grained queries:

"Find all Customers who are High-End Electronics Shoppers"

"Find all Customers interested in Mens&Womens Apparel"

It can be extended by using scores for each Segment to create queries like:

"Find all customers most interested in Electronics"

"Find all customers who are more interested in Appliances than Apparel"

"Find all customers who are very interested in high-end Electronics but not high-end Appliances"

All very cool. Segmentation has the additional virtue of helping the marketer figure out interesting questions. A good segmentation isn’t just an aggregation – it’s a guide to action.

Of course, it’s also true that segmentation is VERY lossy. With segmentation, you still can’t answer basic questions like:

"Find all visitors who viewed Product x?"

You really can’t get by with a model that won’t answer that question. So while I think segmentation is a good technique, an important technique, it’s just too lossy to be a complete answer.

Here’s another classic approach (one widely frowned on by data modelers – but I’m not going to let that stop me). I’ll call it the Top X approach. The idea behind the Top X approach is simple. Even when we have a many-to-one relationship, it’s often the case that only a few of those relationships are significant or necessary to understand.

For instance, we might create a customer record that looks like this:

Customer Record


Visit-Type RFMS

Top Product Viewed

2nd Product Viewed

Most Recent Product Viewed

2nd Most Recent Product Viewed


In effect, we take our RFM approach (I didn’t show Success – the M dimension but it would be there) and use it for the Top X products where X is some arbitrary (very small) number. The Top X approach can be applied to meta rollups as well (such as product categories). That’s important because a Top X product list, in and of itself, can’t be rolled up into a Top Categories list – you have to do that independently using the detail data.

I mentioned that data modelers often hate Top X models and consider them inelegant and clumsy. It’s true, but I think a Top X model is worth considering in cases where the vast majority of business questions are focused on a single "top" product or interest – whether that be from a Recency, View, Purchase or Category level. If 90% of the queries involve picking the top choice of a customer, then pre-aggregating those in a Top X model will make everyone’s life a lot easier.

Note that you can extend our original Customer Record with a Top X set of fields that are specific to that visit type:


[Visit Type 1 Frequency]

[Visit Type 1 Success]

[Visit Type 1 Recency]

Top Product Viewed

[Visit Type 2 Frequency]

Top Support Method

Let’s suppose that Visit Type 1 is Early-Stage buying. We don’t need Top Products Purchased for that Visit Type because it’s inherently view-based. If Visit Type 2 is a Support Visit, we can pick an entirely different dimension (Support Path) to Top X.

This is a pretty efficient and rich representation albeit still lossy. We can answer almost any "top" question very efficiently. However we can’t answer any questions that extend beyond the TopX, and we can’t answer questions like this:

"Which Product Types have the highest Conversion Rate for Visitor X?"

Nevertheless, I think this last approach will frequently produce the most useful customer data record.

I’ve covered three different techniques for handling product (or similar detail) dimensions when incorporating Digital Data into the warehouse: brute force many-to-one, cluster segmentation, and Top X, as well as several hybrid approaches. None of them are perfect and I’m afraid I don’t have a silver bullet to suggest. I think cluster segmentation is highly appropriate in most cases but is never a complete answer. The brute force many-to-one relationship is probably worth incorporating into most models because it is so flexible and, in relatively compact format, inexpensive from a storage if not from a query perspective. It’s more compact and easily understood that event-level data streams.

And while I realize that Top X approaches can be clumsy, I think the commonness of questions involving a single X instance (Most Recent, Most Viewed, Most Purchased) make it a relatively attractive aggregation particularly in conjunction with the Two-Teired Segmentation metrics.

This isn’t, in other words, a case of either/or. All of these approaches may be appropriate in a single data model.

In a recent post, I quoted a Philosophy Professor who told me that if someone gives you six arguments for something it probably means none of them are really convincing or they would have stopped at one. I feel the same way here. I don’t love any of these approaches the way I do the Two-Tiered Segmentation with RFM model approach described in my last post. That’s a far more elegant solution to a problem than what I’ve laid out here.

Alas, it’s just not a solution to this particular problem. So if you have other ideas, I’d be interested in hearing about them, and if you’re struggling with similar issues, then I hope that the approaches I’ve outlined here are at least somewhat helpful!

In my next post, I’ll show out the ideas laid out in these last two posts can be used to extend a segmentation beyond the Web, and even beyond digital, into a data model for that elusive 360 degree view of the customer we’ve all been hearing so much about!

[Click here for a summary of this extended series on Digital Analytics and Database Marketing]

Republished with author's permission from original post.

Gary Angel
Gary is the CEO of Digital Mortar. DM is the leading platform for in-store customer journey analytics. It provides near real-time reporting and analysis of how stores performed including full in-store funnel analysis, segmented customer journey analysis, staff evaluation and optimization, and compliance reporting. Prior to founding Digital Mortar, Gary led Ernst & Young's Digital Analytics practice. His previous company, Semphonic, was acquired by EY in 2013.


Please use comments to add value to the discussion. Maximum one link to an educational blog post or article. We will NOT PUBLISH brief comments like "good post," comments that mainly promote links, or comments with links to companies, products, or services.

Please enter your comment!
Please enter your name here