I was down at Splunk this Friday and they reminded me that a few weeks back Svetlana Sicular of Gartner wrote an excellent piece about the state of ‘big data’ (with, as it happens, a glowing mention of Splunk). At its center was Gartner’s truly wonderful concept of the hype cycle. Instantly recognizable to all of us, the hype cycle looks like this:
Svetlana argues that big data may be passing the “Peak of Inflated Expectations” and entering the “Trough of Disillusionment.” If this sounds like a digital analytics version of the tots’ game Candy Land, that just makes it all the more priceless.
“A six. Yay! I’m at the Peak of Inflated Expectations.”
“A three. Oh no! I’m in the trough of disillusionment and have to go back to where I started.”
I’m not completely sure that Svetlana is right about where we stand. Within the sophisticated measurement and enterprise intelligence community, I think she’s dead-on. A lot of analytics pundits now are feeling that same cooling breeze and beginning to take the “Big data? it’s all really the same old stuff” tack as a way to keep themselves ahead of the curve.
But what could change that dynamic is that analytics and big data – for the first time ever – show signs of punching out into the broader business and cultural realm. Within our community, the hype around big data has gone about as far as it can go. But measured across the broader business culture, the hype is still nascent. It could grow much, much bigger if it truly metastasizes from us to the broader community. Extend the hype-cycle concept to say that it’s community specific and you see that hype that leaps boundaries will, gangnam style, re-draw the curve.
However, if we are indeed about to enter the “Trough of Disillusionment”, I’m going to remain contrarian (as is Ms. Sicular if I read her correctly). I’m not here to defend the hype around big data, the sloppy thinking, or the vendor overload. But I believe that the purveyors of “it’s all really the same old stuff” are selling a different form of snake oil. In 2002, after the collapse of a truly epic hype cycle, there were plenty of direct mail guys running around saying ‘I told you so.’ At least now they’ll have their Saturdays off.
No doubt, one of the reasons ‘big data’ is vulnerable to hype is that it’s so poorly defined and fuzzy a concept. The general consensus in the community is that “big data” involves the four V’s: Volume, Velocity, Variety, and Veracity. The IBM Website has a pretty good summary of what I take to be the conventional wisdom.
I don’t much care for this definition. It’s fuzzy, incomplete, and misses the truly key points.
To see why, let’s work backward through that list of V’s.
Is the need for veracity (read data quality) really different in the big data world? I don’t think so. Most big data systems lack anything like the quality and governance of traditional transaction systems and, you know what, they don’t even need that level of quality. So not only doesn’t big data push the envelope in terms of data quality, it often relaxes it a little bit.
Variety comes closer to the truth but it’s also a bit of a canard. At least in digital, the range of big data sources is varied and challenging. On the other hand, big data folks often talk about mixing structured and unstructured data and then misdescribe most digital data as unstructured. In the digital space, only social media listening data is truly unstructured. Nearly every other source, including Web analytics data, is structured. What’s more, the vast majority of big data systems and technologies have NOTHING to do with mixing unstructured and structured data. There aren’t really any analytic methods or tools for dealing with unstructured and structured data at the same time and there’s usually not much reason to do it either.
I’m not sold on Velocity either. Are we really getting pushed into big data systems to do real-time analytics? I’m not seeing that. In fact, I’d argue that real-time is a completely different technology stack with its own unique set of problems and solutions.
I’m not going to argue against the last ‘V’ – volume. Now that would be contrarian! But we all know that just saying big data is new and different because it involves lots of data isn’t a good or complete answer. There was lots of data in those credit card transaction systems I worked with back in the ’90s, but I don’t consider them big data solutions.
So if we have one ‘V’ not four and that ‘V’ isn’t very illuminating, you might be left with the increasingly popular pundit’s view that big data really is just the ‘same old stuff’ with higher data volumes.
Except it isn’t.
I’ve expounded Semphonic’s definition of big data before. In its most interesting form, big data is about the move from aggregate level data analysis to detail level data analysis. It’s about the difference between analyzing data in streams and analyzing data as entities.
The vast majority of traditional data analysis techniques work on blocks of data that are discrete entities and have direct meaning. ‘Customer’ is a good example. Traditional marketing segmentations, for example, worked on data at the customer level. The customer’s demographics. The customer’s relationships. The customer’s purchases. The customer’s attitudes. We built segmentations where each record represented a single customer and each field was a discrete representation of that customer.
The vast majority of analytic techniques (not just segmentation) are geared toward this type of block, entity-level analysis.
And it isn’t just analysis, it’s reporting too. Almost the entirety of the BI world, in both the analytics and reporting senses, is driven by this concept of analysis at the entity level. Cubes work because you can meaningfully aggregate the underlying entities.
So what’s different about big data? In the digital world (and in a number of other big data scenarios), the data is no longer collected and analyzed at the entity level. Web analytics data (which I consider to be a paradigm case) consists of lots and lots of server calls (page views). You can treat these server calls AS IF they were the entity of meaning. That’s exactly what systems like SiteCatalyst do. They aggregate server calls as if they were meaningful. Unfortunately, they aren’t very meaningful, and this aggregation doesn’t answer most of the important questions marketers have.
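To make that aggregate view concrete, here is a minimal Python sketch. It assumes a made-up log of (visitor, page) server calls; the data and function names are hypothetical, and it is not meant to describe how SiteCatalyst or any other tool works internally. It only shows what happens when each server call is counted as if it were the entity of meaning.

```python
from collections import Counter

# Hypothetical server calls: (visitor_id, page). Invented for illustration only.
SERVER_CALLS = [
    ("v1", "home"), ("v1", "pricing"), ("v1", "signup"),
    ("v2", "signup"), ("v2", "pricing"), ("v2", "home"),
]


def page_view_counts(calls):
    """Site-level aggregate: count server calls per page."""
    return Counter(page for _, page in calls)


def views_per_visitor(calls):
    """Visitor-level aggregate: count server calls per visitor."""
    return Counter(visitor for visitor, _ in calls)


if __name__ == "__main__":
    print(page_view_counts(SERVER_CALLS))
    print(views_per_visitor(SERVER_CALLS))
```

In this toy data the two visitors walked the site in opposite orders, yet both aggregates treat them identically: each page was viewed twice and each visitor produced three views. The counts are correct, but they say nothing about what either visitor was trying to do.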
When we use Web analytics data, we want to understand what the stream of page views says about the Customer – so just counting page views doesn’t work. But here’s the tricky part – we can’t just add up page views at the Customer level instead of the site level to get the answer. The page view in and of itself doesn’t have any real meaning and summing them doesn’t add meaning. Instead, we need to infer meaning from the pattern of the stream. How broad the stream is, how fast it goes, which pieces flow together, and what comes first and what comes later. These “stream” concepts are what drive things like Semphonic’s Two-Tiered Segmentation (they form the visit intent part) and, believe me, they are different from traditional segmentation.
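As a rough illustration of those stream concepts, the Python sketch below turns a visitor’s page views into a few pattern-level features: breadth, pace, what came first, and which pages flowed together. The event data, field names, and feature choices are all assumptions made for the example. This is not Semphonic’s Two-Tiered Segmentation, just a hint at the kind of inputs a stream-based method starts from.

```python
from collections import defaultdict

# Hypothetical page-view events: (visitor_id, seconds_since_start, page).
EVENTS = [
    ("v1", 0, "home"), ("v1", 20, "pricing"), ("v1", 50, "signup"),
    ("v2", 0, "signup"), ("v2", 300, "pricing"), ("v2", 900, "home"),
]


def stream_features(events):
    """Describe each visitor's stream by its shape rather than by a sum of views."""
    streams = defaultdict(list)
    # Order each visitor's views by time so sequence-based features make sense.
    for visitor, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        streams[visitor].append((ts, page))

    features = {}
    for visitor, views in streams.items():
        pages = [page for _, page in views]
        duration = views[-1][0] - views[0][0]
        features[visitor] = {
            # How broad the stream is: number of distinct pages touched.
            "breadth": len(set(pages)),
            # How fast it goes: average seconds between consecutive views.
            "avg_gap_seconds": duration / max(len(views) - 1, 1),
            # What comes first.
            "first_page": pages[0],
            # Which pieces flow together: ordered page-to-page transitions.
            "transitions": list(zip(pages, pages[1:])),
        }
    return features


if __name__ == "__main__":
    for visitor, feats in stream_features(EVENTS).items():
        print(visitor, feats)
```

Both hypothetical visitors have three page views across the same three pages, so a simple count cannot tell them apart. The stream features can: one moves quickly from the home page toward signup, while the other starts at signup and drifts slowly back to the home page.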
Let me give you an analogy. You might decide that, to better understand how best to swing a bat to hit a curveball, you’re going to collect thousands of sensor data points about the location of the hitter’s legs, torso and arms as they swing. This type of analysis makes perfect sense and it’s truly a big data application. Looking at any single sensor data point is completely uninteresting. Adding up data points doesn’t make sense either. You can’t just sum arm movements – that a hitter has 700 different arm movements doesn’t tell us anything about their actual swing. To analyze this data, you need to infer meaning from the stream. That’s the hard part. That’s the part that traditional data analytics techniques DON’T handle.
Traditional analytics studied hitters by things like batting average, on-base percentage and slugging percentage. That stuff is cake in traditional data and analytic solutions, and these methods are great for deciding which hitter is best. But those methods are completely useless for making any one hitter better. Analyzing swing data? They won’t do that.
Web behavioral data is just the same.
This task of inferring meaning from streams is the real difference between big data analytics and traditional database reporting, BI, and analytics. Yes, along with the fact of streams comes, in many cases, much higher volumes. When you go down to the detail level you almost always create much higher volume. And certainly volume complicates the problem. But it isn’t the core problem. You can have just 10,000 records of stream data and it’s STILL unanalyzable and unreportable with traditional statistical and BI methods.
That’s why a view like this that casts the big data world as the same old problems but with more data and different sources is so wrong. If you’ve never actually done any analysis, that story might seem right. But if you can see why analyzing batting-swing sensor data is fundamentally different than analyzing batting averages, you’ll realize why thinking big data is just more of the same is going to land you deep in that trough of disillusion.
So that’s the Semphonic story. Big data isn’t about the four V’s. But it isn’t just hype and it most definitely isn’t the same old thing. It’s a fundamental shift in the techniques necessary to understand data when you move analysis to a level of detail where individual entities only have meaning in the context of their broader stream.
I hope you feel as I do that this is a beautiful, elegant definition that captures far more of the truth about what big data means (or should mean). After all, we make the definitions, and it’s still early enough in the process to decide exactly what we mean by big data.
This shift to detailed stream data really does change everything. It changes what you need from your ETL tools and it significantly increases the demand for complex ETL. It changes the demands you place on your data model and even what a data model is for. It reshapes what you might expect from your data collection infrastructure. It changes and radically increases the demand for algorithmic analytic solutions. It changes the analytic methods appropriate to the data. It’s why solutions like Hadoop really do matter (and it’s also why languages like Pig are still missing important capabilities). It’s why constructs like Aster’s nPath are really important. It’s why solutions like Splunk don’t look anything like traditional data warehouses.
Big data isn’t just hype (though it’s much hyped and most of the hype is nonsense) and it isn’t the same old stuff. I’m pretty sure I’m not going to spare us all a journey into the trough of disillusionment. But finding a way to intelligently define big data goes hand-in-hand with starting to think productively about big data problems. Knowing why big data is different, and why that matters, is the first step forward.
“A five? Hooray! I’ve reached the Slope of Enlightenment.”