The Shape of the Analytics Technology Stack: More Thoughts from X Change



One of the most common discussion points at this year’s X Change concerned the right shape for today’s analytic technology stack. This isn’t so much a question of vendors (fortunately, since at EY we have to be very careful about vendor-specific recommendations) as it is a question of technology type. It matters much less to most people in our industry which data warehouse appliance you choose than whether, for example, a data warehouse appliance is the right type of tool for a certain part of the technology stack. Nor is this all about hardware. Similar questions exist about the role of statistical analysis tools, of programming languages, of SQL, and of data visualization tools.

I can’t hope to answer all of those questions in a single post, nor is it reasonable to think there’s just one answer for every situation. But I did want to outline some of the most interesting discussions and conclusions I took away from X Change around the shape of analytics technology.

Lesson #1: Stay out of my Sandbox

When an organization builds out a technology stack around analytics, one of the key decision points revolves around the role of the primary analytics platform. On the one hand, there’s a strong belief in the analytics community that the analytics platform needs to be a “sandbox” – by which we mean a place entirely under the control of the analytics team, unrestricted by formal schemas, change requirements, limitations on queries, and all the other maddening apparatus of IT governance. Many of our clients have struggled as they operationalized analytics and began to build reports or production jobs on the analytics platform. Not only do those production jobs often go awry (because the environment is poorly controlled), but the problems that arise often lead to crippling restrictions on the use of the analytics platform. So the need for a sandbox is real.

On the other hand, the whole point of doing analytics is to operationalize the results. If your analysts come up with a great segmentation, you need to be able to use that segmentation. This conflict between the need for a sandbox and the need to operationalize analytics is a big deal.

This is one issue where I feel like I’m in the mainstream of the analytics community: the need for a “sandbox” environment, with all that implies, is real. So is the need to operationalize out of (but also OUTSIDE of) that environment. Put the two together, and there’s a clear need for a well-defined process for moving analytics into reporting and operational systems and, in most cases, a clear case for systems both below and beside the analytics platform to support reporting and personalization.

Lesson #2: Programmers Wanted

As I discussed in my post on data science, the analysis of digital data (and other big data analytic areas) requires a different kind of analytics – an analytics where the order of events, the time between events, and the pattern of events are the significant components of analysis. Where order, time-between, and pattern are significant, tools like SQL are poor performers. While SQL can accomplish (almost) anything, it can be very challenging to code certain types of queries in SQL. I’ve seen skilled SQL programmers spend days constructing a massive query that could be duplicated in about 30 minutes in C#, and the C# version will run much faster. That isn’t because the SQL guys aren’t sharp and it certainly isn’t because SQL is a poor tool. But like any tool, SQL has its strengths and weaknesses. It isn’t the best tool for every job. It isn’t a good choice for building custom analytics and it isn’t a good choice for stream ETL.
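To make the order, time-between, and pattern point concrete, here is a minimal sketch in Python of the kind of question that is awkward in SQL but short in a general-purpose language: did a visitor perform a given sequence of events back to back, with no more than some number of seconds between steps? The event names, the sample rows, and the helper functions (`group_by_visitor`, `gaps_in_seconds`, `matches_pattern`) are all hypothetical and chosen purely for illustration; a real implementation would read from your own event store.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical clickstream rows: (visitor_id, ISO timestamp, event name).
EVENTS = [
    ("v1", "2013-10-01T10:00:00", "search"),
    ("v1", "2013-10-01T10:00:40", "product_view"),
    ("v1", "2013-10-01T10:05:40", "add_to_cart"),
    ("v2", "2013-10-01T11:00:00", "product_view"),
    ("v2", "2013-10-01T11:00:10", "search"),
    ("v2", "2013-10-01T11:02:10", "product_view"),
]


def group_by_visitor(events):
    """Return {visitor_id: [(timestamp, event), ...]} with each list in time order."""
    streams = defaultdict(list)
    for visitor, ts, name in events:
        streams[visitor].append((datetime.fromisoformat(ts), name))
    for stream in streams.values():
        stream.sort()
    return dict(streams)


def gaps_in_seconds(stream):
    """Seconds elapsed between each pair of consecutive events in a stream."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(stream, stream[1:])]


def matches_pattern(stream, pattern, max_gap_seconds):
    """True if `pattern` occurs as consecutive events, in order, with every
    step no more than `max_gap_seconds` after the previous one."""
    n = len(pattern)
    for start in range(len(stream) - n + 1):
        window = stream[start:start + n]
        if [name for _, name in window] != list(pattern):
            continue
        if all(gap <= max_gap_seconds for gap in gaps_in_seconds(window)):
            return True
    return False


streams = group_by_visitor(EVENTS)
quick_searchers = [
    visitor for visitor, stream in streams.items()
    if matches_pattern(stream, ("search", "product_view"), max_gap_seconds=60)
]
```

In this sample, only visitor `v1` goes from a search to a product view within a minute; `v2` does the same two steps two minutes apart. Expressing the same rule in SQL typically means self-joins or window functions over an ordered event table, which is where the days of query construction tend to go.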

SQL and standard ETL aren’t the only casualties. Big data analytics will often demand significant modeling that isn’t available in canned routines in SAS or SPSS. Oh, and I don’t think that most machine-learning algorithms are right for the job either. This is really important. It means that much of the interesting analytics you’re going to want to do on a big data system will have to be built from the ground up.

So what are the good options for all this ground-up work? Frankly, I’m not sure there is an alternative to full-on programming languages.

I know that isn’t great news. Programming is expensive, error-prone and difficult to maintain. It’s also a fairly uncommon skill among analysts. All bad things.

But we have to deal with it. Programming is going to be a critical in-team (non-offshore) skill. Big data applications routinely need programming skills in both ETL and analysis, and I don’t see that changing any time soon.

If you’re thinking about technology platforms, it’s worth noting that not every language is supported on every platform and, depending on your history and preferences, C++ may look better or worse than C# or Java or Python.

Lesson #3: What, my massively scalable big data system isn’t good for reporting and every kind of analytics?

It seemed like almost every enterprise at X Change that had invested heavily in Hadoop systems in the last 12-18 months had come to a very similar conclusion: some kinds of analytics and ETL work great in these systems, but other kinds not so much. And when it comes to driving reporting and visualization off of these systems, it has been difficult or flat-out impossible. A lot of this, in my opinion, comes back to what makes big data challenging in the first place. When you’re doing the right kind of analytics (order, sequence, pattern based), these are the right kind of systems. When you’re not, they aren’t necessarily an improvement. Not every problem is a big data problem and not every kind of analytics works better on a big data platform. Particularly when it comes to reporting, significant aggregation is nearly always essential to deliver adequate performance, and that hasn’t yet changed. Most shops seem to have ended up using their big data environment as a new, highly flexible way to generate cubes for reporting. That’s a bit disappointing, but welcome or not, that seems to be the place we’re at.
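For readers less familiar with the idea, the “generate cubes for reporting” pattern amounts to rolling detailed rows up into small summary tables keyed by a few dimensions, so the reporting tool never touches the raw data. The following is only a rough sketch in Python; the column layout, the sample rows, and the `build_cube` helper are assumptions made up for illustration, not a description of any particular shop’s pipeline.

```python
from collections import Counter

# Hypothetical detail rows: (date, channel, device, page_views).
ROWS = [
    ("2013-10-01", "search", "mobile", 3),
    ("2013-10-01", "search", "desktop", 5),
    ("2013-10-01", "email", "mobile", 2),
    ("2013-10-02", "search", "mobile", 4),
    ("2013-10-02", "search", "mobile", 1),
]

# Position of each dimension column within a row (assumed layout).
DIMENSION_INDEX = {"date": 0, "channel": 1, "device": 2}


def build_cube(rows, dimensions):
    """Sum page views for every combination of the chosen dimensions."""
    cube = Counter()
    for row in rows:
        key = tuple(row[DIMENSION_INDEX[d]] for d in dimensions)
        cube[key] += row[3]
    return dict(cube)


# Small aggregates like these are what a reporting tool would actually query.
by_date_channel = build_cube(ROWS, ("date", "channel"))
by_device = build_cube(ROWS, ("device",))
```

The big data platform does the heavy lifting of producing these aggregates from the full event detail; the reporting layer then works against a handful of rows per dimension combination, which is why aggregation remains the practical route to adequate report performance.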

And speaking of big data, the conference season comes to a close for me in early November at IBM’s Information On Demand Conference, where I’ll be presenting my theory of “big data” (in 20 minutes or less). I love this presentation. If you’re there, check it out!

Republished with author's permission from original post.

Gary Angel
Gary is the CEO of Digital Mortar. DM is the leading platform for in-store customer journey analytics. It provides near real-time reporting and analysis of how stores performed including full in-store funnel analysis, segmented customer journey analysis, staff evaluation and optimization, and compliance reporting. Prior to founding Digital Mortar, Gary led Ernst & Young's Digital Analytics practice. His previous company, Semphonic, was acquired by EY in 2013.

