(Re)Defining the role of the data scientist



In the last couple years the term data scientist has become an accepted part of the broad analytics space. Everywhere I look, I see clients building data science teams and looking to hire that most elusive of breeds – data scientists. While I have been (and remain) highly skeptical of the term itself, I’m not opposed to the broader shift. As absurd as it may be that renaming a job role from statistician to data scientist triples its cost and increases its voice, it seems to me that this new reality is a better place to be – and not just for data scientists but for the organizations that hire them. Analytics and data-driven decisioning deserved a stronger voice in the enterprise and the correspondingly increased rewards that entails. If grafting a new title onto the role helps create that reality, then I, at least, don’t see much harm.

It does trouble me, however, that many of our clients don’t have a strong grasp on the role of these data scientists in the enterprise and how that role fits within existing analytics groups. It also troubles me when organizations create massive disparities between data scientists and analysts who are working on the same teams and who are largely equal in value. One pernicious consequence of over-investing per-capita in data science can be the loss of all your best analytics talent that isn’t blessed with the right title. That isn’t a good thing at all.

In this post, I want to lay out in broad terms the roles in a modern analytics team, how they fit together, and which, if any, might reasonably be described as data science. This isn’t something I’ve worked through into a polished deliverable, but here’s a quick mental snapshot of the core analytics functions (not including PM, Management, IT support, etc.) common in today’s enterprise:

Analytics Roles

Intake & Production ETL: The intake of data into analytics systems and the initial base ETL to ready them for use as well as the ETL necessary to operationalize analytics. This function is heavily IT focused and is often done with either traditional ETL tools (like Informatica), with Hadoop open-source processing engines (like Sqoop or Storm), or with SQL engines (including Hive).

Taxonomy & Meta-Data Systems: Someone has to manage the taxonomies, hierarchies and meta-data that drive analysis and reporting in the organization. That someone is all too often a blank space in the enterprise, but as companies begin to recognize the critical role that robust classification plays in the enterprise this role becomes more real. IT organizations have traditionally owned meta-data infrastructure and business has usually owned meta-data definition. But analytics is the proper owner of (and key stakeholder in) the definition, creation and maintenance of classification systems.

Statistical ETL & Analytics Foundation: Transforming big data into usable data for the enterprise almost always involves the statistical interpretation of patterns in the data to make them understandable. Exploring and then locking down these patterns into the core building blocks for analysis is tackled by these folks. In a way, I’ve been exploring the role of statistical ETL and building an analytics foundation in this whole broad series. If you don’t standardize the approaches to these big data streams, not only will every analysis be much harder to execute, but each will be built on a shifting surface that makes comparability and repeatability nearly impossible.

Descriptive Analytics & Reporting: With Descriptive Analytics & Reporting, we begin to actually deliver analysis to the business. This role is a traditional digital/business analytics role whose function is to explore data, answer business questions, and drive optimization and controlled experimentation in the enterprise. The tool-kit for this is typically data visualization and basic statistical analysis (cross-tabulation, correlation, variation) as well as our core digital analytics SaaS toolkit.

Model-Based Analytics (Descriptive, Predictive, Prescriptive): Here at EY we tend to describe analytics maturity as moving from descriptive to predictive to prescriptive and to imply that these are fundamentally different techniques. If taken as a statement about the level of maturity around the use of analytics, I think this kind of makes sense. But I’m not really a fan of it when taken as a statement about types of analytics (where the techniques often blend together) nor do I fully accept the underlying and implicit assumption that prescriptive models are more valuable than descriptive models. It may be that a model that is explicitly prescriptive is intuitively easier to operationalize, but a customer segmentation that is purely descriptive can be used to drive very powerful personalization. To me, the salient role here is that of building business models, be they regression models, decision trees, clustering models, SOMs, etc. And whether they are descriptive, predictive, or prescriptive, it doesn’t seem to me that the role changes much. I’d say the same thing, incidentally, about “machine learning”. Some people take a broad enough view of what constitutes machine learning that it would clearly fall in this category. Others might be tempted to define it as a narrower set of statistical analysis techniques. But even with a narrower definition, I see no reason why it wouldn’t or shouldn’t fall in this same role and domain.

Data Journalism: At the very best analytics organizations these days, there’s a recognition that the analysts who fulfill these two previous roles (Descriptive & Model Analytics) often aren’t well suited to build the communication vehicles for the data to the rest of the organization. What’s more, there’s a growing realization that the communication of the analytics to the organization is singularly important and is most definitely NOT solved by data democratization. Hence the growing understanding of the need for a data journalism role that involves the compact, powerful communication of analytics findings to consumers in the organization. This role is more often about writing, graphic design, and communication than analytics expertise, but it also demands practitioners who are comfortable with and passionate about data.

Most enterprise analytics organizations already have ETL programmers, descriptive analysts and report builders and modelers of one sort or another. What’s commonly lacking are taxonomy/classification owners, statistical ETL creators and data journalists.

Does the term data scientist fit any of these open roles?

It’s possible to think about a data scientist as a “full-stack” analyst who crosses all of these functions. It’s also reasonable to think of a data scientist as a cross between the last four roles (Statistical ETL, Descriptive Analytics, Modelling and Journalism) or to think about a data scientist as someone who is comfortable with newer machine learning algorithms but who would otherwise probably be described in the same bucket as the modelers – basically as a new tech statistician.

It’s subjective of course, but I don’t like any of these definitions. The idea of a full-stack analyst who spans ETL to Data Journalism is a pipe-dream. Sure, there may be a few people in the world who can do that. Good luck hiring one. Enterprises create departments and functions precisely because most people don’t excel at widely different tasks and, secondarily, because it’s rare that the integration of those tasks outweighs their differential value. If you have someone who’s really good at building models, how much time do you want them to spend on traditional ETL or reporting?

On the other hand, the statistical ETL role can plausibly be described as a data science role. In fact, I think it’s probably the best way to think about data science because it captures a gap in the common organization and skill set of today’s enterprise that is deeply linked to current big data analytic problems.

Here’s an example of what I mean by statistical ETL and how this role might play out.

Today’s utilities are struggling with one of the paradigm cases of big data – smart meter readings. In the old days, of course, a meter-reader would come around every few months and write down how much gas or electricity you consumed. That data point – 3 month usage – is traditional data. Analysts don’t need help or assistance in baking it into reports or models. With smart meters, however, the situation has changed. Smart meters send very frequent (every 5-15 minutes) updates of usage to create a stream of detailed data.

On the one hand, it’s no trick to use that detailed data either. You can just sum it into a 3 month number and you have exactly the same data you’ve always had without a meter reader to source it (this kind of summing doesn’t work with digital data – one of the reasons why digital is such a good paradigm for big data). Still, one can’t help but feel that summing up 15 minute readings into 3 month readings isn’t the only or best thing you could do with this data.
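To make the contrast concrete, here is a minimal Python sketch, assuming readings arrive as (timestamp, kWh) pairs. The function names and the sample data are hypothetical, invented purely for illustration. It collapses interval reads into the single total a meter reader would have recorded, then computes an hour-of-day profile from the same reads, which is the kind of detail the total throws away.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def billing_total(readings):
    """Collapse interval readings into one period total (the 'meter reader' number)."""
    return sum(kwh for _, kwh in readings)


def hourly_profile(readings):
    """Average usage per reading for each hour of the day.

    This keeps the time-of-day shape that the single total discards.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, kwh in readings:
        sums[ts.hour] += kwh
        counts[ts.hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


# Hypothetical sample: one day of 15-minute reads, with heavier usage
# between 18:00 and 19:59. The values are made up for illustration.
start = datetime(2015, 3, 2)
readings = []
for i in range(96):
    ts = start + timedelta(minutes=15 * i)
    readings.append((ts, 0.5 if 18 <= ts.hour < 20 else 0.1))

total = billing_total(readings)
profile = hourly_profile(readings)
```

Two households with the same total can have very different profiles, and that difference is what the rest of this example depends on.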

By having the data in small time increments, we can ask and answer questions that were never possible before. Questions like which customers use unexpectedly large amounts of power on the weekend, have insufficient insulation, should change from electric to gas appliances, would benefit from a space heater, or should get up a little earlier on the weekend suddenly become approachable. These questions aren’t ANSWERABLE in the original meter reader data but they might be answerable with smart meter data.

Answering those questions, however, isn’t primarily a traditional model building exercise. Instead, it’s an exercise in pattern matching. I’ve written before that the fundamental difference between traditional data and big data is the introduction of sequence, time and pattern into the basic level of analysis and this is a perfect example. No individual meter read has any meaning. But by measuring the pattern of reads, it’s often possible to understand the underlying behaviors in considerable detail. For example, by looking at the pattern of gas and electric consumption around dinner hours on moderate temperature days, you could identify houses that use electric vs. gas appliances.
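The dinner-hour example could be sketched roughly as follows. This is an illustration under loud assumptions, not a production method: the 17:00-19:59 dinner window, the 1.5x threshold, and every function name are hypothetical, and a real version would first filter to moderate temperature days and validate against households whose appliances are known. The idea is simply to compare each fuel's dinner-hour usage against its usage the rest of the day and see which one spikes.

```python
from statistics import mean

# Assumed dinner window (17:00-19:59); a hypothetical choice for illustration.
DINNER_HOURS = range(17, 20)


def dinner_lift(usage_by_hour):
    """Ratio of mean dinner-hour usage to mean usage in all other hours.

    usage_by_hour maps hour of day (0-23) to average usage for that hour.
    """
    dinner = mean(usage_by_hour[h] for h in DINNER_HOURS)
    other = mean(v for h, v in usage_by_hour.items() if h not in DINNER_HOURS)
    return dinner / other if other > 0 else float("inf")


def classify_cooking_fuel(electric_by_hour, gas_by_hour, threshold=1.5):
    """Guess the cooking fuel from which meter shows the dinner-time spike.

    The threshold is an arbitrary placeholder, not a calibrated value.
    """
    electric_lift = dinner_lift(electric_by_hour)
    gas_lift = dinner_lift(gas_by_hour)
    if electric_lift >= threshold and electric_lift > gas_lift:
        return "electric"
    if gas_lift >= threshold and gas_lift > electric_lift:
        return "gas"
    return "unclear"


# Hypothetical household profiles: one flat, one with a dinner-time spike.
flat = {h: 1.0 for h in range(24)}
spiky = {h: (3.0 if h in DINNER_HOURS else 1.0) for h in range(24)}

label = classify_cooking_fuel(electric_by_hour=spiky, gas_by_hour=flat)
```

Note that no single reading decides anything here. The classification comes entirely from the shape of usage across the day, which is the point about pattern over individual data points.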

Not only is this type of analysis very different in technique from traditional modeling, there is strong reason for it to be done as a foundation for subsequent analysis. In digital, the two-tiered segmentation we recommend is the fundamental set of building blocks that analysts can then use to model other business problems. You wouldn’t want each analyst to make their own visit intent segmentation. You want your best pattern-analysts to identify the patterns, instantiate them over the data, and then provide them to the business analysts and modelers to use in subsequent reports and models.

It’s the same with smart meter data (or IoT data or any other type of big data). The pattern building exercise is distinct from the business reporting and analytics exercise, and it fits our intuitive understanding of what a data scientist is pretty darn well.

All sorts of people call themselves data scientists. Enterprises can call any type of analyst or group their data science group. It’s just semantics. But if you’re looking to understand where in the analytics value-chain your organization has a glaring and potentially fatal gap, my guess is that it’s right here in this foundational activity of creating the fundamental building blocks of analytics based on detail data that needs to be patterned.

Understanding this gap and defining it as THE role for data science makes it clear exactly what skills you need for that role (data journalism isn’t one of them) and what kind of person would suit. It also, I hope, helps you assess the relative value of this role and weigh it more reasonably against the other skills in your analytics group. People who can do this type of work are harder to find than people who can do basic ETL or reporting and maybe even harder to find than good modelers. But they aren’t all that exotic and they aren’t clearly more valuable than a good business analyst who can take numbers and apply them creatively to the business to build an optimization program. This is a case where knowing what you need can help you get what you want and, perhaps, even do so a little more affordably.

Republished with author's permission from original post.

Gary Angel
Gary is the CEO of Digital Mortar. DM is the leading platform for in-store customer journey analytics. It provides near real-time reporting and analysis of how stores performed including full in-store funnel analysis, segmented customer journey analysis, staff evaluation and optimization, and compliance reporting. Prior to founding Digital Mortar, Gary led Ernst & Young's Digital Analytics practice. His previous company, Semphonic, was acquired by EY in 2013.

