Data Scientists and the Practice of Data Science

Bob Hayes, PhD | Nov 17, 2015 940 views 3 Comments

I was recently involved in a couple of panel discussions on what it means to be a data scientist and to practice data science. These discussions/debates took place at IBM Insight in Las Vegas in late October. I attended the event as IBM’s guest. The panels, moderated by Brian Fanzo, included me and several other data experts.

I enjoyed our discussions and their take on the topic of data science. Our discussion was opened by the question “What is the role of a data scientist in the insight economy?” You can read each of our answers to this question on IBM’s Big Data Hub. While we come from different backgrounds, there was a common theme across our answers. We all think that data science is about finding insights in data to help make better decisions. I offered a more complete answer to that question in a prior post. Today, I want to share some more thoughts about other areas of the field of data science that we talked about in our discussions. The content below reflects my opinion.

What is a Data Scientist?

Figure 1. The three skills of data scientists

As more data professionals are now calling themselves data scientists, it’s important to clarify exactly what a data scientist is. One way to understand data scientists is to understand what kind of skills they bring to bear on analytics projects. It’s generally agreed that a successful data scientist is one who possesses skills across three areas: subject matter expertise in a particular field, programming/technology and statistics/math (see DJ Patil and Hilary Mason’s take, Drew Conway’s Data Science Venn Diagram (see Figure 1) and a review of many experts’ opinions on this topic).

AnalyticsWeek and I recently took an empirical approach to understanding the skills of data scientists by asking over 500 data professionals about their job roles and their proficiency across 25 data skills in five areas (i.e., business, technology, programming, math/modeling and statistics). A factor analysis of their proficiency ratings revealed three factors: business acumen, technology/programming skills and statistics/math knowledge.
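
To see why proficiency ratings across many skills can collapse into a few factors, consider a minimal sketch that computes the correlations between skill ratings, which are the raw material of a factor analysis. The ratings below are made-up values for six hypothetical respondents, not survey data, and the four skill names are illustrative assumptions. This is not the analysis used in the study; it only shows the kind of pattern a factor analysis would pick up.

```python
from math import sqrt

# Hypothetical proficiency ratings (0-100) for six made-up respondents.
# These numbers are illustrative only and are NOT from the survey.
ratings = {
    "statistics":  [80, 75, 30, 25, 60, 40],
    "modeling":    [85, 70, 35, 20, 65, 45],
    "programming": [40, 35, 90, 85, 50, 30],
    "databases":   [35, 30, 85, 90, 55, 25],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


skills = list(ratings)

# Correlation for every pair of skills, keyed by (skill, skill).
corr = {(s, t): pearson(ratings[s], ratings[t]) for s in skills for t in skills}

# Print the correlation matrix, one row per skill.
for s in skills:
    print(f"{s:12s}", " ".join(f"{corr[(s, t)]:6.2f}" for t in skills))
```

In this toy data, statistics and modeling ratings rise and fall together, as do programming and databases, while ratings across the two pairs are negatively correlated. A factor analysis formalizes that kind of grouping into a small number of underlying factors.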

Figure 2. Data professionals in different job roles are proficient in different data skills.

A data scientist who possesses expertise in all data skills is rare. In our survey, none of the respondents were experts in all five skill areas. Instead, our results identified four different types of data scientists, each with varying levels of proficiency in data skills; as expected, different data professionals possessed role-specific skills (see Figure 2). Business Management professionals were the most proficient in business skills. Developers were the most proficient in technology and programming skills. Researchers were most proficient in math/modeling and statistics. Creatives did not possess great proficiency in any one skill.

The Practice of Data Science: Getting Insights from Data

Gil Press offers a great summary of the field of data science. He traces the literary history of the term (it first appears in use in 1974) and settles on the idea that data science is a way of extracting insights from data using the powers of computer science and statistics, applied to data from a specific field of study.

Figure 3. Six Phases of the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology. Download the IBM SPSS Modeler CRISP-DM Guide here.

But how do you get insights from data? Bernard Marr offers his 5-step SMART approach to extract information. SMART stands for:

  • S = Start with Strategy
  • M = Measure Metrics and Data
  • A = Apply Analytics
  • R = Report Results
  • T = Transform your Business

Another approach is the 6-step CRISP-DM (Cross Industry Standard Process for Data Mining) method (see Figure 3). In a KDNuggets Poll in 2014, the CRISP-DM method was the most popular methodology (43%) used by data professionals for analytics, data mining, and data science projects.

These two approaches have a lot in common with each other, and both share a lot with a method that has been around for about 1,000 years: the scientific method (see Alhazen, a forerunner of the scientific method). The scientific method follows these general steps (see Figure 4):

Figure 4. The scientific method is a way to get insights from your data

  1. Formulate a question or problem statement
  2. Generate a hypothesis that is testable
  3. Gather/Generate data to understand the phenomenon in question. Data can be generated through experimentation; when we can’t conduct true experiments, data are obtained through observations and measurements.
  4. Analyze data to test the hypothesis and draw conclusions
  5. Communicate results to interested parties or take action (e.g., change processes) based on the conclusions. Additionally, the outcome of the scientific method can help us refine our hypotheses for further testing.
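
The steps above can be sketched with a small hypothesis test. The example below is a minimal illustration, assuming made-up satisfaction ratings collected before and after a hypothetical process change; the function name `permutation_test` and all of the data are assumptions, not part of any study described here. It uses a permutation test, which is only one of many ways to test a hypothesis.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean of b minus mean of a) and an
    estimated p-value under the null hypothesis of no difference.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_b) - statistics.fmean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reshuffle the pooled values and re-split them.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Step 1 (question): did the process change affect satisfaction ratings?
# Step 2 (hypothesis): mean ratings differ before and after the change.
# Step 3 (data): hypothetical ratings, invented for this illustration.
before = [6, 7, 5, 6, 7, 6, 5, 7, 6, 6]
after = [8, 7, 9, 8, 7, 9, 8, 8, 7, 9]

# Step 4 (analysis): test the hypothesis against the data.
observed, p_value = permutation_test(before, after)

# Step 5 (communication): report the result in plain terms.
print(f"Mean change: {observed:.2f} points (estimated p = {p_value:.4f})")
```

A small p-value says the observed difference would be unusual if the change had no effect. It does not say the change caused the difference; that conclusion depends on how the data were gathered in step 3.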

The value of data is measured by what you do with it. Whether you’re investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge, the scientific method is an effective way to systematically interrogate your data. Scientists may differ with respect to the variables they use and the problems they study (e.g., medicine, education and business), but they all use the scientific method to advance bodies of knowledge.

Data is, has been and forever will be at the heart of science. The scientific method necessarily involves the collection of empirical evidence, subject to specific principles of reasoning. That is the practice of science, a way of extracting knowledge from data. Data science is science.

The Democratization of Data Science

Taking a scientific approach to analyzing data is not only valuable to data workers; it is also valuable for people who consume, interpret and make decisions based on the analysis of those data. In business, data users need to think critically about sales reports, social media metrics and quarterly reports. Application vendors are marketing their tools and platforms as a way of making everybody a data scientist, giving end users (i.e., data users) advanced statistical and visualization capabilities to find insights (see Prelert’s take on this here, Tableau’s ideas here and Umbel’s call here).

I believe that the democratization of data science is not only a software problem but also an education problem. Companies need to provide their employees training on statistics and statistical concepts. This type of training gives the employees the ability to think critically about the data (e.g., data source, measurement properties and relevance of the metrics). The better the grasp of statistics employees have, the more insight/value/use they will get from the software they use to analyze/visualize that data.
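
As one small illustration of the kind of critical thinking statistical training supports, the sketch below shows that the same average can deserve very different levels of trust depending on how much data sits behind it. The scores are invented, the larger sample is simply the smaller one repeated (an artificial device to hold the mean fixed), and the 1.96 multiplier assumes a normal approximation, which is rough for small samples.

```python
import statistics
from math import sqrt


def mean_with_margin(values, z=1.96):
    """Return (mean, margin of error) using a normal approximation.

    With z = 1.96 the margin corresponds to an approximate 95% interval.
    """
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / sqrt(len(values))
    return mean, z * standard_error


# Hypothetical satisfaction scores; invented for this illustration.
small_sample = [7, 9, 5, 8, 6]
large_sample = small_sample * 8  # same mean, eight times the responses

for label, sample in [("n=5", small_sample), ("n=40", large_sample)]:
    mean, margin = mean_with_margin(sample)
    print(f"{label}: mean = {mean:.1f} +/- {margin:.2f}")
```

Both samples report the same mean, but the margin around the smaller one is much wider. A dashboard that shows only the mean hides that difference.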

Statistics is the language of data. Just as knowledge of your native language helps you maneuver in the world of words, statistics will help you maneuver in the world of data. As the world around us becomes more quantified, statistical skills will become more and more essential in our daily lives. If you want to make sense of our data-intensive world, you will need to understand statistics.

Conclusions and Final Thoughts

Businesses are relying on data professionals with unique skills to make sense of their data. These data professionals apply their skills to improve decision-making by humans or algorithms. To get from data to insights, data professionals can adopt a systematic approach that makes the best use of their skills. Following are some conclusions about data scientists and the practice of data science.

  • The practice of data science requires three skills: subject matter expertise, computing skills and statistical knowledge.
  • The general term, ‘data scientist,’ is ambiguous. Our research studied four different types of data scientists: Business Management, Developer, Creative and Researcher. Each role possessed different strengths.
  • Science is a way of thinking, a way of testing ideas using data. An effective practice of data science includes the scientific method. I think that the term, ‘data science,’ is redundant. It’s just science. Science requires the use of data, data to help you understand your business and how the world really works.
  • Offer employees training on statistics. Giving people analytics software and expecting them to excel at data science is like giving them a stethoscope and expecting them to excel at medicine. The better they understand the language of data, the more value they will get from the analytics software they use.

I’ll leave you with some thoughts on data science I shared with Nick Dimeo at IBM Insight.

I would love to hear your thoughts on data scientists and the practice of data science. What do those terms mean to you?

Republished with author's permission from original post.


3 Responses to Data Scientists and the Practice of Data Science

  1. Andreas Voniatis November 19, 2015 at 12:29 pm (1 comment) #

    To answer your question, in the pragmatic sense most data scientists are professionals who use R, SAS or other statistical software to explore data, where possible, build predictive models and hopefully automate them with machine learning. Many of these come from quantitative backgrounds and usually end up graduating from the data science stream of courses held on Coursera. Data science is pretty much as your Venn diagram put it, a combination of statistics, software engineering and domain knowledge. Most data scientists will be weakest on domain knowledge, so they will be heavily reliant on those with domain knowledge. Data scientists are getting a lot of press right now and soon most people will realise that data engineers who get the data in are the real heroes.

  2. Michael Lowenstein November 26, 2015 at 6:00 am (1310 comments) #

    One of the things I’ve found most beneficial in working with really proficient data scientists, or methodologists as they are sometimes called, is an added skill – creativity. For example, having applied results of a customer advocacy behavior framework, co-developed with a colleague, for several years and with many clients and many studies, it took a creative data scientist to see even more value in our technique. He thought to apply discriminant function analysis (which we labeled ‘swing voter’) to our segmentation results, which provided clients with granular, prioritized action opportunities. It made all the difference between having a method that was merely useful and one that was truly pathfinding.

  3. Graham Hill November 26, 2015 at 3:42 pm (992 comments) #

    Hi Bob

    Thanks for a very interesting article.

    I wonder whether the scientific method as you describe it is really enough to solve business’ trickiest problems. More to the point, I wonder whether data science is capable of creating new and novel business designs that have never been seen before.

    The scientific method as laid out successively by Bacon, Popper and Kuhn revolves around the interplay of induction and deduction. Practically speaking, in induction we observe patterns of frequent, repeated behaviour and we try to identify what the underlying reasons might be through observation and ultimately, through experimentation. This is perhaps the closest to the scientific method as practiced by traditional scientists. In deduction we use the body of known reasons to work out what will happen in a given set of conditions. This is closer to the scientific method as applied by medical doctors and related professions. In my experience good data scientists blend a mixture of the two (perhaps weighted towards induction) to explore data and to predict what might happen next.

    But data science is only as good as the data it uses. How useful would data science be if there was little or no useful data? Probably not all that useful. I suggest there may be a significant role for a third type of logic, abduction, where we observe patterns of behaviour and we try to identify the simplest and easiest explanation that might explain it. Solutions are not built through experimentation or through reasoning, but through an agile, iterative process of step-wise refinement that creates novel solutions that are simultaneously good enough and always getting better. This is the foundation of what has become known as design thinking (see the HBR article by Tim Brown on ‘Design Thinking’ for a primer), which has shown itself to be rather effective at creating novel solutions to messy business problems.

    I would be most interested in your thoughts on the roles of data science and design thinking in driving novel design.

    Graham Hill
    @grahamhill
