I’m all for helping educate the world about the power of data and analytics. I believe that the power of data science can help businesses and citizens understand how the world works. Data science capabilities can help business leaders leverage their vast amounts of data to get insights to improve how they manage their business and can help citizens make better decisions to improve their lives. So, whenever I’m asked to share my views on the topic of data science, I jump at the opportunity.
Data Science for All: It’s a Whole New Game
This week, I’ll be participating in a panel discussion on theCube in support of the IBM event, Data Science for All. For the panel discussion, I will be joined by Jennifer Shin, Dion Hinchcliffe, Joe McKendrick, Joe Caserta and Jen Underwood. After the panel, you can view the 75-minute Data Science for All broadcast, which is aimed at helping business leaders address the challenges of their data environment with the right strategy, plans and tools. The broadcast will be hosted by TV personality Katie Linendoll and will include talks from Rob Thomas and Daniel Hernandez (both from IBM Analytics), Nate Silver (founder of FiveThirtyEight), Michael Li (Founder and CEO of Data Incubator), Tricia Wang (Co-founder of Sudden Compass) and Nir Kaldero (Head of Data Science at Galvanize). Sign up to view this event here.
Demystifying Data Science
Last month, I spoke at Metis’ online event for aspiring data scientists, Demystifying Data Science. The one-day event, held in September, included 20+ speakers, including such luminaries as Kirk Borne, Deborah Berebichez, Carla Gentry, Aylee Nielsen and Chris Albon. I highly recommend that data professionals who want to get into the field of data science view these talks (you can register to view them here).
In my talk, “The Practice of Data Science,” I provided a high-level overview of what it means to practice data science by taking a look at the people, processes and tools that underlie the field of data science. The text of this talk appears below (the accompanying slides appear at the end of this post).
The Practice of Data Science
Generally speaking, data science is a way of extracting value and insights from data using the powers of computer science and statistics applied to a specific field of study. Data science professionals help humans make better decisions and build algorithms to optimize outcomes. Through the collection, analysis and interpretation of data, data science professionals extract empirically-based insights that augment and enhance how humans and algorithms work.
There are two reasons why data science is so important today: the explosion of data and the growth of computing power. We live in a Big Data world in which we are quantifying everything. From the number of steps we take in a day and the purchases we make to the tweets we post and the ads we click, we are swimming in data. Couple that data with today’s processing power of GPUs and high performance computing, and we are faced with a need to develop data professionals who can leverage both data and technology to uncover insights and knowledge about the world in which we live.
The number of data professionals who possess data science skills, however, has not kept pace with the explosion of data we are generating and accumulating. We have a huge data science skill gap. For example, while educators say that only 23% of all grads will have data science and analytics skills, 69% of employers say they prefer job candidates with these skills over candidates without them. In fact, IBM predicts demand for data scientists will increase 28% by 2020.
The remaining talk on data science is divided into three areas: The People, The Process and The Tools.
1. The People
First, let’s talk about the people, the data science professionals. I conducted a study a couple of years ago with AnalyticsWeek in which we surveyed over 500 data professionals to ask about the work they do. We wanted to understand their data science skills, job roles and more.
With respect to job roles, we uncovered four different roles that data pros hold (Researcher, Business Manager, Creative, Developer). The most popular job role was “Researcher” (defined as a statistician or scientist), followed by business managers (or domain experts), creatives and developers. We found that about half of data professionals hold only one job role.
Data Science Skill Domain
Now, I’m sure you’ve all seen this Venn diagram about the three broad skills behind data science. In our study, we asked data professionals to indicate their proficiency in 25 specific skills grouped into five skill areas, which map onto the three broad domains of Business (or domain knowledge), Math/Stats, and Technology/Programming.
25 Data Science Skills
Here is a chart that ranks the 25 data science skills by proficiency levels. Our respondents indicated that they are most proficient in communication, managing structured data and data mining and viz tools. I was surprised to see that many of the data professionals lacked proficiency in some data science skills like big and distributed data and cloud management.
Skill Proficiency by Role
Here we see the proficiency in skills by job roles. We see that not all data professionals are created equal. Data professionals who call themselves researchers are stronger in math and stats. Data pros who call themselves developers are stronger in technology and programming. Business managers (or domain experts) are stronger in business-related knowledge.
This graph illustrates why I’m not a big fan of the term “data scientist”: the term is vague. I have a twin brother who also has the job title of “data scientist.” Yet he and I have entirely different skill sets when it comes to the data science process. He has a degree in computer science and has technology and programming skills. I have a degree in psychology and have math and stats skills.
Data Science Unicorn
Finding a data professional who is proficient in all data science skill areas is extremely difficult. In our study, we looked at the number of skill areas in which data pros have at least an advanced level of proficiency. As our study shows, data professionals rarely possess proficiency in all five skill areas at the level needed to be successful at work. The chance of finding a data professional with Advanced skills in all five areas (or even in 3 or 4 skill areas) is low. Finding data pros with Expert skills in all five skill areas is akin to finding a unicorn; they just don’t exist.
2. Process of Data Science
The value of data is measured by what you do with it. Data science is about extracting valuable insights from data. As part of their analytics, data mining or data science workflows, data professionals use structured methods to get at those insights.
CRISP-DM, SEMMA and KDD
A survey by KDNuggets showed that the top methods used by data professionals to extract value from data included CRISP-DM, followed by SEMMA and KDD. Each of these methods describes a workflow process that includes steps related to data selection, preparation, modeling and model deployment. The primary difference among these three methods is that CRISP-DM includes a step regarding the need to have a solid business understanding to help guide the subsequent steps. These models are similar to a more general model of discovery, the scientific method.
Scientists have been getting insight from data for centuries using the scientific method. Formally defined, the scientific method is a body of techniques for objectively investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge. The scientific method follows these general steps:
- Formulate a question or problem statement
- Generate a hypothesis that is testable
- Gather/Generate data to understand the phenomenon in question. Data can be generated through experimentation; when we can’t conduct true experiments, data are obtained through observations and measurements.
- Analyze data to test the hypotheses / Draw conclusions
- Communicate results to interested parties or take action (e.g., change processes) based on the conclusions. Additionally, the outcome of the scientific method can help us refine our hypotheses for further testing.
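To make the steps above concrete, here is a minimal sketch of one pass through them in Python. It is only an illustration, not the method used in the talk: the question (does a new process shorten task time?), the two small samples and the function name `permutation_test` are all made up for the example, and a permutation test is just one of many ways to test a hypothesis about a difference in means.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the observed difference and the share of random relabelings
    whose difference is at least as large in absolute value.
    """
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)              # pretend group labels are arbitrary
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations

# Steps 1-2: question and testable hypothesis
#   "Task times differ between the old and the new process."
# Step 3: gather data (hypothetical measurements, in minutes)
old_process = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
new_process = [11.2, 11.5, 11.0, 11.4, 11.6, 11.1]

# Step 4: analyze the data to test the hypothesis
observed_diff, p_value = permutation_test(old_process, new_process)

# Step 5: communicate the result
print(f"Observed difference in means: {observed_diff:.2f} minutes")
print(f"Permutation p-value: {p_value:.4f}")
```

A small p-value here would only say that the difference is unlikely under random relabeling; deciding what to do about it, and whether the data were gathered in a way that supports a causal reading, still depends on the other steps.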
I think that the term “data science” is redundant. All science uses data. That is how scientists test their ideas. They collect data and see if their hunches are supported. Data is, has been and forever will be at the heart of science. The scientific method necessarily involves the collection of empirical evidence, subject to specific principles of reasoning. That is the practice of science, a way of extracting knowledge from data. Data science is science.
As Carl Sagan said, “Science is a way of thinking much more than it is a body of knowledge.” The scientific method is a way to help us understand how the world really works. To be of real, long-term value to business, analytics needs to be about understanding the causal links among the variables you’re studying. Through trial and error, the scientific method helps shed light on identifying the reasons why variables are related to each other and the underlying processes that drive the observed relationships.
Iterative Process of Discovery
While I stress the scientific method as a way of extracting insights, it’s important to note that there are generally two types of inference: deductive and inductive. Deductive reasoning moves from a general premise to a more specific conclusion (what you predict you’ll see in your data). Inductive reasoning moves from data to a general conclusion.
In science, there is an interplay between inductive inference (based on observations) and deductive inference (based on theory), until we get closer and closer to the ‘truth.’
Scientific Method and Data Science Skills
When I map the three data science skills against the five steps of the scientific method, it’s clear why data science skills are so important in extracting insight from data. As you can see, proficiency in each of the three data science skills is required to successfully implement the scientific method as a way to get insights from data. Business knowledge is necessary to help formulate the right questions, generate hypotheses, gather data and communicate results. Technology/Programming skills are needed to gather/generate data and analyze data/test hypotheses. Finally, Statistics/Math skills are necessary to gather data, analyze data/test hypotheses and communicate results.
If we cross the five steps of the scientific method with the three data science skill domains, we see how different data professionals can contribute to data science projects. We need the domain expert to help formulate questions, generate hypotheses and communicate results. We need the developer to help us get access to the data we need. Finally, we need the researcher to help generate/gather data, analyze the data and communicate the results.
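The crossing of skills and steps described above can also be written down as a small lookup table. The sketch below encodes the skill-to-step mapping exactly as stated in the text; the step labels and the names `SKILL_STEPS` and `skills_needed` are my own shorthand for illustration, not part of the study.

```python
# The five steps of the scientific method, in order (labels are shorthand).
STEPS = [
    "formulate question",
    "generate hypothesis",
    "gather data",
    "analyze data",
    "communicate results",
]

# Which steps each skill domain supports, as described in the text.
SKILL_STEPS = {
    "Business": {
        "formulate question",
        "generate hypothesis",
        "gather data",
        "communicate results",
    },
    "Technology/Programming": {"gather data", "analyze data"},
    "Statistics/Math": {"gather data", "analyze data", "communicate results"},
}

def skills_needed(step):
    """Return the skill domains that contribute to a given step, sorted by name."""
    return sorted(skill for skill, steps in SKILL_STEPS.items() if step in steps)

# Invert the table: for every step, list the skills it draws on.
coverage = {step: skills_needed(step) for step in STEPS}

for step in STEPS:
    print(f"{step}: {', '.join(coverage[step])}")
```

Read this way, gathering data is the one step that draws on all three skill domains, which is part of why a project rarely succeeds with only one type of data professional.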
3. Top Data Science Ecosystem
Data professionals rely on tools and platforms to help them extract insights from their data. While tools are useful for one specific activity along the data science life cycle (e.g., data viz, data analysis, data integration), data science platforms provide a central hub in which all data science activities can be accomplished.
Data Science Tools
Data professionals use different tools to analyze their data. A couple of recent surveys from KDNuggets and Rexer Analytics found that the top data science tools used by data pros include R, Python, SQL, IBM SPSS and SAS. Here is a comprehensive review of different data science tools.
Data Science Platforms
Data science platforms help diverse data professionals work together in their data science projects, from integrating, exploring and analyzing data to building and deploying data models to enhance applications or augment human judgment. Gartner and Forrester reviewed different data science platforms along a few dimensions and found that there are a handful of vendors who are considered leaders in this space. These include IBM, SAS, RapidMiner and KNIME. Here is a good review of data science platforms.
I included additional slides on the practice of data science that are related to important skills, the role of formal education and gender diversity. For those of you who are interested in hearing my comments, please register to view the talk here.