Data Science Skills Needed in a Big Data World

Last month, I wrote about the three skills needed to practice data science. A factor analysis of many different skills showed that data science skills fall into three broad skill areas: 1) subject matter expertise, 2) technology/programming and 3) statistics/math. Data science is essentially a way to extract insight from data using these three skills.

Within each of these skill areas, there are specific skills on which data professionals rely to get at that insight. You might think of these skills as different tools in a data science utility belt. Let’s take a closer look at each of these skill areas.

Subject Matter Expertise

Data science doesn’t happen in a vacuum. Your ability to use technology and statistics to get insight always starts with interest in a specific domain. You can think of subject matter expertise as a body of knowledge that you need to possess to tackle a domain-specific problem. In the business world, you’ll need expertise regarding how businesses run. If you’re tackling a problem in education, you’ll likely need expertise in the education arena. If you’re trying to cure cancer using Big Data technology, you will definitely need to possess expertise in oncology; knowledge in oncology will help you collect the right data and understand what that data is telling you.

Technology/Programming

Another important data science skill area is that of technology/programming. Skills in this area are necessary in order to get at the data you need. The list of specific skills includes the following:

  • Back-End Programming (e.g., Java/Rails/Objective-C)
  • Database Administration (e.g., MySQL, NoSQL)
  • Systems Administration (e.g., UNIX) and Design
  • Cloud Management
  • Front-End Programming (e.g., JavaScript, HTML, CSS)
  • Big and Distributed Data (e.g., Hadoop, MapReduce, Spark)
  • Managing structured data (e.g., SQL, JSON, XML)
  • Managing unstructured data (e.g., NoSQL)
  • Natural Language Processing (NLP) and text mining
  • Data Management (e.g., recoding, de-duplicating, integrating disparate data sources, web scraping)

Statistics/Math

The final set of skills is related to statistics and math. You need these skills to extract insights from whatever data set you are using. These skills include:

  • Statistics and statistical modeling (e.g., general linear model, ANOVA, MANOVA, Spatio-temporal, Geographical Information System (GIS))
  • Bayesian Statistics (e.g., Markov Chain Monte Carlo)
  • Science/Scientific Method (e.g., experimental design, research design)
  • Data Mining (e.g., R, Python, SPSS, SAS) and Visualization (e.g., graphics, mapping, web-based data visualization) tools
  • Optimization (e.g., linear, integer, convex, global)
  • Math (e.g., linear algebra, real analysis, calculus)
  • Machine Learning (e.g., decision trees, neural nets, Support Vector Machine, clustering)
  • Algorithms (e.g., computational complexity, Computer Science theory) and Simulations (e.g., discrete, agent-based, continuous)
  • Graphical Models (e.g., social networks)
  • Communication (e.g., sharing results, writing/publishing, presentations, blogging)

Focus on One Data Science Skill Area

We found that skills within each of the skill areas are highly related to each other. A factor analysis showed that the specific skills within a given skill area (e.g., math/stats) are closely associated with one another; if you’re good at statistics and statistical modeling, you’re likely good at algorithms and simulations.

Figure 1. Descriptive Statistics of and Correlations Among Data Science Skills

There is little overlap, however, among the three broad skill areas. We found that proficiency in one skill area is weakly related to proficiency in the other skill areas (see Figure 1). That is, possessing proficiency in one area does not guarantee proficiency in the other two areas. For example, the correlation between business and math/statistics proficiency is only .27, meaning proficiency in those two content domains has very little in common (they share only 7% of their variance with each other). The two skill areas that have the greatest overlap are technology/programming and math/statistics (r = .57).
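The "shared variance" figure quoted above is simply the square of the correlation coefficient. As a minimal sketch of that arithmetic, the snippet below squares the two correlations reported in the text. The function name `shared_variance` and the dictionary labels are illustrative choices, not part of the original analysis.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r ** 2


# The two correlations quoted in the text (see Figure 1).
reported = {
    "business vs. math/statistics": 0.27,
    "technology/programming vs. math/statistics": 0.57,
}

for pair, r in reported.items():
    print(f"{pair}: r = {r:.2f}, shared variance = {shared_variance(r):.0%}")
```

Squaring .27 gives about 7%, matching the figure in the text. By the same arithmetic, even the strongest pairing (r = .57) shares only about 32% of its variance, so most of the variation in one skill area is not accounted for by the other.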

If you want to get into the field of data science, the current results suggest that you are better off identifying where you are most proficient and focusing on expanding your knowledge within that skill area. If you are good at statistics, you might consider becoming an expert in related skills. If you are good at programming and technology (or, at least, have an interest in those areas), you might consider specializing in skills related to computer science. The data scientist who is an expert in every skill area is non-existent (we didn’t find one). It’s better to think of data science as a team sport. Successful data science projects require different types of data science experts with complementary skills to work together toward a common goal. Become an expert in one area and work with experts in other areas.

1 COMMENT

  1. Bob –

    This is terrific insight into the rapidly changing world of customer analytics in particular, and business analytics in general.

    Even though, as you note, your analysis found that “proficiency in one skill area is weakly related to proficiency in the other skill areas”, from my perspective, that is evolving. Let’s say, for example, that someone has strong proficiency in financial services customer experience analysis. The individual would certainly need a background, and have interest, in financial services marketing and customer behavior. The person would need to have enough comfort with technology and programming that he/she could communicate with, and understand, what is being provided by tech experts here. Finally, this individual would need deep skills in stat and math, particularly in data mining, modeling, and predictive analytics to leverage the data for insights.

    My analogy is the job of a winemaster (not just a winemaker) at a winery. This individual must know vineyard management for different varieties (though he or she doesn’t actually plant the vines and grow the grapes), proper harvesting methods, and all of the highly detailed, scientific mechanics of consistently producing a superior wine. The winemaster is a subject matter expert, is sensitive to the technology associated with making wine, and is, as well, a scientist and mathematician.
