Last month, I wrote about the three skills needed to practice data science. A factor analysis of many different skills showed that data science skills fall into three broad skill areas: 1) subject matter expertise, 2) technology/programming and 3) statistics/math. Data science is essentially a way to extract insight from data using these three skills.
Within each of these skill areas, there are specific skills on which data professionals rely to get at that insight. You might think of these skills as different tools in a data science utility belt. Let’s take a closer look at each of these skill areas.
Subject Matter Expertise
Data science doesn’t happen in a vacuum. Your ability to use technology and statistics to get insight always starts with interest in a specific domain. You can think of subject matter expertise as a body of knowledge that you need to possess to tackle a domain-specific problem. In the business world, you’ll need expertise regarding how businesses run. If you’re tackling a problem in education, you’ll likely need expertise in the education arena. If you’re trying to cure cancer using Big Data technology, you will definitely need to possess expertise in oncology; knowledge in oncology will help you collect the right data and understand what that data is telling you.
Another important data science skill area is technology/programming. Skills in this area are necessary to get at the data you need. The specific skills include the following:
- Back-End Programming (e.g., Java/Rails/Objective-C)
- Database Administration (e.g., MySQL, NoSQL)
- Systems Administration (e.g., UNIX) and Design
- Cloud Management
- Big and Distributed Data (e.g., Hadoop, MapReduce, Spark)
- Managing structured data (e.g., SQL, JSON, XML)
- Managing unstructured data (e.g., NoSQL)
- Natural Language Processing (NLP) and text mining
- Data Management (e.g., recoding, de-duplicating, integrating disparate data sources, web scraping)
The final set of skills relates to statistics and math. These skills are necessary to extract insights from whatever data set you are using. They include:
- Statistics and statistical modeling (e.g., general linear model, ANOVA, MANOVA, spatio-temporal, Geographical Information System (GIS))
- Bayesian Statistics (e.g., Markov Chain Monte Carlo)
- Science/Scientific Method (e.g., experimental design, research design)
- Data Mining (e.g., R, Python, SPSS, SAS) and Visualization (e.g., graphics, mapping, web-based data visualization) tools
- Optimization (e.g., linear, integer, convex, global)
- Math (e.g., linear algebra, real analysis, calculus)
- Machine Learning (e.g., decision trees, neural nets, Support Vector Machine, clustering)
- Algorithms (e.g., computational complexity, Computer Science theory) and Simulations (e.g., discrete, agent-based, continuous)
- Graphical Models (e.g., social networks)
- Communication (e.g., sharing results, writing/publishing, presentations, blogging)
Focus on One Data Science Skill Area
We found that skills within each skill area are highly related to each other. A factor analysis showed that specific skills within a given skill area (e.g., math/stats) are closely associated; if you’re good at statistics and statistical modeling, you’re likely good at algorithms and simulations.
There is little overlap, however, among the three broad skill areas. We found that proficiency in one skill area is only weakly related to proficiency in the other skill areas (see Figure 1). That is, possessing proficiency in one area does not guarantee proficiency in the other two areas. For example, the correlation between business and math/statistics proficiency is only .27, meaning proficiency in one of those domains has very little in common with proficiency in the other (squaring the correlation shows they share only about 7% of their variance). The two skill areas with the greatest overlap are technology/programming and math/statistics (r = .57).
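The link between a correlation and shared variance is simply the square of r. Here is a minimal sketch of that arithmetic in Python, using only the two correlations quoted above; the helper name `shared_variance` is my own, chosen for illustration, and is not part of the original analysis.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# The two correlations quoted in the text (no other values are assumed).
correlations = {
    "business vs. math/statistics": 0.27,
    "technology/programming vs. math/statistics": 0.57,
}

for pair, r in correlations.items():
    # Squaring r gives the share of variance the two skill areas have in common.
    print(f"{pair}: r = {r:.2f}, shared variance = {shared_variance(r):.0%}")
```

Even the strongest pairing, technology/programming with math/statistics, shares only about a third of its variance, which is why the skill areas are best treated as distinct.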
If you want to get into the field of data science, the current results suggest that you are better off identifying where you are most proficient and focusing on expanding your knowledge within that skill area. If you are good at statistics, you might consider becoming an expert in related skills. If you are good at programming and technology (or, at least, have an interest in those areas), you might consider specializing in skills related to computer science. The data scientist who is an expert in every skill area is non-existent (we didn’t find one). It’s better to think of data science as a team sport. Successful data science projects require different types of data science experts with complementary skills to work together toward a common goal. Become an expert in one area and work with experts in other areas.