In our study of data scientists, we found that only about a third of them possessed the skills needed to handle big and distributed data. These results are in line with other studies, which find that data scientists typically analyze small data sets.
We examined the proficiency of data scientists across 25 different data science skills. In this Big Data world, we expected to see many data scientists who possessed skills related to handling big and distributed data. In fact, Big and Distributed Data ranked 24th out of the 25 data science skills we studied. Only about 31% of data scientists in our study reported having at least intermediate proficiency in Big and Distributed Data (see Figure 1). The remaining 69% of respondents indicated that they lacked the skills to work with Big and Distributed Data (that is, they would need assistance).
Size of Data Sets
It turns out that, even in the era of Big Data, data scientists tend to analyze relatively small data sets. In a 2013 report by O’Reilly, researchers found that data scientists rarely work with data sets that are larger than gigabyte-sized (see Figure 2). The skills of the data scientists were related to the size of the data sets they typically used: those with Big Data skills worked more often with larger data sets. Even so, they still frequently worked with data sets that were gigabyte-sized or smaller. Very few data scientists actually worked with extremely large data sets.
A more recent survey of data scientists by KDnuggets (August 2015) found similar results: data scientists reported that they analyze data sets that are relatively small (see Figure 3). Specifically, only a few data scientists say they have analyzed extremely large data sets, those in the terabyte (~19%) or petabyte (~5%) range. Most data scientists (~56%) reported that they tend to analyze data sets in the gigabyte range. The remaining data scientists (~22%) analyze megabyte-sized data sets. The size of the data sets that these data scientists analyze has remained fairly constant since 2013.
Finally, organizations themselves may be limiting the need for Big Data skills. A late-2015 study by Dresner Advisory Services found that only 17% of organizations actively use Big Data today (see Figure 4). Nearly half (47%) of organizations say they may use Big Data in the future, and 36% said they have no plans for Big Data.
Even in this age of Big Data, most data scientists tend to analyze data sets that can easily fit on a laptop’s hard drive. Not surprisingly, because these data can be easily analyzed within that environment, most data scientists do not need skills in Big and Distributed Data (e.g., Hadoop and MapReduce). The size of the data sets that data scientists analyze has remained fairly constant, suggesting that Big and Distributed Data might not be a data scientist’s first priority when learning new skills.