KDNuggets.com recently released the results of its annual data mining software usage survey, and the top 4 analytic software packages used are open source. The top 13 are shown below (I don’t consider Excel a viable data mining solution). Full survey results may be found here.
The top two open source solutions, which complement each other well, are R and RapidMiner. R is a powerful analytic solution that continues to gain adoption in the commercial analytic community, but it requires programming skills and has a steep learning curve, while RapidMiner is an intuitive, point-and-click, GUI-based analytic solution. Other nice open source analytic solutions are available, such as KNIME, WEKA and Orange. Orange has been coming on strong of late and shows great promise, especially considering it is a Python-based solution, which allows for easier integration and flexibility. As Orange continues to build out its analytic functionality, I expect user adoption to grow.
Big data integration into distributed data environments, such as Hadoop, is where the open source (community) solutions lead the charge. Not only can you integrate R and RapidMiner with Hadoop, you can also ‘push’ some of the analytic processing directly to Hadoop through solutions such as RHadoop and RHive (R packages) and Radoop (RapidMiner add-on). Similar solutions are also being created for the big data environments Cassandra and MongoDB.
Are commercial analytic solutions like SAS and SPSS being phased out? Absolutely not. First, it is good to have a variety of solutions available, and even if the survey above shows that the commercial tools may be slightly losing market share, the analytic user community as a whole is growing at a great clip. Also, some of the commercial analytic tools have deep user bases, and certain segments of those users will not defect to open source solutions alone any time soon, for a variety of reasons. However, all commercial analytic solutions need to evolve more quickly as the open source ‘community’ continues to press forward at a rapid pace.
I have heard viewpoints pro and con for both the open source and the commercial analytic solutions. Regardless of your allegiance, you will most likely agree that the competition is healthy. I’m not sure which solutions will top the list in 5 years, but I do know that those that do not evolve quickly will not be in the top 10. And big data analytics integration will be one of the primary evolutionary needs in the upcoming years.