Talk of federal budget cuts has not affected the big data evolution. In early 2012 the Obama administration announced its big data initiative, with plans to spend hundreds of millions of dollars on big data R&D. As part of that plan, the Department of Defense created the XDATA program, a $100 million effort focused on accelerating big data and big data insight technologies.
The XDATA program has started to award development projects, and one to keep your eye on is Blaze. Before going into details on what the Blaze project is and what it plans to accomplish, let’s look at the typical big data analytics stack, which consists of three primary components:
- Database or distributed data environment: This is where the big data is stored and processed. Great open source and vendor-side solutions exist, including Teradata, Hadoop, Netezza & MongoDB.
- Analytic Data Processing Environment: In most environments, a separate analytic data environment is needed. Data may be aggregated in the database environment to reduce its size, but more advanced analytic transformations are then required on that reduced set of data, which is still large. This calls for solutions that can transform and scrub large data quickly using advanced analytic techniques. Processing large data with analytic techniques, on a server or a cluster of servers, is where vendor-supplied solutions such as SAS and Revolution R Enterprise excel, but no open source solution is at that level yet in terms of ease of use.
- The Analytic Solution: Plenty of good open source and vendor-side analytic solutions exist to build the analytics / predictive models. Those include R, SAS, Revolution Analytics, & RapidMiner (many more are available).
Now, if you prefer to go the open source route, as the XDATA program does, there is one glaring omission in the big data analytics stack: the analytic data processing environment (the second component above). Enter Blaze, a $3 million open source development project. The Blaze project will extend Python’s robust analytic and data libraries and make some of them big data friendly.
Python currently processes almost everything in RAM and is inefficient at best when processing big data. Some of Python’s libraries, such as PyTables, show nice strides in this area, but the promise of Blaze is to make analytic processing of data larger than available RAM (called out-of-core processing) blazing fast while, just as importantly, making it easy for analysts to use.
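To make the idea of out-of-core processing concrete, here is a minimal sketch that uses only Python's standard library. It is not Blaze's API or method; the function name `chunked_mean`, the chunk size, and the demo file are all assumptions made up for illustration. It computes the mean of a binary file of 64-bit floats while holding only one chunk in memory at a time.

```python
import array
import os
import tempfile


def chunked_mean(path, chunk_items=1024):
    """Mean of float64 values in a binary file, read one chunk at a time.

    Hypothetical helper for illustration only; not part of Blaze or PyTables.
    """
    total = 0.0
    count = 0
    item_size = array.array("d").itemsize  # bytes per stored value
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * item_size)
            if not raw:
                break  # end of file
            chunk = array.array("d")
            chunk.frombytes(raw)
            # Only the running total and count persist between chunks.
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else float("nan")


# Build a small demo file; in practice the file would be larger than RAM.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    array.array("d", (float(i) for i in range(10_000))).tofile(tmp)
    demo_path = tmp.name

demo_mean = chunked_mean(demo_path, chunk_items=1000)
os.remove(demo_path)
print(demo_mean)
```

Memory use here depends on the chunk size rather than the file size, which is the essence of out-of-core processing. The hard part, and the part a project like Blaze aims to take off the analyst's hands, is doing this for far richer operations than a mean without hand-writing the chunking logic each time.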
I’ll conclude with a few additional reasons I am excited about Blaze:
- Blaze will continue to put pressure on analytic vendors to evolve quickly, so even if you prefer to go the vendor route you should still feel the evolutionary benefits.
- Blaze is Python-based and Python integrates very well with both MongoDB (an open source distributed data environment) and R (an open source analytic solution). The solution stack of MongoDB + Python + R is powerful now (without Blaze) and will be enhanced with Blaze.
- Hopefully this project will continue to shed light on all the wonderful open source big data analytics work that others have sacrificed to build without funding from the federal government, work that has advanced all of our capabilities.