Of Pigs, Elephants & Hives – Social CRM and Big Data

May 31, 2011

After a long gap, I am making a comeback into pure-play technobabble. But I promise I will try to keep it as simple as possible.

In a post titled “Social CRM – ETL of Social & CRM data?” I had posited over two years ago that semantic analysis, sentiment analysis and other NLP techniques, along with data mapping, would all be required to varying degrees for data integration in a Social CRM system. What I missed mentioning back then was what is now being hyped as Big Data.

From an IT perspective, Social CRM needs to deal with lots of data (petabytes), because the data comes in from social media, online community platforms and various new enterprise systems (especially given the rise of Enterprise 2.0, social collaboration, activity streams and what not, where employees are creating a lot of data too). And let us not forget traditional data. As my friend Esteban Kolsky likes to remind me every time, Sears had a terabyte of customer data a decade back and apparently did not know what to do with it.

Another aspect I failed to state back then is that data is typically collected, prepared and then finally presented, and that ETL is just the preparation part of it.
  • Data in social media can be collected in various ways – RSS feeds, APIs (Twitter & Facebook firehoses) or just plain scraping of the various sites by crawling & spidering. Radian6 and Attensity360 do a great job of data collection. Both tools provide realtime capabilities for response/action; however, very little is analysed at this stage in these tools. So when it comes to responding in realtime, a lot depends upon the ability of the end users of these tools.
  • Data preparation or ETL typically deals with bringing in the raw data and loading, cleaning and conforming it to the selected data model, then joining it with other data sources and producing data sets ready for data users to consume. Data preparation is not so realtime because of the huge amounts of data involved, but in-memory analytics seems to be getting hotter by the day. Recently SAP integrated its HANA (High Performance Analytic Appliance) with IBM’s DB2 database, increasing its alignment against rival Oracle’s Exadata platform.
  • Data presentation is typically taken care of by data warehouses in IT architecture. Here the data is presented to the consuming applications like CRM/BPM systems or BI reports. BI reports can be realtime or not, depending upon the amount of analysis that takes place. CRM/BPM systems route the massaged data into the various business processes, where it is acted upon by pre-configured rules or by humans manning these systems (ACM, Social BPM can be leveraged here).
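To make the three stages concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the record fields, the `crm_ids` lookup and the function names are hypothetical, and the hard-coded records stand in for whatever a real collection tool would deliver.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """Hypothetical target data model for a cleaned social mention."""
    customer_id: str
    source: str
    text: str


def collect():
    """Collection stage stand-in: a real system would pull from feeds or APIs."""
    return [
        {"user": " Alice ", "source": "twitter", "text": "Love the new phone!  "},
        {"user": "bob", "source": "facebook", "text": ""},
        {"user": "ALICE", "source": "forum", "text": "Battery died after a week"},
    ]


def prepare(raw_records, crm_ids):
    """Preparation (ETL) stage: clean, conform to the model, join with CRM data."""
    prepared = []
    for record in raw_records:
        text = record["text"].strip()
        if not text:
            continue  # cleaning: drop empty posts
        handle = record["user"].strip().lower()  # conforming: normalise handles
        customer_id = crm_ids.get(handle)  # joining: look up the CRM customer
        if customer_id is None:
            continue  # no matching customer record in this toy lookup
        prepared.append(Mention(customer_id, record["source"], text))
    return prepared


def present(mentions):
    """Presentation stage: group mentions per customer for a report or CRM view."""
    summary = {}
    for mention in mentions:
        summary.setdefault(mention.customer_id, []).append(mention.text)
    return summary


crm_ids = {"alice": "C-001"}  # hypothetical mapping from social handle to CRM id
report = present(prepare(collect(), crm_ids))
print(report)
```

The point of the sketch is only the separation of concerns: collection knows nothing about the CRM, preparation is where the cleaning and joining happen, and presentation simply reshapes what preparation produced.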
So what do all these things that I forgot in my long lost post have to do with the title of this post? It has to do with open source actually (yes, I had to bring it in). :)
Apache Hadoop is a well known (in IT, especially distributed & cloud computing, and open source circles) software framework, written in Java, that supports data-intensive distributed applications. Hadoop is named after the toy elephant of the creator’s (Doug Cutting’s) son. Apache Pig is a platform for analyzing large datasets. And Apache Hive is a data warehouse infrastructure built on top of Hadoop.
And here is a project that uses Hadoop, Pig and NLP for mining Wikipedia: http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html. Very geeky and cool, if you share my likes. Now to see if this can be leveraged for social CRM by someone. :)
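For readers who have not met it, the programming model at the core of Hadoop is MapReduce. The following is a rough single-machine sketch of that idea in plain Python, not Hadoop's actual API: the function names and the sample posts are made up for illustration, and a real cluster would spread the map and reduce work across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up social posts standing in for a much larger dataset
counts = word_count(["great service", "great phone", "slow service"])
print(counts)
```

Tools like Pig and Hive exist largely so that this kind of grouping and counting can be written as a short script or query instead of hand-written map and reduce functions.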
Republished with author's permission from original post.