Knowing the software tools and analytic skills that have emerged to handle massive data volumes will help you navigate today’s technological landscape.
In only a few years, the scale of Big Data has grown from mere gigabytes to zettabytes. But what is behind the hype, and what characteristics turn regular old data into “Big Data”? What type of data is Big Data? And, given these characteristics, how is useful information extracted from this tremendous growth in data?
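To put the jump from gigabytes to zettabytes in perspective, here is a small arithmetic sketch. It assumes the decimal (SI) definitions of the units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21

# How many gigabytes fit into a single zettabyte?
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE

print(f"1 ZB = {gigabytes_per_zettabyte:,} GB")
```

In other words, one zettabyte is a trillion gigabytes.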
Data (of any size) has characteristics:
Type of data – Structured, Numeric, Text, Unstructured, Pictures, Audio, Video
Size of data – Small, Medium, Big, Very Big, Humongous
Data state – Data in motion, Data at rest
The 5 Vs of Big Data:
The characteristics of Big Data are succinctly described using these concepts:
Volume: simply moving data files of tens of gigabytes or more requires non-traditional methods.
Velocity: data streams are enormous, so network and processing speed are critical.
Variety: there is no definite structure; data can be anything from audio and video to unstructured text.
Veracity: if we hope to learn something from the data, it had better be right; remember: “garbage in, garbage out”.
Value: captures whether the data actually increases information content and therefore supports the correlations inferred downstream.
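The Volume and Velocity points can be illustrated with a minimal sketch of one common approach: reading data in fixed-size chunks instead of loading a whole file into memory. The text does not say which "non-traditional methods" it has in mind, so this is only an assumed example; the function names `read_in_chunks` and `count_lines` are hypothetical, and a tiny in-memory buffer stands in for a large file.

```python
import io
from typing import BinaryIO, Iterator


def read_in_chunks(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks so the whole input never sits in memory at once."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def count_lines(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Count newline bytes by streaming over the input chunk by chunk."""
    return sum(chunk.count(b"\n") for chunk in read_in_chunks(stream, chunk_size))


# A tiny in-memory buffer stands in for a file of many gigabytes.
sample = io.BytesIO(b"a,1\nb,2\nc,3\n")
line_count = count_lines(sample, chunk_size=4)
print(line_count)
```

The same pattern works with a real file opened in binary mode; memory use is bounded by `chunk_size` rather than by the size of the file.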
Big Data and the emerging Science of Data
Dealing with all these Vs of Big Data involves a wide mix of Technology Skills, summarized as:
Data Analytics, Warehousing, and Database engineering
Programming languages
Statistics, machine learning modeling, and algorithm testing/tuning
Data visualization
Those possessing this broad skill set are called Data Scientists. This new breed of scientist is discovering deeply hidden relationships in the constant streams of data produced every day.
Are you confused about which tools and platforms to use in order to get started as a Data Scientist or a Big Data Engineer?
Don’t worry: the simple infographic summarized below outlines the skills, tools, and platforms you need to know about for Big Data and Data Science.
Infographic brought to you by Digital Vidya
Data Scientist:
Technology skills:
o Analytics tools such as advanced Excel, and/or
o Data Warehousing and SQL for querying and filtering data, and/or
o Programming skills using R, Python, or SAS
o Statistics Knowledge
o Statistical modelling and Machine Learning
o Tuning and testing machine learning algorithms
o Visualization using programming libraries
o Visualization using Business Intelligence (BI) tools
Domain skills:
o Data Science is about discovery and building information: knowing where, how, and what to look for in the data for the given domain
o Skills to create motivating questions about the domain, and build hypotheses
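The SQL querying and filtering skill listed under the technology skills above can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns, and the sample rows are invented purely for illustration; a real data warehouse would use its own schema and database engine.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5), ("east", 200.0)],
)

# Filter rows with WHERE, then total the remaining amounts per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE amount > ? GROUP BY region ORDER BY region",
    (50.0,),
).fetchall()
conn.close()

print(rows)
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid building queries by string concatenation.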
Big Data – Data Engineer:
o Foundation Technology Platform: Apache Hadoop, HDFS, MapReduce
o Databases: HBase or Cassandra or MongoDB or Apache CouchDB
o Hosting Platform Vendors: Cloudera, Hortonworks, AWS, Google Cloud Platform
o Data Engineering Tools: Apache Hive, Apache Pig, Apache Sqoop, Apache Flume
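MapReduce is only named in the list above, so as a rough illustration of its programming model, here is a word count written in plain Python. This is not the Hadoop API: the functions `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical names that mimic the three stages on a single machine, whereas Hadoop distributes them across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data big value", "data in motion"]
word_counts = reduce_phase(shuffle_phase(map_phase(documents)))
print(word_counts)
```

Because each map call and each reduce call depends only on its own input, the work can be split across many machines, which is what makes the model suitable for very large data sets.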
Big Data Application Engineer + Data Scientist:
o Foundation Technology Platform: Apache Hadoop, HDFS, MapReduce, HBase
o Databases: HBase / Cassandra / MongoDB / Apache CouchDB
o Hosting Platform Vendors: Cloudera / Hortonworks / AWS / Google Cloud Platform
o Application Engineer Platforms: Apache Spark / Apache Storm / Apache Flink
o Programming Languages (1 or 2 is good to know): Scala, Java, Python, R
o Statistical modelling, Machine Learning – using Apache Spark’s machine learning library (Spark MLlib), Apache Flink’s machine learning library (FlinkML), or H2O
o Graph Database, Graph Analytics
o Scaling up Machine Learning Algorithms
o Apache Mahout – Knowledge of premade algorithms for Scala + Apache Spark / H2O / Apache Flink