As many of you may know, Big Data has added a 4th ‘V’ to its definition – the concept of Veracity. Oh wait, apparently a 5th and 6th ‘V’ have been added – Value and Viability. A 7th? Not yet? It’s only a matter of time. I nominate Vaticination.
Since Veracity was suggested by IBM, I’ll focus on this one for now. Veracity captures the fact that Big Data contains a lot of noise and error that can easily obscure the ‘truth’ within the data. The data tends to be so big and to move so quickly that standard data cleaning and management procedures are difficult to apply. As a result, the trustworthiness of Big Data is an important factor to consider and to question.
However, we have to recognize that Veracity cannot be defined once and for all for a given set of data. It depends on the question the data is being used to answer and the purpose for which the data was collected. For instance, let’s say I have a massive dataset of all the local deliveries made by a construction supply company. This dataset contains information on every order delivered over the last year, and includes the products in each order, the customer who ordered it, the site it was delivered to, and whether the delivery was on time. This data could have high veracity when answering a question about problem areas or problem products for on-time delivery. However, it will have lower veracity when answering a question about the next product a customer is going to order. Because it only covers deliveries, this particular dataset is missing information about will-call orders, returns, and detailed customer demographics – all things that would be important for creating a next-best offer model.
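To make the delivery example a little more concrete, here is a minimal Python sketch that compares what the dataset records with what each question would need. The field names and the list of fields each question requires are my own illustrative assumptions, and field coverage is only a crude first check, not a measure of veracity itself.

```python
# Fields recorded in the delivery dataset described above.
# The names are illustrative, not taken from a real schema.
DELIVERY_FIELDS = {"products", "customer", "delivery_site", "on_time"}

# Assumed for illustration: the fields each business question would need.
QUESTION_NEEDS = {
    "late_delivery_hotspots": {"products", "delivery_site", "on_time"},
    "next_best_offer": {
        "products",
        "customer",
        "will_call_orders",
        "returns",
        "customer_demographics",
    },
}


def coverage(available, needed):
    """Return (share of needed fields present, sorted list of missing fields)."""
    missing = sorted(needed - available)
    return (len(needed) - len(missing)) / len(needed), missing


if __name__ == "__main__":
    for question, needed in QUESTION_NEEDS.items():
        share, missing = coverage(DELIVERY_FIELDS, needed)
        print(f"{question}: {share:.0%} of needed fields present, missing: {missing}")
```

The same dataset scores very differently depending on the question asked of it, which is the point: a field-level check like this says nothing about noise or error inside the fields that are present, but it does flag early when a question is a poor fit for the data on hand.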
In my opinion, we spend too much time trying to define Big Data, and not enough time figuring out what data and analytic approaches are necessary to answer key business questions. Defining and scoping Big Data is an important discussion to have. Harnessing and understanding Big Data will lead to new questions and new answers for the world, but there are still plenty of big questions that need to be answered in the meantime using whatever size and type of data we have available.