Lies and big data



The other week someone brought to my attention an article titled “Lies Data Tell Us” by Steven J. Thompson, CEO at Johns Hopkins Medicine International. The title took me aback, but as I read it I realized the article was really about the better practices required to make data more useful. The provocative and somewhat misleading title resulted in nearly 12K views, dozens of comments and hundreds of shares on social media. When I started looking for this article again, the search returned a number of links that associate data, big data, etc. with “lies”. Most of the authors blame data, or unscrupulous vendors of mining and analysis technology, for all sorts of business problems resulting from “data lies”. It seems some of these authors use the following definition:

 Data Scientist (n): A machine for turning data you don’t have into infographics you don’t care about.

I would like to examine a process people often follow when they deal with data.

Since the term “big data” is thrown around a lot, I would like to define it in the context of this article. Mere volume and velocity of data do not constitute “big data”, but multiplicity of data sources and data formats does. From that perspective the term “big data” describes enterprise data aggregated from multiple departments and multiple databases (i.e. the data warehouse model), linked with data from sources external to the company, in a structured and/or unstructured format. Mining such a set of “right data” may produce very valuable intelligence. However, it can also result in wasted money, effort and opportunities if

  • The mining process does not produce relevant new intelligence, or
  • The intelligence is not used for action.

We act when we believe the action will result in a desirable outcome. We never know for sure, but we estimate the probability based on our experience in similar circumstances. These dynamics influence how we select, search and interpret data to produce intelligence, or the lack thereof. Subconsciously we select data that is likely to confirm our existing beliefs. This usually means that we rely heavily on internally generated (controlled) data and heavily discount externally generated data.

We like to use such terms as unbiased and objective, but the very process of selecting a data set introduces bias and subjectivity. It is unavoidable. It is a much better practice to embrace and understand a pragmatic bias, and to define the purpose of the inquiry. You don’t see people mining a mountain to find “whatever” is there. They carefully select and test an area for indications of a high concentration of the desired mineral before exploration and mining start.

If the purpose of your inquiry is to improve customer experience, assemble a data set from the most relevant internal and external data sources available. If you limit your data set to company-controlled data, you introduce a company bias. In such a case the likelihood of discovering any new intelligence for improving your customers’ experience is quite low. Forget about data mining and just continue your archaic surveying exercises of “guess and validate”. If you include data generated by customers without solicitation and control, you will introduce a customer bias. Adding channel-generated return data and customer service data will help balance the two biases. Correlating trends in controlled and external data sources will help you discover potential gaps between your beliefs and the emerging evidence. Even the best evidence, however, cannot automatically make people abandon their beliefs and start acting differently, but that is a subject for another article.
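To make the idea of correlating trends in controlled and external data a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the monthly figures are invented, the names `pearson` and `trend_gap` are hypothetical, and the 0.3 threshold is an arbitrary choice rather than a recommended value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


def trend_gap(controlled, external, threshold=0.3):
    """Flag a possible gap when the two trends fail to move together.

    The threshold is an assumption made for this sketch only.
    """
    r = pearson(controlled, external)
    return {"correlation": r, "possible_gap": r < threshold}


# Hypothetical monthly figures, invented purely for illustration.
survey_scores = [4.1, 4.2, 4.2, 4.3, 4.4, 4.4]   # company-run survey (controlled)
review_ratings = [3.9, 3.8, 3.6, 3.5, 3.3, 3.2]  # unsolicited reviews (external)

result = trend_gap(survey_scores, review_ratings)
print(result)
```

In this made-up case the survey scores drift up while the unsolicited ratings drift down, so the sketch flags a possible gap between belief and evidence. A flag like this is a prompt for a closer look, not a conclusion about cause.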

The point is – data cannot lie to us; we have to do it ourselves by not mining it honestly and competently.

Republished with author's permission from original post.

Gregory Yankelovich
Gregory Yankelovich is a Technologist who is agnostic to technology, but "religious" about Customer Experience and ROI. He has solid experience delivering high ROI projects with a focus on both Profitability AND Customer Experience improvements, as one without another does not support long-term business growth. Gregory currently serves as co-founder of, the software (SaaS) used by traditional retailers and CPG brand builders to create Customer Experiences that raise traffic in stores and boost sales per customer visit.


  1. Completely agree with your perspective, and your contextual definition. In order for sources and streams of information to be identified as ‘big intelligence’ rather than ‘big data’, there has to be action-driven objectivity in its analysis and application.

    One of the several ways in which data can lie is to use analytical tools of omission, rather than commission, i.e. slavishly looking for connections and correlations from the various databases rather than for causation. There’s a big difference in the intelligence and insight this produces:

  2. Michael, thank you for your comment. Finding causation while examining open systems is a very difficult, and some would say impossible, proposition. More than one statistician has told me that “causation is an ideological term”. As in many debates, the key to resolution lies in a very tight definition of “the truth”. Even gravity works only under limited conditions. I hope you would agree that these limitations should not stop us from using correlations to build models, IMHO. I addressed this issue in

