To Properly Navigate a Data Lake, You Must Chart Your Route



Pentaho CTO James Dixon is credited with coining the phrase “data lake.” He stated, “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” The concept of building a Hadoop-based data store and dumping all raw data onto the platform is not a new one, but many ships have been lost at sea thinking they were navigating a lake.

There are two perils you must avoid when deploying a data lake to ensure you can navigate safely to your destination. The first is the “data swamp,” an environment that is muddied and dangerous. From a data perspective, this is best described as not knowing what data is what or how to use it. For example, it is very common within the lending industry to have multiple operational systems containing what appears to be the same data, when each really represents the state of the data at a different point in time. This ambiguity can lead to false analytical results and, in turn, bad business decisions.

The second peril is being lost at sea: an environment so vast and broad that you can’t navigate the data to gain insight. It takes a highly skilled resource just to identify what data is available, and actually navigating it demands even greater technical skill. Data lakes in general are flawed from this perspective and yield little business value unless a data strategy is employed.

I spend a significant amount of time working with my clients to define a data strategy based on the traditional definition of a data lake; going in with a plan and a map makes their journey both safe and productive. With a few guiding principles, we can quickly jumpstart analytic capabilities, provide a foundation for future structured analytics, and finally see real improvement in the “time to value” of the data.

1. Inventory the data – Simply putting data onto Hadoop generally adds little value. An inventory of the data, where it came from, and how often it arrives will help guide users to the appropriate source of raw information.
2. Profile the data – Going back to the basics of data exploration, summarized data profiles often provide more insight than vast amounts of raw data, and they let users understand what they are looking at before digging into the details.
3. Collect metadata – Both technical and business metadata are generally available for both structured and unstructured data. Simply collecting this information, using HCatalog as the means of storage, and providing a familiar SQL interface for users can make a world of difference.
4. Provide a framework that integrates – Once you have inventoried, profiled, and collected metadata, giving end users a single interface for exploring both the collected information and the raw data may prove to be the most valuable of data tools and truly unlock the power of the data lake.
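The first two steps above can be sketched in a few lines of code. The snippet below is a minimal, hypothetical illustration, not a prescribed tool: the inventory fields (`source_system`, `refresh_frequency`, and so on) and the profiling statistics are assumptions chosen for the example. In practice the inventory and metadata would typically be registered in HCatalog and queried through SQL, as described in step 3.

```python
# Minimal sketch of steps 1-2: an inventory record plus a column profiler.
# All field names and sample data are illustrative assumptions.

from collections import Counter

def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, top values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

def profile_table(rows):
    """Profile every column in a table represented as a list of dicts."""
    columns = rows[0].keys() if rows else []
    return {col: profile_column([r.get(col) for r in rows]) for col in columns}

# Step 1: an inventory entry for a raw feed landed in the lake.
inventory_entry = {
    "dataset": "loan_applications_raw",
    "source_system": "origination_platform",   # assumed system name
    "refresh_frequency": "daily",
    "landed_path": "/lake/raw/loan_applications/",
}

# Step 2: profile a small sample before diving into the raw detail.
sample = [
    {"loan_id": 1, "state": "MO", "amount": 25000},
    {"loan_id": 2, "state": "MO", "amount": None},
    {"loan_id": 3, "state": "KS", "amount": 18000},
]
print(profile_table(sample))
```

Even a summary this small answers the swamp-avoiding questions: how many rows, how many nulls, how many distinct values per column, and which values dominate, before any user commits to the raw detail.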

In summary, by avoiding some typical pitfalls we can elude both the swamp and the sea. A little structure in an otherwise unstructured environment will rapidly allow you to gain insight into your data, provide a means for new analytic capabilities, and ultimately impact your business in a way that hits the bottom line.

Jay Houghton, Senior Vice President, Technology Solutions Group

Jay leads Merkle's Financial Services and Nonprofit Technology Services Group and has more than 20 years of information technology and banking industry experience. Merkle's Technology Services Group focuses on best-in-industry implementations of marketing analytic platforms, providing a foundational approach to marketing data management and analytics.


  1. I agree that you need to have a strategy and best practices in order to gain the most value from a data lake.

    But I don’t agree that data lakes are generally flawed just because they hold so much data that they are hard to navigate. The whole point of data lakes is the amount of data they can store. It’s like saying large hard drives are flawed because they can store a lot of files. Any lack of organization is due to the user, not the data store itself.

    James Dixon

  2. I posted my response. You are absolutely correct, and I really like your thoughts. Here is what I posted: “Great article, and I agree. This is the same mistake organizations made when they started building the DWH: ‘we will build it and they will come.’ Now for the data lake it is ‘we will dump the data and they will use (explore) it.’ Creating data lakes without thinking will drown them, no matter what tools they use. Very well explained in the article. Thanks.”

    Creating data lakes without any of the steps you mentioned, and using new tools to show colorful reports, should not be the goal. When the basics and fundamentals of data are ignored, the result can be a “dirty” data lake.


