The Critical Role of Enterprise Data in Generative AI


A huge number of Gen AI-based tools and applications have flooded the market. Some of these applications are clever and creative, but most are wrappers for the large language models (LLMs) behind applications such as ChatGPT. That is not to detract from the thousands of potential scenarios where an LLM’s knowledge of language, concepts, and word relationships provides new efficiencies and productivity.

But LLMs and ChatGPT cannot solve all of the organization’s information problems. Machine learning is at the core of AI applications. Conventional tools in the enterprise tech stack (such as ERP, data warehouses, eCommerce, and content/knowledge management) are increasingly incorporating machine learning into core functionality.

These systems need the organization’s data, not just the data from a generative model, to provide answers that are precise and relevant. Furthermore, generative models need the organization’s data to provide differentiation. Using the same model that your competitor uses will not help you compete. A very creative application may give you a temporary edge, but if a competitor is using the same tooling, that advantage will not last, because copycats will appear almost instantly. LLMs also have quirks and challenges that need to be addressed to ensure that the technology correctly serves the employee or customer: they are prone to hallucinations (factually incorrect responses), they lack audit trails, and they can leak IP.

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) helps overcome these limitations. This approach uses the data of the enterprise as a source of truth. Rather than relying on the LLM’s knowledge of the world, it interprets the user’s query, retrieves information through one of several mechanisms, and makes the answer understandable and conversational for humans. But it requires the organization’s information as a reference point: the source of enterprise truth, which is the source of competitive differentiation.
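
To make the flow concrete, here is a minimal sketch in Python of the interpret, retrieve, and generate steps. Everything in it is an assumption for illustration: the sample documents, the stopword list, and the function names are hypothetical, and simple keyword overlap stands in for whichever retrieval mechanism a real system would use. The final call to an LLM is left out; the sketch stops at the prompt that would be sent.

```python
import re
from collections import Counter

# Hypothetical enterprise snippets; a real system would pull these from
# content, knowledge, or product repositories.
DOCUMENTS = {
    "returns-policy": "Customers may return unopened products within 30 days for a full refund.",
    "warranty": "All pumps carry a two year warranty covering parts and labor.",
    "shipping": "Standard shipping takes five business days within the continental US.",
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "for", "of", "and", "to", "what", "how", "on", "within", "may", "all"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def retrieve(query: str, documents: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the ids of the documents sharing the most tokens with the query."""
    query_tokens = Counter(tokenize(query))
    scores = {}
    for doc_id, text in documents.items():
        overlap = sum((query_tokens & Counter(tokenize(text))).values())
        if overlap > 0:
            scores[doc_id] = overlap
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


def build_prompt(query: str, documents: dict[str, str], top_k: int = 1) -> str:
    """Ground the question in retrieved enterprise text before it goes to an LLM."""
    doc_ids = retrieve(query, documents, top_k)
    if doc_ids:
        context = "\n".join(f"[{d}] {documents[d]}" for d in doc_ids)
    else:
        context = "(no matching enterprise content found)"
    return (
        "Answer using only the context below and cite the source id. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("What is the warranty on pumps?", DOCUMENTS))
```

Keeping the source id next to each retrieved passage is one simple way to give the answer a trail back to the enterprise content it came from.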

What does this information look like? It begins with customer, transaction, and product data and content as well as the knowledge and expertise unique to the target market and core competencies of the enterprise. How do you solve the problems that your customer has? How do you reach your target customers with products and services? How do your supplier relationships and supply chain understanding help you beat the competition? 

These are all sources of competitive differentiation. Each one depends on understanding how data, knowledge, and content flow from supplier through manufacturing and distribution to the end customer, including the best ways to reach customers and what marketing content to use.

It might seem that Gen AI could be a big help in creating marketing content. But how is your Gen AI marketing content different from someone else’s? Better prompts? More creative questions? Contextual information? There still needs to be a human and creative element. Machines can automate but humans are still needed to connect. We cannot outsource our human abilities to machines.

While machine learning and Gen AI tools can automate many of the routine and rote activities of humans, there are still orders to fill and products to organize in a catalog. Users need to search for products of interest. They need to learn about, choose, purchase, use, and maintain the product or solution from your organization. 

Each of these functions requires a repository of data, and data requires a certain structure. The core structure of data in the enterprise has been referred to as master data. There are various flavors of master data: customer, product, financial, transaction, and content. Many different tools are on the market to address particular use cases for each domain.

Master data misses nuances

But master data alone misses much of the nuance and value of data. True insights can be derived and applied by understanding how one piece of information relates to another. A customer identity graph is a data representation that illustrates the relationships between various attributes such as customer type, interests, past purchases, buying intent, and more. (IMDb is an example of a graph data structure. It contains movies, actors, and directors. For example, one can look at a movie, find the director, and then look at other movies that person directed and which actors starred in them. Choose an actor and the system can provide all of their films.) Those are graph relationships. A customer identity graph can help an e-commerce application present the most relevant products for that customer. This comes from the customer details that are captured as the data exhaust at each touchpoint throughout the customer journey.
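
As an illustration of graph relationships, the following sketch stores typed edges in a small in-memory structure and walks them in both directions. The `Graph` class, the relation names, and all of the sample nodes (placeholder movie titles, `cust-1`, `pump-100`, and so on) are hypothetical; a production system would more likely use a dedicated graph database.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory graph with typed, bidirectional relationships."""

    def __init__(self):
        # Maps (node, relation) to the set of nodes reached by that relation.
        self._edges = defaultdict(set)

    def add_edge(self, source, relation, target, inverse):
        """Store the relation and its inverse so it can be walked both ways."""
        self._edges[(source, relation)].add(target)
        self._edges[(target, inverse)].add(source)

    def neighbors(self, node, relation):
        """Return the nodes linked to `node` by `relation`, sorted for stable output."""
        return sorted(self._edges.get((node, relation), set()))


def other_movies_by_same_director(graph, movie):
    """Movie -> director -> that director's other movies."""
    results = set()
    for director in graph.neighbors(movie, "directed_by"):
        results.update(graph.neighbors(director, "directed"))
    results.discard(movie)
    return sorted(results)


def recommend(graph, customer):
    """Products in the customer's interest categories that they do not own yet."""
    owned = set(graph.neighbors(customer, "purchased"))
    candidates = set()
    for category in graph.neighbors(customer, "interested_in"):
        candidates.update(graph.neighbors(category, "contains"))
    return sorted(candidates - owned)


graph = Graph()

# Movie-style relationships, using placeholder names.
graph.add_edge("Movie A", "directed_by", "Director X", inverse="directed")
graph.add_edge("Movie B", "directed_by", "Director X", inverse="directed")

# Customer identity relationships, using placeholder names.
graph.add_edge("cust-1", "interested_in", "pumps", inverse="interests")
graph.add_edge("cust-1", "purchased", "pump-100", inverse="purchased_by")
graph.add_edge("pump-100", "in_category", "pumps", inverse="contains")
graph.add_edge("pump-200", "in_category", "pumps", inverse="contains")
graph.add_edge("valve-10", "in_category", "valves", inverse="contains")

if __name__ == "__main__":
    print(other_movies_by_same_director(graph, "Movie A"))
    print(recommend(graph, "cust-1"))
```

The same traversal pattern serves both cases: start at one node, follow a relationship to a shared node, then follow the inverse relationship back out to related items.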

Touchpoints leave a data trail

Each customer touchpoint is enabled by various customer experience technologies, and each captures customer details in a data model: the descriptors of customers such as demographics, firmographics, market segment, technical literacy, purchased products, and many more details. These descriptors are called features in machine learning. They can also be called “attributes”: the metadata that describes characteristics of a prospect or customer. Who are they and what do we know about them? What size organization do they work for? What is their role or position? What are their interests? How technically proficient are they? What objectives are they trying to achieve? What is their overall remit within the enterprise?
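
One way to picture these descriptors as machine learning features is the sketch below. The `CustomerProfile` fields, the 1 to 5 literacy scale, and the `name=value` encoding are all assumptions chosen for illustration, not a prescribed data model.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerProfile:
    """Hypothetical descriptors captured across customer touchpoints."""
    company_size: str                  # firmographic, e.g. "mid-market"
    role: str                          # e.g. "maintenance-engineer"
    technical_literacy: int            # assumed scale from 1 (low) to 5 (high)
    interests: list[str] = field(default_factory=list)
    purchased_products: list[str] = field(default_factory=list)


def to_features(profile: CustomerProfile) -> dict[str, float]:
    """Flatten a profile into named numeric features.

    Categorical values become indicator features ("name=value" set to 1.0);
    numeric values are passed through as floats.
    """
    features = {
        f"company_size={profile.company_size}": 1.0,
        f"role={profile.role}": 1.0,
        "technical_literacy": float(profile.technical_literacy),
    }
    for interest in profile.interests:
        features[f"interest={interest}"] = 1.0
    for product in profile.purchased_products:
        features[f"purchased={product}"] = 1.0
    return features


profile = CustomerProfile(
    company_size="mid-market",
    role="maintenance-engineer",
    technical_literacy=4,
    interests=["pumps"],
    purchased_products=["pump-100"],
)

if __name__ == "__main__":
    for name, value in to_features(profile).items():
        print(name, value)
```

Each touchpoint can add or update entries in the profile, and the flattened features are what a recommendation or retrieval model would actually consume.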

Using RAG to reduce hallucinations

According to SAP blogger Abhijeet Kankani, “RAG significantly expands the scope of Large Language Models (LLMs) in enterprise settings. Typically, while LLMs excel in text creation, they cannot pull in specific, detailed data from company databases. RAG addresses this by retrieving the necessary information to ensure AI-generated responses are both relevant and factually accurate.”[1]

A post by Kevin Wei from Writers Room describes the need to process information for RAG as one would organize information in a library: “Information preprocessing is analogous to cataloging in real-world libraries. This involves organizing information into categories and assigning keywords to each piece of information for easier retrieval and identification. This process helps to make data more accessible and easier to search for and understand.”[2]

Building libraries for reference

He further describes contextualization as “Storing the organized data in a vector database or a suitable location, setting the stage for seamless integration into the text generation process. It involves creating a hierarchical structure based on relevant keywords or terms, which can then be used to locate relevant documents or texts quickly and easily. This can be compared to the process of shelving books in a library according to relevant topics or genres, which helps patrons quickly find the materials they need.”
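
The library analogy can be sketched as a small catalog that shelves documents by category and then by keyword. This is only an illustration under stated assumptions: the records, field names, and functions are hypothetical, and a plain nested dictionary stands in for the vector database mentioned in the quote.

```python
from collections import defaultdict

# Hypothetical preprocessed records: each document has already been assigned
# a category and a set of keywords during cataloging.
RECORDS = [
    {"id": "doc-1", "category": "maintenance", "keywords": ["pump", "seal"]},
    {"id": "doc-2", "category": "maintenance", "keywords": ["valve"]},
    {"id": "doc-3", "category": "sales", "keywords": ["pump", "pricing"]},
]


def build_catalog(records):
    """Build a category -> keyword -> [document ids] hierarchy."""
    catalog = defaultdict(lambda: defaultdict(list))
    for record in records:
        for keyword in record["keywords"]:
            catalog[record["category"]][keyword.lower()].append(record["id"])
    return catalog


def lookup(catalog, category, keyword):
    """Find documents shelved under a category and keyword.

    Uses .get so that a failed lookup does not add empty shelves to the catalog.
    """
    return list(catalog.get(category, {}).get(keyword.lower(), []))


catalog = build_catalog(RECORDS)

if __name__ == "__main__":
    print(lookup(catalog, "maintenance", "Pump"))
    print(lookup(catalog, "sales", "pump"))
```

The same keyword can sit on different shelves, so narrowing by category first returns the maintenance document for "pump" rather than the sales one.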

The cataloging system has a customer identity graph as its foundation (along with other graph data structures). Enabling LLMs to get the right answer for the customer or employee means retrieving information based on their context. That context comes from the digital body language that they exhibit during their journey. LLMs are amazing algorithms, but they need enterprise data to provide true utility and competitive advantage. This requires modeling the organization’s data, content, and customer catalogs so that the right information can be retrieved for the LLM rather than leaving it to its own, potentially hallucinatory, devices. Getting the right answers depends on having properly curated and structured enterprise information available.





