Studies Uncover Challenges in Agentic AI Delivering Business Value

August 3, 2025

All you need to improve your business operations is agentic AI. Or so they say these days.

Except that it isn’t true. As usual, business is a bit more nuanced than that, and we still see a considerable number of AI implementations fail to generate revenue. A widely cited number, allegedly from MIT Sloan, says that between 70 and 85 percent of all AI projects fail to deliver value. While there may not be one MIT Sloan study that gives this result, several reports by MIT or Gartner dating back to 2019 cite a significant failure rate of AI projects.

Interestingly, many recent publications on the success or failure of AI initiatives still – in 2025 – cite these almost ancient sources.

With all the rage and hype that Salesforce, in particular, managed to generate about agentic AI and its potential to become a new digital employee, we need to ask ourselves whether the technology can already deliver. Or, if it delivers, to what extent.

In my first column of this year, on how to identify when agentic AI is helpful and when it is not, I laid out a framework that helps identify which problems to address with agentic AI and which ones to avoid. The framework revolves around problem complexity and builds on the Cynefin framework. I settled on addressing business problems that are underpinned by rich data, evolving conditions, and, importantly, some error tolerance. While I stand by it, it is not exactly a scientific approach.

Since then, my ongoing research has uncovered two scientific studies on arXiv, one conducted by CMU and one by Salesforce Research. These are two credible sources that provide a more nuanced picture of where agentic AI works and where it does not than vendor success stories, which by definition focus on successful implementations.

But is it all success? Can an agentic AI implementation deliver value to the customer, and if so, under which circumstances? Answering these questions is what makes these studies valuable. They dig right into the thick of it using simulations of standardized business scenarios.

The studies

Both studies are designed to evaluate the capabilities of LLM-based AI agents in business environments. CRMArena-Pro by Salesforce Research naturally focuses on CRM tasks across B2B and B2C scenarios. The authors identified nineteen tasks commonly executed in CRM systems and grouped them into four business skill categories:

database querying and numerical computation
information retrieval and reasoning
workflow execution
policy compliance

The benchmark also includes a confidentiality awareness evaluation.

TheAgentCompany, on the other hand, covers a wider area along the business value chain but has a narrower focus on software engineering companies. Another significant difference between the two studies is that TheAgentCompany focuses more on complex tasks that require multiple steps for their execution. The authors created the evaluated tasks through a combination of referencing the O*NET task list, introspection by paper co-authors who had experience in each task category, and brainstorming lists with language models. Tasks were chosen to simulate everyday work within a software development company, spanning software development, project management, data science, administration, HR, and finance.

CRMArena-Pro differentiates between single-turn and multi-turn interactions and analyses them separately, while TheAgentCompany does not make this distinction.

What are single- and multi-turn interactions?

Single-turn interactions are tasks that are generally restricted to a single query and response interaction. The LLM receives all necessary information in the initial prompt and is expected to provide a complete answer or execute a task without requiring further dialogue or clarification from the user. Examples of single-turn interactions include asking a voice assistant, “What’s the weather today?” or requesting a chatbot to “Translate this sentence to French.”

In contrast, multi-turn interactions are interactions that involve a series of back-and-forth exchanges between a user and the LLM agent to complete a task. The system maintains context and memory throughout the conversation, thus enabling more natural, human-like dialogues where the AI can understand references to previous parts of the conversation and build upon earlier exchanges. An example of a multi-turn interaction might begin with a user asking “Book me a flight,” followed by the AI asking clarifying questions about dates, destinations, and preferences, ultimately completing a complex booking process through sustained dialogue.

Multi-turn interactions more closely mirror real-world applications, ranging from sustained dialogues to complex iterative problem-solving. They inherently demand an understanding of conversational history, nuanced interpretation of previous exchanges, iterative refinement of goals, and adaptive response strategies that are required in business settings.

The results

Although the two studies used different but overlapping sets of LLMs, the results are quite consistent.

To summarize, LLM-based agents do deliver, but only partly. I do not think that this comes as a surprise.

Both studies show that modern reasoning models perform better on average than non-reasoning models. Notably, Gemini 2.5 Pro appears to be the best model in both studies. In CRMArena-Pro, OpenAI’s o1 model is a close runner-up, while in TheAgentCompany, Claude 3.7 Sonnet ranks second. TheAgentCompany did not test OpenAI o1. As both studies show GPT-4o being significantly weaker than Gemini 2.5 Pro, I assume the results are consistent across the studies.

CRMArena-Pro shows that, even with a leading model (Gemini 2.5 Pro), agents solve only slightly more than half of the problems thrown at them in a single-turn scenario. They excel at workflow execution, where the leading models correctly solve 80 percent or more of the tasks. Their weak spot is information retrieval and textual reasoning, which requires them to reason over unstructured or semi-structured textual information.

Both studies show that LLM-based agents struggle to deliver correct results in multi-turn tasks. To put it bluntly, even the best-performing models (Gemini 2.5 Pro, Claude 3.7 Sonnet, OpenAI o1) fail more often than not, even if one includes partial success. On a binary scale, the highest success rate of LLM-based agents hovers around 30 percent; put the other way round, they fail in about 70 percent of all tasks. Again, the strengths lie in workflow automation and, in terms of business scenarios, in software development, project management, and HR tasks.

And these are the agents that are based on the best tested models, which are Gemini 2.5 Pro, Claude 3.7 Sonnet, and OpenAI o1.
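
Part of the reason multi-step tasks are so much harder is plain statistics. As a rough sketch, assume (a simplification that neither study makes) that every step of a task succeeds independently with the same accuracy. The chance of completing the whole task is then the per-step accuracy raised to the number of steps. The function name and the 95 percent figure below are illustrative only.

```python
def chain_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` steps succeed.

    Assumes every step succeeds independently with the same
    probability `step_accuracy` (an illustrative simplification).
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return step_accuracy ** steps


if __name__ == "__main__":
    # Show how quickly the end-to-end success rate erodes.
    for steps in (1, 5, 10, 20):
        rate = chain_success_rate(0.95, steps)
        print(f"{steps:>2} steps: {rate:.1%}")
```

Under these assumptions, an agent that is right 95 percent of the time per step completes a ten-step task about 60 percent of the time and a twenty-step task only about 36 percent of the time.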

What does this mean for executives?

At a high level, a decision process based on the Cynefin framework works. The additional insight we can derive from these studies is that it is extremely helpful to start where there is a good expectation of success, i.e., in single-turn scenarios. This helps gain the necessary experience in building and deploying agents that work. Also, to see early successes, it is important to start with business skills that are promising. The most promising is workflow automation of (ideally improved) processes. Identify those that are critical enough to pursue and that don’t need multiple user interactions.

Identify relevant scenarios that require different business skills and prioritize them. From there, select an agentic platform that can support all of these scenarios and that is future-proof. By future-proof, I mean that the platform should have client- and server-side Model Context Protocol (MCP) capabilities and have agent-to-agent (A2A) capabilities not too far out on its roadmap.

An additional, yet implicit, result of the studies is that data quality matters. Both studies rely on carefully prepared bodies of data. Other studies have shown that the impact of poor data on any (not only agentic) AI initiative is huge. Decision makers should keep the proverb “rubbish in – rubbish out” at the front of their minds before engaging in an agentic AI initiative. They should start where enough data of sufficient quality is available for automating processes, or even start with a data quality project.

If you need help with this journey, just ask me.

4 COMMENTS

Ricardo Saltz Gulko August 3, 2025 At 4:33 am
Great point, Thomas. We’re seeing similar patterns in our work with enterprise clients within the Samsung world related to CX. Agentic AI shows real promise, especially in workflow execution as in initial responses—but the reality is far from the hype, there are many challenges still. In fact, my upcoming article next week dives into exactly this: where these agents succeed, where they struggle, and how data quality and scenarios as need analyses for AI make or break value creation. Looking forward to continuing this important conversation. Great read! Thanks so much R
Trent Rossini August 4, 2025 At 10:13 am
Very good article, Thomas – the point about quality data is key. LLMs excel when they have context and thus are best applied in situations of appropriate context. Here is a link to an article I wrote on LinkedIn on the subject of using agents in the context of Customer Experience Management
https://www.linkedin.com/pulse/how-guide-customer-journeys-agentic-ai-trent-rossini-h8qef/
Thomas Wieberneit August 4, 2025 At 10:18 am
Thanks Ricardo, I am looking forward to your article.
After all, it boils down to statistics. If an individual agent has an accuracy of 95%, then the aggregate accuracy goes down with every additional step. There is some fascinating research going on in the realm of RAC (retrieval augmented correction).
Thomas Wieberneit August 5, 2025 At 10:53 am
Thanks Trent. Yes, data is important. We should also not forget that we will never have really clean data, so the ability to work with imprecision is important, too. As is a level of error tolerance that of course varies widely depending on the use case.