AI Agents Failing (Over 40 Per Cent of Projects Predicted to Be Cancelled)

New research has found that today's AI agents fail around 70 per cent of standard office tasks, while Gartner warns that over 40 per cent of current agentic AI projects will be scrapped by the end of 2027.
What Are ‘AI Agents’ And Why Are They Struggling?
AI agents are software systems that use large language models (LLMs), like ChatGPT or Claude, in combination with tools and applications to carry out goal-driven tasks without constant human input. Unlike chatbots or virtual assistants that only provide responses, agentic AI is designed to take actions, such as navigating software, interacting with web content, or managing emails, based on natural language instructions.
Examples include agents that can generate reports, schedule meetings, or execute multi-step operations such as processing CRM queries or managing code deployments. The idea behind them is that AI can behave like a semi-autonomous digital worker, thereby improving speed and efficiency while reducing costs. However, recent evidence suggests the reality falls far short of the promise.
For example, in a landmark study by researchers at Carnegie Mellon University (CMU), most of today’s leading AI agents were only able to complete around 30–35 per cent of assigned office tasks. That means they failed nearly 70 per cent of the time.
Testing Real-World Tasks
To evaluate how AI agents perform in realistic workplace scenarios, the CMU team created TheAgentCompany, a simulated IT company environment designed to mimic tasks that real employees might encounter. These included browsing the web, writing and editing code, interpreting spreadsheets, drafting performance reviews, and messaging colleagues on internal comms tools like RocketChat.
Results Not Good
Researchers tested agents based on how many tasks they could complete fully and accurately. Top-scoring models included Gemini 2.5 Pro, which managed a 30.3 per cent success rate, and Claude 3.7 Sonnet, which achieved 26.3 per cent. Other well-known models fared worse. GPT-4o completed just 8.6 per cent of tasks, while some large-scale models like Amazon-Nova-Pro and Qwen-2 scored under 2 per cent.
Variation and Serious Slip-Ups
“We find in experiments that the best-performing model…was able to autonomously perform 30.3 per cent of the provided tests to completion,” the CMU team noted. Even with extra credit for partial progress, most agents still fell short of reliable performance.
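Benchmarks like this typically report two numbers: the share of tasks completed fully, and a softer score that gives partial credit for intermediate checkpoints reached. The sketch below illustrates the idea only; the half-weight checkpoint scheme is an assumption, not CMU's exact formula.

```python
def score_task(checkpoints_passed, total_checkpoints, fully_completed):
    """Score one task: 1.0 for full completion, otherwise partial
    credit for checkpoints reached (illustrative weighting)."""
    if fully_completed:
        return 1.0
    if total_checkpoints == 0:
        return 0.0
    # Assumed scheme: partial progress earns at most half credit
    return 0.5 * (checkpoints_passed / total_checkpoints)

def benchmark_summary(results):
    """results: list of (checkpoints_passed, total, fully_completed)."""
    full_rate = sum(1 for _, _, done in results if done) / len(results)
    partial = sum(score_task(c, t, d) for c, t, d in results) / len(results)
    return full_rate, partial

# Example: 2 of 4 tasks fully completed, the rest only partially
results = [(3, 3, True), (5, 5, True), (1, 4, False), (0, 2, False)]
full, partial = benchmark_summary(results)
```

This is why headline figures can differ: a model with a 30 per cent full-completion rate can look noticeably better once partial progress is counted.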
Also, it looks as though the failures weren’t just minor slip-ups. For example, in some cases, agents forgot to message colleagues, froze while interacting with pop-ups, or even faked task completion, such as renaming users to make it seem like they’d contacted the correct person.
Salesforce’s Findings Echo the Concerns
A separate study by Salesforce offered similarly sobering results. In their CRM-focused benchmark CRMArena-Pro, LLM agents completed about 58 per cent of simple, single-turn customer service tasks. However, in multi-step scenarios where context had to be maintained, success rates dropped sharply to around 35 per cent. None of the evaluated agents demonstrated any meaningful understanding of confidentiality—an essential requirement for deployment in enterprise settings.
The researchers concluded: “LLM agents are generally not well-equipped with many of the skills essential for complex work tasks.”
Over 40 Per Cent of Projects Will Be Cancelled by 2027
Industry analysts at Gartner believe this isn't just a technical hiccup but an indicator of wider strategic risk. The firm predicts that more than 40 per cent of all agentic AI projects will be cancelled by the end of 2027, citing three key drivers: spiralling costs, unclear business value, and inadequate risk controls.
“Most agentic AI projects right now are early-stage experiments or proofs of concept that are mostly driven by hype and are often misapplied,” said Anushree Verma, Senior Director Analyst at Gartner. “This can blind organisations to the real cost and complexity of deploying AI agents at scale.”
A January 2025 Gartner poll of more than 3,400 business respondents revealed that while 19 per cent had already made significant investments in agentic AI, another 42 per cent were only dipping a toe in. Around a third were still waiting to see how the technology matures before committing.
What’s Going Wrong?
A key issue appears to be the fact that many supposed “AI agents” aren’t really agentic at all. For example, Gartner has criticised the growing trend of ‘agent washing’, where vendors rebrand chatbots, rule-based automation tools, or even basic assistants as ‘agents’ to ride the hype wave. Of the thousands of companies claiming to offer agentic AI products, Gartner estimates that only around 130 genuinely qualify.
Even for the legitimate players, it seems that technical challenges abound. For example, CMU’s team highlighted the following major limitations:
– Common-sense reasoning failures. AI agents often misinterpret basic instructions or misunderstand context. This limits their ability to carry out even straightforward workplace tasks.
– Poor tool integration. Many agents struggle to operate reliably within software interfaces. They may freeze, click the wrong buttons, or fail to retrieve the right data.
– Fabricated outputs. Hallucination remains a major problem. Agents sometimes invent plausible-sounding but incorrect responses. Among developers, 75 per cent report experiencing hallucinated functions or APIs.
– High cost and inefficiency. Despite being pitched as labour-saving, one study estimated that a typical AI agent task involved around 30 steps and cost over $6, often more than it would cost to have a person do the work manually.
– Security and privacy risks. Because agents need wide-ranging system permissions, there’s a serious risk they could accidentally expose sensitive data, or act unpredictably in ways that breach confidentiality.
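The permissions risk in the last point is often mitigated with a least-privilege gateway: the agent never calls tools directly, only through a wrapper that enforces an explicit allow-list. A minimal sketch, where the tool names and the calendar stub are purely illustrative:

```python
def make_tool_gateway(allowed_tools):
    """Restrict an agent to an explicit allow-list of tools —
    a least-privilege guard against over-broad system permissions."""
    def call_tool(name, *args):
        if name not in allowed_tools:
            raise PermissionError(f"agent not granted access to {name!r}")
        return allowed_tools[name](*args)
    return call_tool

# Example: the agent may read the calendar but has no email tool at all
gateway = make_tool_gateway({"read_calendar": lambda day: f"events for {day}"})
```

The design point is that a capability the agent was never handed cannot be misused, whereas a blanket system login can be.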
Complexity and Context
While some agent frameworks are improving, it seems that the wider problem is that many office tasks require not just automation, but judgement. For example, Graham Neubig, a co-author of the CMU paper, explained that while coding agents can be sandboxed to limit risk, office agents must interact with live systems, sensitive messages, and human colleagues.
“It’s very easy to sandbox code…whereas, if an agent is processing emails on your company email server…it could send the email to the wrong people,” Neubig warned.
There’s also the issue of persistence. Multi-step tasks require agents to keep track of state, adapt based on outcomes, and respond to dynamic inputs. Even advanced models struggle to maintain context and consistency across more than a handful of steps, particularly when unexpected events, such as a pop-up, an error message, or a missing file, intervene.
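The state-tracking problem can be sketched with a toy agent loop. Everything here, the step names, the pop-up interruption, the retry budget, is invented for illustration; real agent frameworks are far more elaborate, but the failure mode is the same: without a way to handle the interrupt, the run simply stalls.

```python
def run_agent(steps, handle_interrupt=None, max_retries=3):
    """Toy multi-step agent loop. Each step takes the shared state dict
    and returns 'ok' or the name of an unexpected event (e.g. 'popup')."""
    state = {"completed": []}
    for step in steps:
        for _attempt in range(max_retries):
            outcome = step(state)
            if outcome == "ok":
                state["completed"].append(step.__name__)
                break
            if handle_interrupt is None:
                return state, False  # brittle agent: freezes on the pop-up
            handle_interrupt(outcome, state)  # e.g. dismiss it, then retry
        else:
            return state, False  # retries exhausted
    return state, True

# Illustrative steps: the second one fails until a pop-up is dismissed
def open_spreadsheet(state):
    return "ok"

def edit_cells(state):
    return "ok" if state.get("popup_dismissed") else "popup"

def dismiss_popup(event, state):
    state["popup_dismissed"] = True
```

Running `run_agent([open_spreadsheet, edit_cells], dismiss_popup)` completes both steps; the same plan without an interrupt handler fails at step two, which mirrors the "froze while interacting with pop-ups" behaviour the CMU team observed.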
Vendors, Buyers, and the Enterprise
For AI companies, the research findings appear to cast doubt on the maturity of the agentic AI market. Those selling genuine solutions will need to demonstrate clear, auditable performance, while others may face a credibility backlash if their products are exposed as agent-washed rebrands.
For enterprise buyers, the message is to proceed with caution. Agentic AI holds promise, but only for very specific use cases where outputs can be clearly defined, risks are manageable, and success is measurable. Without that, projects risk becoming costly distractions that never reach production.
Gartner suggests that businesses focus agentic AI investment only where the technology can deliver proven ROI, e.g. by automating decisions, not just tasks, or by redesigning workflows to be agent-friendly from the ground up. “It’s about driving business value through cost, quality, speed and scale,” Verma explained.
Even so, Gartner remains optimistic that the agentic AI landscape will improve. By 2028, they predict that 15 per cent of all daily work decisions will be made autonomously by AI agents, up from none in 2024. They also expect 33 per cent of enterprise software applications to include agentic AI functionality by that time, suggesting that while short-term challenges are real, the long-term potential may still emerge.
What Does This Mean For Your Business?
The current hype around AI agents may be loud, but the reality behind the scenes appears to be proving far messier. Recent research shows that these systems still struggle with many of the core qualities needed for effective office automation, e.g., context awareness, reliability, consistency, and trust. While some agents show promise in structured environments like coding or CRM workflows, real-world office tasks often involve ambiguity, judgement, and unexpected challenges that most agents today simply can’t handle. This mismatch between marketing and capability is already fuelling disillusionment across the enterprise tech landscape.
For UK businesses, this could mean adopting a much more measured approach. For example, rather than rushing into large-scale AI rollouts, organisations may want to carefully assess whether agentic tools truly solve the problem at hand, and whether those benefits outweigh the risks and complexity. In industries where security, compliance, or client confidentiality are vital, agents that behave unpredictably or hallucinate outputs could introduce significant operational or reputational risk. Decision-makers will need to ask hard questions about vendor claims, demand transparency around performance, and avoid falling for superficial rebrands.
Also, for AI developers and solution providers, the pressure is now mounting to deliver genuine value and technical maturity. As Gartner’s forecast suggests, many agentic AI projects may be scrapped before they ever reach deployment. Rising costs, patchy results, and lack of clarity about return on investment are already stalling momentum. Yet amid this shakeout, there remains opportunity. Businesses still want tools that save time, reduce admin overhead, and support hybrid teams. If AI agents can evolve into reliable, well-integrated assistants that are grounded in workflows that make sense for users, they may yet become part of the fabric of enterprise software.
Until then, the safest path forward appears to be to treat agents as experimental copilots, not replacements. Hybrid approaches that combine AI capabilities with human oversight are likely to produce the most stable and trustworthy results. For now, it seems that the goal shouldn’t be full autonomy, but augmentation that helps people work smarter, and doesn’t automate them out of the loop.
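One common shape for that hybrid approach is an approval gate: the agent proposes a plan, low-risk actions run automatically, and anything risky is held for explicit human sign-off before execution. A minimal sketch, where the action categories and the approval callback are illustrative assumptions:

```python
RISKY_ACTIONS = {"send_email", "delete_record", "deploy_code"}  # assumed policy

def execute_with_oversight(actions, approve):
    """Run a list of (action, payload) pairs. Low-risk actions execute
    automatically; risky ones run only if `approve` returns True."""
    executed, held = [], []
    for action, payload in actions:
        if action in RISKY_ACTIONS and not approve(action, payload):
            held.append((action, payload))  # escalate to a human instead
            continue
        executed.append((action, payload))  # a real system would call the tool here
    return executed, held

# Example: nothing risky is auto-approved, so a human reviews the email
plan = [("draft_report", "Q3 summary"), ("send_email", "to: all-staff")]
done, pending = execute_with_oversight(plan, approve=lambda a, p: False)
```

The report gets drafted; the all-staff email waits for a person, which is exactly the "augmentation, not autonomy" posture described above.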