AI agents are supposed to handle your work while you sleep. The problem is they can barely handle it while you watch.
Close to three-quarters of companies plan to deploy agentic AI within two years, according to Deloitte's latest State of AI report. But only 25% have managed to move even 40% of their AI pilots into full production. The gap between what companies want agents to do and what agents can actually pull off has a surprisingly straightforward explanation: basic multiplication.
When you ask AI to do one thing, like summarize a document or draft an email, it works well. Single-step tasks succeed 90% of the time or better. But agents aren't built to do one thing. They're built to chain tasks together: read this spreadsheet, spot the trends, build a presentation, send it to your team.
Each step in that chain carries its own failure rate, and those rates compound. If every step works 90% of the time, a five-step workflow succeeds about 59% of the time. A ten-step workflow? Roughly 35%. The longer the chain, the more likely something breaks.
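The arithmetic is easy to check yourself. A minimal sketch, assuming each step succeeds independently with the same probability:

```python
# Compound success probability for a multi-step agent workflow,
# assuming every step succeeds independently at the same rate.

def chain_success(per_step: float, steps: int) -> float:
    """Probability that all steps in the chain succeed."""
    return per_step ** steps

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 90% each: {chain_success(0.90, n):.0%}")
# → 1 step: 90%, 5 steps: 59%, 10 steps: 35%, 20 steps: 12%
```

The independence assumption is generous to the agent: in practice an early mistake often makes later steps *more* likely to fail, so these are best-case numbers.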
This shows up clearly in practice. A developer using Anthropic's Claude Dispatch reported a roughly 50% success rate on anything beyond simple operations. Basic file tasks worked fine. The moment you asked for a real multi-step workflow, like analyzing a CSV, pulling out trends, and building a presentation around them, it was a coin flip. Benchmarks on real-world industrial tasks found similar results: no model broke 70% on single tasks, and when researchers added a planning step where the AI maps out what to do before doing it, success rates dropped from 65% to 38% for the best performer.
What makes this harder to fix is that raw accuracy isn't even the right thing to measure. Researcher Guy Freeman ran a comparison between a standard AI agent and a simpler system designed to understand the cost of being wrong:
- The standard agent got more answers right at 63.7% accuracy but scored negative points overall because it confidently charged through every question, including ones it should have skipped.
- The simpler system had lower raw accuracy at 59.6% but outscored the standard agent by 120 points because it knew when to stop.
Freeman then tried fixing the standard agent with better prompting, telling it to be more cost-conscious and careful. It scored even worse. "The word 'agent' in 'agentic AI' is doing an enormous amount of work whilst meaning almost nothing," he wrote. "What it actually describes is a language model with a for-loop."
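The mechanics of how lower accuracy can still win are worth seeing in numbers. This is an illustrative sketch, not Freeman's actual benchmark: the +1/-2 reward and penalty values and the question counts are assumptions, chosen only to make the trade-off visible.

```python
# Illustrative cost-aware scoring (hypothetical values, not Freeman's
# setup): +1 per correct answer, -2 per wrong answer, 0 for abstaining.

def score(correct: int, wrong: int, skipped: int,
          reward: float = 1.0, penalty: float = 2.0) -> float:
    """Net score when wrong answers cost more than right ones earn."""
    return correct * reward - wrong * penalty

# Agent A: answers all 1,000 questions at 63.7% accuracy.
a = score(correct=637, wrong=363, skipped=0)
# Agent B: skips the 200 questions it is least sure of, answers the
# other 800 at a higher hit rate, for 59.6% accuracy overall.
b = score(correct=596, wrong=204, skipped=200)
print(a, b)  # → -89.0 188.0: lower accuracy, much higher score
```

The point generalizes: once being wrong costs more than being right pays, knowing when to abstain matters more than squeezing out a few extra correct answers.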
None of this is slowing companies down. Deloitte found that 85% of organizations plan to customize agents for their specific needs, but only 21% have mature governance for how those agents should operate. And 37% are still using AI at a surface level with no real changes to their underlying workflows, essentially bolting agents onto processes that weren't designed for them.
Companies that do go all-in are seeing results. A Grant Thornton survey found that organizations with fully integrated AI were nearly four times as likely to report revenue growth compared to those still running pilots, 58% versus 15%. But those organizations are far from the norm. Most are stuck between deploying agents and getting them to actually deliver, and the compounding math of multi-step failure is the wall in between.

The agent pitch has always been built on demos where every step works perfectly in sequence. Real work is not a demo. The companies that get the most out of agents in the near term won't be the ones handing over ten-step workflows and hoping for the best. They'll be the ones who understand that 90% reliability per step means worse than coin-flip odds by step ten, and build their processes around that reality. Until the models get meaningfully better at sustained, autonomous execution, the smartest play is a shorter leash, not a longer one.
