A team at Sherwin-Williams had a perfectly good AI system. Then Claude got smarter, and that's exactly what broke it.

The system did one thing well. Analysts and account managers could type a question in plain English, something like "pull sales volume for the Northeast from January through March," and it would turn that into an API call, fetch the data, and deliver a formatted report. By mid-2025, it was generating several hundred reports a month that went to leadership, external stakeholders, and just about every team pulling ad-hoc data.

Sarat Mahavratayajula, a senior software engineer at Sherwin-Williams, and Vijay Sagar Gullapalli, a founding AI engineer at Adopt AI, built the system on Claude 3.5 Sonnet in early 2025. They upgraded to 3.7 without a problem. Then to 4.0, also fine. "Model upgrades had become routine, like bumping a minor version of a well-behaved library," they wrote in a VentureBeat post this week.

Then they rolled out Sonnet 4.5.

For a meaningful chunk of requests, Claude started putting data where it didn't belong. The system expected API parameters in a specific field. Claude decided to fold them into a different one. So instead of pulling Northeast sales for Q1, the system would pull sales for every region for all time, or just error out.

Claude 4.5 also started asking clarifying questions, something earlier versions never did. If a request was ambiguous, 4.0 would take its best guess and return a structured response. The newer version, trying to be more helpful, would sometimes respond with "did you mean X or Y?" The system had no path for that. It expected a data object every single time, not a conversation.
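Both failures come down to the pipeline trusting the shape of the model's reply. A minimal sketch of a stricter boundary might look like the following. It assumes the model is asked to reply with a JSON object, and the field names (`endpoint`, `params`, `region`, `start_date`, `end_date`) are hypothetical, not the team's actual schema. The idea is to fail closed when parameters are missing or misplaced, and to route a prose reply to a person instead of crashing.

```python
import json
from dataclasses import dataclass

# Hypothetical contract: these field names are assumptions for illustration.
REQUIRED_KEYS = {"endpoint", "params"}
REQUIRED_PARAMS = {"region", "start_date", "end_date"}


@dataclass
class ApiCall:
    endpoint: str
    params: dict


@dataclass
class NeedsClarification:
    text: str


def parse_model_reply(reply: str):
    """Return an ApiCall, a NeedsClarification, or raise ValueError."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # Plain prose, such as "did you mean X or Y?": hand it to a human.
        return NeedsClarification(reply.strip())
    if not isinstance(data, dict):
        return NeedsClarification(reply.strip())
    if set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected top-level keys: {sorted(data)}")
    params = data["params"]
    if not isinstance(params, dict):
        raise ValueError("params must be an object")
    missing = REQUIRED_PARAMS - set(params)
    if missing:
        # Fail closed rather than run an unfiltered, all-regions query.
        raise ValueError(f"missing params: {sorted(missing)}")
    return ApiCall(data["endpoint"], params)
```

With a check like this, a reply that folds the filters into the wrong field raises an error before any query runs, and a clarifying question becomes a handled case instead of a crash.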

The team rolled back to 4.0, but that created its own problem. Between deployments, they'd added new API integrations, all tested against 4.5. "Reverting the model meant requalifying every one of them under time pressure," they wrote.

The model wasn't broken. By most measures, Claude 4.5 was an improvement. More thoughtful, more context-aware, more cautious with ambiguous inputs. Great if you're chatting with it at your desk. Not great if you've built an automated pipeline that depends on the model behaving exactly the way it did last month.

"The bug was not in the model," they wrote. "The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe."

Sherwin-Williams isn't alone in discovering what happens when AI infrastructure shifts under you. Anthropic itself has published postmortems showing how a small routing error in its own systems cascaded to affect a much larger share of requests within hours. A separate April incident report documented quality regressions in Claude Code where output accuracy dropped even though the underlying model weights hadn't changed. One unnamed company accidentally spent $500 million on Claude in a single month after nobody set usage limits on employee licenses.

Anthropic has released a major model update roughly every three to five months since early 2025. Each one scores better on benchmarks. Each one also behaves differently in ways no changelog will fully capture, because the input space is natural language and the set of possible failure modes is effectively unbounded.

Plenty of companies are having the opposite experience. Notion's AI lead said the latest Opus model was a meaningful improvement, using fewer tokens and making a third fewer tool errors. A distinguished engineer at Vercel said it "even does proofs on systems code before starting work, which is new behavior we haven't seen from earlier Claude models."

New behavior. Whether that's a feature or a production incident depends entirely on whether you built your system to expect it.

Into the Valley

The three successful upgrades were more dangerous than the one that failed, because they taught the team not to worry about upgrades. Every company running production workloads on a foundation model is accumulating that same false confidence right now. Anthropic, OpenAI, and Google are all shipping better models every few months, and "better" and "the same" are two very different things. If you don't have your own regression tests for every AI-dependent workflow, you're not testing your product. You're trusting someone else's benchmarks to catch your edge cases. They won't.
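What might "your own regression tests" look like in practice? One rough sketch, under stated assumptions: keep a set of golden prompts with the exact structured output you expect, and replay them against any candidate model before switching. Everything here is hypothetical, including the case data, the `call_model` callable (a stand-in for whatever client wraps the real model), and the two stub models used to demonstrate it.

```python
import json
from typing import Callable

# Hypothetical golden cases: a prompt plus the exact structured reply expected.
GOLDEN_CASES = [
    {
        "prompt": "pull sales volume for the Northeast from January through March",
        "expected": {
            "endpoint": "/sales",
            "params": {
                "region": "Northeast",
                "start_date": "2025-01-01",
                "end_date": "2025-03-31",
            },
        },
    },
]


def run_regression(call_model: Callable[[str], str], cases=GOLDEN_CASES) -> list[str]:
    """Replay every golden prompt and return a description of each failure."""
    failures = []
    for case in cases:
        reply = call_model(case["prompt"])
        try:
            actual = json.loads(reply)
        except json.JSONDecodeError:
            failures.append(f"{case['prompt']!r}: reply was not JSON")
            continue
        if actual != case["expected"]:
            failures.append(f"{case['prompt']!r}: got {actual!r}")
    return failures


# Stub models standing in for two versions; a real harness would call the API.
def old_model(prompt: str) -> str:
    return json.dumps(GOLDEN_CASES[0]["expected"])


def new_model(prompt: str) -> str:
    return "Did you mean calendar Q1 or fiscal Q1?"
```

Here `run_regression(old_model)` returns an empty list, while `run_regression(new_model)` reports one failure because the reply is prose, not JSON. Exact-match comparison is a deliberately strict choice: it flags any drift in structure, which is the kind of change a benchmark score will not surface.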