When you ask ChatGPT or Perplexity something and watch it pull up sources, it looks like the bot is actually scouring the web. New research suggests it's usually doing something much simpler: answering from memory, then searching to back up what it already decided.

A group of researchers tested this with a benchmark called LiveBrowseComp, which only asks questions about things that happened in the last 90 days. The idea is to force the models to actually go look stuff up, since the answers wouldn't be sitting in their training data.
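The 90-day rule is easy to picture as a date filter over a pool of questions. The sketch below is only an illustration of that idea, not the researchers' actual pipeline: the `Question` record, its `event_date` field, and the `is_fresh` and `select_fresh` helpers are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Window taken from the article: only events from the last 90 days qualify.
FRESHNESS_WINDOW = timedelta(days=90)


@dataclass
class Question:
    text: str
    event_date: date  # hypothetical field: when the underlying event happened


def is_fresh(question: Question, today: date) -> bool:
    """Return True if the event falls inside the 90-day window ending today."""
    age = today - question.event_date
    # Reject future-dated events as well as events older than the window.
    return timedelta(0) <= age <= FRESHNESS_WINDOW


def select_fresh(questions: list[Question], today: date) -> list[Question]:
    """Keep only the questions recent enough to pass the freshness filter."""
    return [q for q in questions if is_fresh(q, today)]
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the filter reproducible when the same pool is re-scored later.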

The results were unflattering for the whole industry.

When the researchers ran the standard search benchmarks but turned off every search tool, frontier models still got around 39% of questions right on average. MiniMax M2.5 answered 44.5% of BrowseComp questions with no internet access at all. The "search" benchmarks the industry uses to rank these systems were mostly measuring memory.

It gets worse when you actually make them search. On LiveBrowseComp, where the answers don't exist in training data, every model the researchers tested fell below 2% accuracy without search. With search turned on, scores dropped 25 to 40 points compared to the older benchmarks, and the leaderboard completely reshuffled. GLM 5.1, which led BrowseComp at 68%, fell to 33.9%. Models that looked mediocre on static tests jumped ahead.
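For a sense of scale, here is the arithmetic on the one pair of scores quoted above. It uses only the two GLM 5.1 figures from the text and assumes both are percentages on the same scale; `point_drop` is a hypothetical helper name.

```python
def point_drop(static_score: float, live_score: float) -> float:
    """Percentage-point drop from the static benchmark to the live one."""
    return round(static_score - live_score, 1)


# GLM 5.1 figures quoted in the article.
browsecomp = 68.0
live_browsecomp = 33.9

drop = point_drop(browsecomp, live_browsecomp)
relative = round(drop / browsecomp * 100, 1)

print(f"Drop: {drop} points ({relative}% of the original score)")
# prints "Drop: 34.1 points (50.1% of the original score)"
```

That 34.1-point fall sits inside the 25-to-40-point range reported for the field, and it means the former leader lost about half its score.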

A few details from the paper stand out:

  • Models search to confirm, not to learn: More than 60% of search queries in later browsing rounds were seeded by ideas the model wrote in its own reasoning, not by anything it had actually retrieved. It's confirmation bias, automated.
  • They ignore good evidence anyway: Even when search results contained the right answer, models used that evidence less than a third of the time across every system tested.
  • Search can make things worse: When researchers stripped the supporting evidence out of search results but kept the tools available, every model performed worse than if you had just unplugged the internet entirely. They couldn't tell good sources from bad ones, so they confidently retrieved garbage.
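
How the paper attributes each query to its source isn't described here, so the following is only a toy heuristic for the first point above, not the researchers' measurement. It assumes a hypothetical trace format (`Round`) that records what the model wrote before searching, the text retrieved so far, and the query it issued. It calls a query self-seeded when the query contains a term that appears in the model's reasoning but nowhere in the retrieved text.

```python
import re
from dataclasses import dataclass


@dataclass
class Round:
    reasoning: str  # hypothetical field: what the model wrote before searching
    retrieved: str  # hypothetical field: text returned by earlier searches
    query: str      # the search query issued in this round


def terms(text: str) -> set[str]:
    """Lowercased word tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def is_self_seeded(r: Round) -> bool:
    """True if the query uses a term found in the reasoning but not in retrieved text."""
    not_from_results = terms(r.query) - terms(r.retrieved)
    return bool(not_from_results & terms(r.reasoning))


def self_seeded_share(rounds: list[Round]) -> float:
    """Fraction of rounds whose query was seeded by the model's own reasoning."""
    if not rounds:
        return 0.0
    return sum(is_self_seeded(r) for r in rounds) / len(rounds)
```

A keyword overlap like this is crude (it misses paraphrases and counts coincidental matches), but it shows the shape of the question: did the query come from what the model read, or from what it already believed?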

This lines up with a broader credibility problem that researchers have been flagging for a while. Dan Klein, a Berkeley professor and CTO of Scaled Cognition, told Axios that "these systems, they're not truth engines. They're plausibility engines." They're trained to sound right, not to be right, and pointing them at a search bar doesn't fix the underlying incentive.

It also connects to something we covered last month, when Wharton researchers found that users accept AI answers about 80% of the time even when those answers are wrong. If the model is mostly guessing from memory and the user mostly trusts whatever it says, the search citations underneath are more like decoration than verification.

Into the Valley

The leaderboards we use to rank AI search are basically grading the wrong test. They reward models that have memorized the most, not models that can actually find and trust new information, which is the whole reason anyone uses a search agent in the first place. Expect a quiet reshuffling once real-time benchmarks like LiveBrowseComp become standard, and expect a lot of the products marketed as "AI search" to look less impressive when graded on questions whose answers aren't already in their training data. The thing they're selling you isn't searching. It's a very confident guess with footnotes.