Your AI search tool is mostly guessing

When you ask ChatGPT or Perplexity something and watch it pull up sources, it looks like real research. New research suggests it's doing something much simpler: answering from memory, then searching to back up what it already decided.

A group of researchers tested this with a benchmark called LiveBrowseComp, designed so every question is about something that happened in the last 90 days — meaning the answers can't be sitting in training data. The results were brutal.

On standard search benchmarks with every search tool turned off, frontier models still got around 39% of questions right on average. In other words, the "search" benchmarks the industry uses to rank these systems were largely measuring memory. When researchers moved models onto LiveBrowseComp, where memorization can't help, accuracy dropped 25 to 40 percentage points, and the leaderboard completely reshuffled.

A few details from the paper stand out:

Models search to confirm, not to learn: Over 60% of search queries were seeded by the model's own reasoning, not by anything it had actually retrieved. It's confirmation bias, automated.
They ignore good evidence: Even when search results contained the right answer, models used that evidence less than a third of the time.
Search can make things worse: When researchers stripped supporting evidence from results but kept the tools available, every model performed worse than if you'd just unplugged the internet entirely.

This lines up with what Berkeley professor Dan Klein told Axios: these systems aren't truth engines — they're plausibility engines. They're trained to sound right, not be right, and pointing them at a search bar doesn't fix that.

The leaderboards we use to rank AI search are basically grading the wrong test. They reward models that have memorized the most, not models that can actually find and trust new information, which is the whole reason anyone uses a search agent in the first place. Expect a quiet reshuffling once real-time benchmarks like LiveBrowseComp become standard, and expect a lot of the products marketed as "AI search" to look less impressive when graded on questions whose answers aren't already in their training data. The thing they're selling you isn't searching. It's a very confident guess with footnotes.


BIG TECH

Nvidia and Microsoft want to retire the click

For 40 years, using a PC has meant the same thing: open an app, click around, type something in. Nvidia and Microsoft are now selling a chip designed to make all of that optional.

On Monday, the two companies unveiled RTX Spark, a new Nvidia processor built to run AI directly on Windows laptops instead of in the cloud. The pitch: stop launching apps, start asking your computer to handle tasks for you, with the AI doing the work right on your machine.

The specs are aggressive for a laptop chip:

Up to 1 petaflop of AI performance in laptops as thin as 14mm and as light as 3 pounds
128GB of unified memory, enough to run a 120-billion-parameter model locally with a 1 million token context window — the kind of workload most people rent cloud servers to handle
14- to 16-inch OLED displays, all-day battery, putting these squarely in MacBook Air and Surface Laptop territory
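
The memory claim is easier to judge with some quick arithmetic. Here is a rough back-of-the-envelope sketch, not anything from Nvidia: the announcement as reported here doesn't say what numeric precision the 120-billion-parameter figure assumes, so the bit widths below are assumptions, and the function name is made up for illustration. It counts model weights only and ignores the memory a long context window and the operating system would also need.

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


UNIFIED_MEMORY_GB = 128   # figure quoted in the announcement
PARAMS_BILLIONS = 120     # figure quoted in the announcement

# Assumed precisions: 16-bit (typical training format), 8-bit and 4-bit (quantized).
for bits in (16, 8, 4):
    need = weights_gb(PARAMS_BILLIONS, bits)
    fits = need <= UNIFIED_MEMORY_GB
    print(f"{bits}-bit weights: {need:.0f} GB, fits in {UNIFIED_MEMORY_GB} GB: {fits}")
```

Under these assumptions, the weights alone would need about 240 GB at 16 bits, 120 GB at 8 bits, and 60 GB at 4 bits, so the claim only holds if the model is quantized, and a very long context would eat into whatever headroom is left.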

Intel and AMD stocks tumbled on the news. Nvidia is now openly muscling into the one major computing category where it's never had real share.

But the picture is more complicated than Nvidia is letting on. Apple's M5 Max, already shipping, has roughly twice the memory bandwidth — which matters a lot for the large language models Nvidia is showcasing. Qualcomm's Snapdragon X2 Elite laptops ship months earlier too. And Nvidia hasn't announced pricing at all, making it impossible to know whether these compete on cost or sit in premium territory.

Where Nvidia might have a real edge is the software story. As Nous Research CEO Dillon Rolnick put it, RTX Spark reframes the laptop: you're not buying a computer, you're buying a full-fledged AI assistant. That matters because it's the first major PC launch pitched primarily around running AI agents locally — not clock speed, not graphics. The industry started with cloud-based AI tools, moved to running swarms of agents in parallel, and is now betting the next stage runs on your desk, not in someone else's data center. RTX Spark is the hardware play for that future.

The "ask, don't click" pitch is a big swing, and it depends on something Nvidia and Microsoft can't control: whether the agents are actually good enough to trust with your work. Most people still launch apps because the apps work and the agents don't, at least not reliably. If RTX Spark ships in the fall and the agent experience still feels like a beta, this becomes a very expensive Intel competitor with a slogan attached. But if local agents catch up to the cloud versions by the time these laptops are on shelves, Nvidia will have quietly rewritten what a PC is for, and Intel and AMD will be the ones explaining themselves to investors.


GOVERNANCE

The tool that strips AI safety in 10 minutes

There are now thousands of AI models floating around the internet that will happily tell you how to build a bomb.

A technique called abliteration has been quietly taking off in the open-source AI world. It lets anyone download a model from Meta, Google, or Alibaba, run a free tool on it for about ten minutes, and end up with a version that no longer refuses dangerous requests. No fine-tuning. No expensive hardware. No expertise required.

The leading tool is called Heretic, and it's already produced over 3,500 stripped-down model variants collectively downloaded 13 million times. How well do they work? According to Alice, an AI security firm, a baseline Nvidia Nemotron model went from refusing 100% of dangerous prompts to complying with 96–100% of them. As Alice CEO Noam Schwartz put it: "The genie is out of the bottle."

The technique works by finding the internal "refusal direction" inside a model — the neural pathway that triggers a "no" — and surgically disabling it. Everything else stays intact. The model just loses the ability to decline.
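
The underlying math is ordinary vector projection, which a toy example can show. This is a conceptual sketch only, not Heretic's code and not something that runs on a real model: the three-number "activations" are invented, and estimating the direction as a difference of averages is an assumption about how such a direction might be found.

```python
import math


def mean(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def remove_direction(v, r):
    """Subtract from v its component along the unit vector r."""
    scale = dot(v, r)
    return [x - scale * y for x, y in zip(v, r)]


# Invented toy data: internal states when the model declined vs. when it answered.
refused = [[2.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
answered = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Assumed estimate of the "refusal direction": difference of the two averages.
direction = unit([a - b for a, b in zip(mean(refused), mean(answered))])

# After the edit, the state has no component left along that direction.
edited = remove_direction(refused[0], direction)
print(direction, edited, dot(edited, direction))
```

In this toy case only the first coordinate separates the two groups, so removing it leaves the other two untouched, which mirrors the claim that everything else about the model stays intact.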

The issue reached Washington in April, when researchers at a DHS-backed consortium demonstrated abliterated models for House lawmakers. What shook them wasn't the outputs — it was how easy the whole thing was, and how the model's friendly personality stayed perfectly intact while the safety vanished.

The platforms hosting all of this are stuck:

Hugging Face, where most modified models live, has more than 7,000 abliterated variants available for download
GitHub, where Heretic itself is hosted, told researchers it permits the code because of its "educational value"
Google called abliteration "a known technical challenge facing all open models." Meta declined to comment

Not everyone thinks this is a crisis. Heretic's creator, Philipp Emanuel Weidmann, argues the opposite — that letting only a handful of corporations control aligned AI is the real danger. "Unrestricted models being available to the powerful while not being available to anyone else will lock in power structure forever," he told NPR.

Whether you buy that argument or not, the practical reality is the same. There's no way to put the safety back on a model after someone has downloaded it. The big labs can keep adding guardrails to their closed models, but the moment a competitive open-weight model ships — and Meta, Google, and Alibaba keep shipping them — someone releases an abliterated version within days.

For a couple of years now, AI safety has been framed as a problem the big labs solve in training. Abliteration makes that framing kind of obsolete. A model can be perfectly aligned the moment it leaves Meta's servers and completely unaligned ten minutes after it lands on someone's laptop. So the next phase of this debate isn't going to be about whether labs are doing enough to align their open models. It's going to be about whether they should be releasing open-weight ones at all. That's a fight Meta has been picking for years, except now the other side has receipts.
