Apple study exposes deep cracks in LLMs’ “reasoning” capabilities

This kind of variance—both within different GSM-Symbolic runs and compared to GSM8K results—is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”
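
To make that concrete, here is a minimal sketch (in Python, not taken from the paper) of the kind of variation GSM-Symbolic introduces: the proper names and numeric values of a GSM8K-style question are treated as template slots and resampled, while the arithmetic, and therefore the sequence of reasoning steps, is held fixed. The template, names, and value ranges below are invented for illustration.

import random

# Sketch of a GSM-Symbolic-style variant generator: resample the names and
# numbers in a fixed word-problem template. Every variant requires the same
# reasoning steps; only the surface details change. The template and ranges
# here are illustrative, not drawn from Apple's benchmark.
TEMPLATE = (
    "{name} buys {a} notebooks for {p} dollars each and a backpack for {b} dollars. "
    "How much does {name} spend in total?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one surface-level variant of the template plus its ground-truth answer."""
    name = rng.choice(["Sophie", "Omar", "Mia", "Ken"])
    a = rng.randint(2, 9)      # number of notebooks
    p = rng.randint(3, 12)     # price per notebook
    b = rng.randint(10, 40)    # price of the backpack
    answer = a * p + b         # identical reasoning steps for every variant
    return TEMPLATE.format(name=name, a=a, p=p, b=b), answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)

If a model were doing formal reasoning, its accuracy should be essentially the same across such variants; the run-to-run variance the researchers measured is what points them toward pattern matching instead.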

Don’t get distracted

Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s GPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).



An example showing how some models get misled by irrelevant information added to the GSM8K benchmark suite. Credit: Apple Research


The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
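
A rough sketch of what that modification looks like in practice follows; the question text and distractor are illustrative, patterned on the kiwi example the article quotes rather than copied from the benchmark.

# GSM-NoOp-style modification: append a "seemingly relevant but ultimately
# inconsequential" clause to a word problem. The distractor does not change
# the correct answer; it only tests whether the model ignores it.
BASE_QUESTION = (
    "Oliver picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday he "
    "picks double the number he picked on Friday. How many kiwis does Oliver have?"
)
NO_OP_CLAUSE = "five of them were a bit smaller than average"

def add_no_op(question: str, clause: str) -> str:
    """Insert the irrelevant clause just before the final question sentence."""
    stem, final_question = question.rsplit(". ", 1)
    return f"{stem}, but {clause}. {final_question}"

print(add_no_op(BASE_QUESTION, NO_OP_CLAUSE))
# The correct answer is still 44 + 58 + 2 * 44 = 190; only the prompt changed.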

Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.



Introducing irrelevant information to the prompts often led to “catastrophic” failure for most “reasoning” LLMs. Credit: Apple Research


In the example with the smaller kiwis, for instance, most models try to subtract the smaller fruits from the final total because, the researchers surmise, “their training datasets included similar examples that required conversion to subtraction operations.” This is the kind of “critical flaw” that the researchers say “suggests deeper issues in [the models’] reasoning processes” that can’t be helped with fine-tuning or other refinements.
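
In arithmetic terms, the mistake looks something like this (the counts are the same illustrative ones used above, not figures from the paper):

# Correct handling vs. the pattern-matched error the researchers describe.
friday, saturday = 44, 58
sunday = 2 * friday                  # "double the number he picked on Friday"
smaller_kiwis = 5                    # the irrelevant GSM-NoOp detail

correct_total = friday + saturday + sunday                    # 190: size is irrelevant
mistaken_total = friday + saturday + sunday - smaller_kiwis   # 185: "smaller" treated
                                                              # as a cue to subtract
print(correct_total, mistaken_total)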


