LLMs’ “simulated reasoning” abilities are a “brittle mirage,” researchers find


The researchers used test cases that fall outside of the LLM training data in task type, format, and length. Credit: Zhao et al


These simplified models were then tested on a variety of tasks, some of which precisely or closely matched the function patterns in the training data and others of which required function compositions that were partially or fully “out of domain” for the training data. For instance, a model trained on data showing two cyclical shifts might be asked to perform a novel transformation involving two ROT shifts, having only been trained on what a single instance of either shift looks like. The final answers and reasoning steps were compared to the desired answer using BLEU scores and Levenshtein distance for an objective measure of their accuracy.
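To make the setup concrete, here is a minimal Python sketch, not the researchers’ actual code, of the kind of controlled task involved: simple string transformations (a cyclical shift and a ROT-style shift) composed into chains, with a model’s answer scored against the desired output by Levenshtein distance. All function names, parameters, and example strings are illustrative assumptions.

```python
# Illustrative sketch only; the paper's actual transformations and scoring
# pipeline may differ.

def cyclic_shift(text: str, k: int = 1) -> str:
    """Rotate the character positions of the string by k places."""
    k %= len(text)
    return text[-k:] + text[:-k]

def rot_shift(text: str, k: int = 13) -> str:
    """Shift each letter k places through the alphabet (ROT-style)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def apply_chain(text: str, chain) -> str:
    """Apply a list of transformation functions in order (a function composition)."""
    for fn in chain:
        text = fn(text)
    return text

def levenshtein(a: str, b: str) -> int:
    """Edit distance between a model's answer and the desired answer."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# In-domain composition: two cyclical shifts, as demonstrated in training.
target_in = apply_chain("reasoning", [cyclic_shift, cyclic_shift])

# Out-of-domain composition: two ROT shifts, never shown as a pair in training.
target_out = apply_chain("reasoning", [rot_shift, rot_shift])

# Score a hypothetical model answer against the desired output.
model_answer = "ernfbavat"  # e.g., the model applied only a single ROT-13 shift
print(levenshtein(model_answer, target_out))
```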

As the researchers hypothesized, these basic models started to fail catastrophically when asked to generalize to novel sets of transformations that were not directly demonstrated in the training data. While the models would often try to generalize new logical rules based on similar patterns in the training data, this quite often led to the model laying out “correct reasoning paths, yet incorrect answer[s].” In other cases, the LLM would stumble onto correct answers paired with “unfaithful reasoning paths” that didn’t follow logically.

“Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training,” the researchers write.



As requested tasks get further outside the training distribution (redder dots), the answers provided drift farther from the desired answer (lower right of the graph). Credit: Zhao et al


The researchers went on to test their controlled system using input text strings slightly shorter or longer than those found in the training data, or tasks that required function chains of different lengths than those it was trained on. In both cases, the accuracy of the results “deteriorates as the [length] discrepancy increases,” thus “indicating the failure of generalization” in the models. Small discrepancies in the format of the test tasks that the model had not seen before (e.g., the introduction of letters or symbols not found in the training data) also caused performance to “degrade sharply” and “affect[ed] the correctness” of the model’s responses, the researchers found.
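The sketch below, under the same illustrative assumptions as the earlier one, shows how such length and format probes could be generated: test cases with shorter or longer inputs, longer function chains, or symbols never seen in training. The specific lengths, chain sizes, and alphabets are hypothetical, not taken from the paper.

```python
# Illustrative generator for length- and format-discrepancy probes.
import random
import string

def rot_shift(text: str, k: int = 13) -> str:
    """Shift each lowercase letter k places through the alphabet; leave other characters as-is."""
    return "".join(
        chr((ord(c) - ord("a") + k) % 26 + ord("a")) if c.islower() else c
        for c in text
    )

def make_case(length: int, chain_len: int, alphabet: str = string.ascii_lowercase):
    """Build one probe: a random input string plus the desired output after chain_len ROT shifts."""
    text = "".join(random.choice(alphabet) for _ in range(length))
    answer = text
    for _ in range(chain_len):
        answer = rot_shift(answer)
    return text, answer

# Hypothetical training regime: inputs of length 8, chains of two transformations.
in_domain = make_case(length=8, chain_len=2)

# Length discrepancies: a slightly shorter input, or a longer function chain.
shorter_input = make_case(length=5, chain_len=2)
longer_chain = make_case(length=8, chain_len=4)

# Format discrepancy: symbols that never appeared in the training data.
unseen_symbols = make_case(length=8, chain_len=2,
                           alphabet=string.ascii_lowercase + "@#%")

print(in_domain, shorter_input, longer_chain, unseen_symbols)
```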


