New secret math benchmark stumps AI models and PhDs alike



Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.”

A chart showing AI models’ limited success on the FrontierMath problems, taken from Epoch AI’s research paper. Credit: Epoch AI

To allow automated verification during testing, every FrontierMath problem must have an answer that can be checked through computation, either as an exact integer or as a well-defined mathematical object. The designers also made the problems “guessproof” by requiring large numerical answers or complex mathematical objects, leaving less than a 1 percent chance that a random guess is correct.
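The verification scheme described above can be sketched in a few lines. This is a hypothetical illustration of the general idea, not Epoch AI's actual grading code; the function name and the sample value are invented for the example.

```python
# Hypothetical sketch of "guessproof" automated answer checking: each problem
# stores one verifiable value (e.g. a large exact integer), and a submission
# is scored by exact comparison with no partial credit.

def check_answer(submitted: int, expected: int) -> bool:
    """Return True only on an exact match with the stored answer."""
    return submitted == expected

# A large numerical answer makes blind guessing hopeless: a random guess
# over any plausible range has far less than a 1 percent chance of hitting it.
expected = 367_387_201  # illustrative value, not a real FrontierMath answer

print(check_answer(367_387_201, expected))  # True
print(check_answer(367_387_200, expected))  # False
```

Exact comparison is what makes the benchmark machine-gradable at scale: no human judge is needed to decide whether a model's final answer is correct.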

Mathematician Evan Chen, writing on his blog, explained how he thinks FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. But for FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.

While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,’” Chen explained.

The organization plans regular evaluations of AI models against the benchmark while expanding its problem set. They say they will release additional sample problems in the coming months to help the research community test their systems.



