Google DeepMind's AI Co-Mathematician Cracks a 60-Year-Old Group Theory Problem
Research·2 min read·Google DeepMind

DeepMind's new hierarchical research agent has resolved Problem 21.10 from the Kourovka Notebook, a group theory question that had stood unsolved since 1965, and topped FrontierMath Tier 4 at 48% accuracy — but the team is also publicly flagging a 'reviewer-pleasing bias' that nearly let a flawed proof slip through.

Google DeepMind on Tuesday published a paper describing AI Co-Mathematician, a hierarchical research agent built on Gemini 3.1 Pro that the company says has resolved Problem 21.10 from the Kourovka Notebook — a group theory question that had remained open since it was first posed in 1965. The proof was developed in collaboration with Oxford mathematician Marc Lackenby, who reviewed the system's work and provided the corrections needed to close the final gap. The paper is posted as arXiv:2605.06651.

On FrontierMath Tier 4, the hardest tier of a benchmark whose problems typically take professional mathematicians weeks to solve, AI Co-Mathematician scored 48%, well ahead of the 19% achieved by the base Gemini 3.1 Pro and the 39.6% reported for GPT-5.5 Pro. The lift, DeepMind argues, does not come from a larger underlying model so much as from a stacked architecture: a proposer agent generates candidate proofs, a reviewer agent critiques them in detail, and a separate planner reroutes the search when a chain of reasoning is judged unsound. The system also keeps explicit lemma libraries that persist across runs so partial progress is not thrown away between sessions.

The most striking section of the paper is not the headline result but an unusually candid failure analysis. The team reports what they call 'reviewer-pleasing bias' — a tendency for review agents to ratify proofs that match the reviewer's expectations of how a solution should look, even when those proofs contain subtle gaps. The Kourovka 21.10 proof itself contained one such flaw in its first internal draft; the system's reviewer agent did not catch it, and only Lackenby's external review surfaced the problem. DeepMind frames this as a central limitation for any agentic system claiming superhuman reasoning: a reviewer that learned its standards from human-written proofs inherits the same blind spots.

The implications run well beyond pure mathematics. AI Co-Mathematician is the clearest demonstration to date that hierarchical agents — with explicit roles for generation, critique and planning — can outperform a single large model on tasks that demand multi-week, multi-step expert reasoning. That blueprint is already being copied by labs building agents for theorem-proving codebases, drug design and theoretical physics, and the reviewer-bias finding is likely to land as a cautionary note in each of those domains long before the Kourovka proof itself is formally verified.
