Research·2 min read·MarkTechPost

OpenAI's LifeSciBench Puts AI Through a 750-Task Life-Science Exam — Top Model Passes Just 36%

OpenAI's new 750-task benchmark grades AI on real life-science research judgment. The best model, GPT-Rosalind, passed only 36% — and 22.8% of tasks stumped every model tested.

OpenAI on June 17 released LifeSciBench, a sprawling new benchmark that grades AI systems on the messy, judgment-heavy work that real life-science research demands, and the early scores suggest the most advanced models still have a long way to go. The benchmark comprises 750 expert-authored tasks, and even the strongest model tested passed only 36.1% of them.

Unlike most biology benchmarks, which lean on narrow, fact-based questions with clean answers, LifeSciBench is built around free-response problems written the way one scientist would brief a colleague. The 750 tasks span seven workflows — evidence handling, analysis, design and optimization, scientific reasoning, validation, translation, and scientific communication — across seven biological domains including genomics, medicinal chemistry, and clinical science. Roughly 79% of the tasks require multiple reasoning steps, averaging four steps each, and many come bundled with real artifacts: the set ships with 1,062 attached sequences, figures, tables, PDFs, and chemical structures.

Scoring is rubric-based rather than multiple-choice. The benchmark defines 19,020 individual criteria (about 25 per task), each rewarding a specific fact, reasoning step, or numeric answer. Results are summarized two ways: a normalized rubric score that awards partial credit, and a stricter task pass rate that counts only tasks whose rubric score clears 70%. To build it, OpenAI enlisted 173 Ph.D.-holding scientists to author tasks and 453 reviewers (97% with doctorates) to validate them, reaching 96% agreement on relevance and usefulness.
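To make the two summary numbers concrete, here is a minimal Python sketch of how a partial-credit score and a pass rate can diverge. It is an illustration under stated assumptions, not OpenAI's published scoring code: the `Criterion` class, the per-criterion `points` weights, the simple averaging across tasks, and the treatment of exactly 70% as a pass are all hypothetical. Only the 70% threshold itself comes from the article.

```python
from dataclasses import dataclass

# From the article: a task counts as passed only if it clears 70%.
# Treating exactly 70% as a pass is an assumption of this sketch.
PASS_THRESHOLD = 0.70


@dataclass
class Criterion:
    """One rubric item. `points` is a hypothetical weight, not a documented field."""
    points: float
    met: bool


def task_score(criteria):
    """Partial-credit score for one task: points earned divided by points available."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("a task needs at least one criterion with positive points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def summarize(tasks):
    """Return (mean normalized score, pass rate) over a list of tasks.

    Averaging per-task scores is an assumption; the article does not say
    how the benchmark aggregates across tasks.
    """
    scores = [task_score(criteria) for criteria in tasks]
    mean_score = sum(scores) / len(scores)
    pass_rate = sum(score >= PASS_THRESHOLD for score in scores) / len(scores)
    return mean_score, pass_rate


# Two toy tasks with four equally weighted criteria each.
task_a = [Criterion(1, True), Criterion(1, True), Criterion(1, True), Criterion(1, False)]
task_b = [Criterion(1, True), Criterion(1, False), Criterion(1, False), Criterion(1, False)]

mean_score, pass_rate = summarize([task_a, task_b])
print(f"normalized score: {mean_score:.3f}, pass rate: {pass_rate:.1%}")
```

In this toy case the first task meets three of four criteria and passes, while the second meets one of four and fails. A model can therefore collect a fair amount of partial credit while passing far fewer tasks outright, which is why the two numbers on a leaderboard like this one can sit far apart.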

On the leaderboard, OpenAI's research-focused GPT-Rosalind led with a 0.576 normalized score and a 36.1% pass rate, ahead of GPT-5.5 (0.519, 25.7%), Google's Gemini 3.1 Pro (0.515, 23.6%), GPT-5.4 (0.479, 20.7%), and xAI's Grok 4.3 (0.399, 13.0%). The cracks showed most clearly when artifacts were involved: GPT-Rosalind's accuracy fell from 45.1% on text-only tasks to 28.1% when it had to reason over an attached figure or dataset.

Perhaps the most telling number is that 22.8% of the tasks were failed by every model tested — a stark reminder of how much headroom remains between today's frontier systems and the reasoning a working scientist takes for granted. The release lands alongside a companion OpenAI demonstration in which a near-autonomous AI chemist using GPT-5.4 improved a challenging reaction in medicinal chemistry, signaling the lab's growing push to position its models as genuine research collaborators rather than just question-answering tools.
