Research · arXiv (Princeton University)

Most AI Models Go Broke Running a Company, Princeton Finds

Princeton’s new CEO-Bench drops AI agents into the chief executive’s chair of a simulated startup with $1M and 500 days to grow it. Of the frontier models tested, only Claude Opus 4.8 and GPT-5.5 ended their best runs above the starting cash, and neither did so consistently. Most went bankrupt, exposing how far long-horizon agents still are from the autonomy vendors are selling.


Today's leading AI models can write a quarter's worth of code in an afternoon and pass the bar exam, but ask one to keep a company solvent for 500 days and most of them go broke. That is the blunt finding of CEO-Bench, a new long-horizon benchmark from researchers at Princeton University that puts an AI agent in the chief executive's chair and lets the clock run.

The benchmark is built around a simulated subscription-software startup called NovaMind. Each agent begins with $1 million in the bank and 500 simulated days to grow it, operating through a programmable interface that reaches into business databases, internal management tools and the company's social media. The environment is deliberately unforgiving: the market is partially observable and noisy, customer churn and competitor moves arrive with a lag, and one bad decision can quietly compound for weeks before it shows up in the numbers. Success demands exactly the skills that single-shot benchmarks never measure — reading messy, interconnected data, turning weak signals into strategy, and staying coherent across hundreds of linked decisions.

The results were sobering. Of the frontier systems put through the simulation, only Anthropic's Claude Opus 4.8 and OpenAI's GPT-5.5 managed to finish their best runs with more cash than they started with. Even those two could not do it reliably — neither consistently turned a profit across repeated attempts — and many other models burned through the opening balance and went bankrupt well before day 500. The authors, Haozhe Chen, Karthik Narasimhan and Zhuang Liu, frame the takeaway plainly: state-of-the-art language models still lack the durable, adaptive judgment that running a real business over time requires.

CEO-Bench lands in the middle of an industry sprint toward "agentic" AI, where the pitch is that models will not just answer questions but autonomously execute multi-step work for days at a stretch. Most popular evaluations, however, reward a single correct answer or a short tool-using episode. A 500-day simulation with delayed, coupled consequences is a far closer proxy for the autonomy vendors are actually selling — and the gap it exposes between benchmark headlines and sustained competence is the whole point.

None of this means AI cannot help run a company; it means handing one the keys, unsupervised, for a year and a half is not yet a winning strategy. For now the result reads as a useful corrective: the models that top reasoning leaderboards are not the same thing as models that can be trusted to compound good decisions over time. CEO-Bench gives the field a way to measure that difference — and, the authors hope, a target to start closing it.

