Most AI Models Go Broke Running a Company, Princeton Finds
Princeton's new CEO-Bench puts AI agents in the chief executive's chair of a simulated startup with $1 million and 500 days to grow it. Of the frontier models tested, only Claude Opus 4.8 and GPT-5.5 ended their best runs above the starting cash, and neither did so consistently. Most went bankrupt, exposing how far long-horizon agents still are from the autonomy that vendors are selling.
Today's leading AI models can write a quarter's worth of code in an afternoon and pass the bar exam, but ask one to keep a company solvent for 500 days and most of them go broke. That is the blunt finding of CEO-Bench, a new long-horizon benchmark from researchers at Princeton University that puts an AI agent in the chief executive's chair and lets the clock run.
The benchmark is built around a simulated subscription-software startup called NovaMind. Each agent begins with $1 million in the bank and 500 simulated days to grow it, operating through a programmable interface that reaches into business databases, internal management tools and the company's social media. The environment is deliberately unforgiving: the market is partially observable and noisy, customer churn and competitor moves arrive with a lag, and one bad decision can quietly compound for weeks before it shows up in the numbers. Success demands exactly the skills that single-shot benchmarks never measure — reading messy, interconnected data, turning weak signals into strategy, and staying coherent across hundreds of linked decisions.
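To make the lag-and-compounding dynamic concrete, here is a minimal toy sketch in Python. It is not CEO-Bench's environment or API. Every name and number in it (the `simulate` function, the $200 acquisition cost, the 0.3% daily churn, the $3,000 daily fixed cost, the 30-day lag) is an assumption invented for illustration, and the toy is deterministic where the real benchmark is described as noisy and partially observable.

```python
from collections import deque


def simulate(policy, days=500, start_cash=1_000_000.0, lag=30):
    """Toy company simulation (hypothetical; not CEO-Bench itself).

    `policy(day, cash)` returns how much to spend on marketing that day.
    Subscribers bought today only show up `lag` days later, so the payoff
    (or the damage) of a decision is invisible for weeks.
    """
    cash = start_cash
    subscribers = 1_000.0
    pipeline = deque([0.0] * lag)  # sign-ups already paid for but not yet arrived

    for day in range(days):
        # The agent cannot spend money it does not have.
        spend = max(0.0, min(policy(day, cash), cash))

        # Assumed $200 acquisition cost per subscriber, delivered after the lag.
        pipeline.append(spend / 200.0)
        subscribers += pipeline.popleft()

        # Assumed 0.3% of subscribers cancel each day.
        subscribers *= 1 - 0.003

        # Assumed $1/day revenue per subscriber and $3,000/day fixed costs.
        cash += subscribers * 1.0 - spend - 3_000.0

        if cash <= 0:
            return {"bankrupt_day": day, "final_cash": 0.0}

    return {"bankrupt_day": None, "final_cash": cash}


# Two naive strategies: never spend, or spend heavily every day.
idle = simulate(lambda day, cash: 0.0)
aggressive = simulate(lambda day, cash: 50_000.0)
print("idle:", idle)
print("aggressive:", aggressive)
```

In this toy, both naive strategies end in bankruptcy for different reasons: the idle company slowly bleeds subscribers until fixed costs win, while the aggressive one runs out of cash before its first purchased subscribers even arrive. The point is only to show why delayed feedback makes early mistakes hard to see, not to reproduce the benchmark's results.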
The results were sobering. Of the frontier systems put through the simulation, only Anthropic's Claude Opus 4.8 and OpenAI's GPT-5.5 managed to finish their best runs with more cash than they started with. Even those two did not turn a profit consistently across repeated attempts, and many other models burned through the opening balance and went bankrupt well before day 500. The authors, Haozhe Chen, Karthik Narasimhan and Zhuang Liu, frame the takeaway plainly: state-of-the-art language models still lack the durable, adaptive judgment that running a real business over time requires.
CEO-Bench lands in the middle of an industry sprint toward "agentic" AI, where the pitch is that models will not just answer questions but autonomously execute multi-step work for days at a stretch. Most popular evaluations, however, reward a single correct answer or a short tool-using episode. A 500-day simulation with delayed, coupled consequences is a far closer proxy for the autonomy vendors are actually selling — and the gap it exposes between benchmark headlines and sustained competence is the whole point.
None of this means AI cannot help run a company; it means handing one the keys, unsupervised, for 500 days is not yet a winning strategy. For now the result reads as a useful corrective: the models that top reasoning leaderboards are not necessarily models that can be trusted to compound good decisions over time. CEO-Bench gives the field a way to measure that difference and, the authors hope, a target to aim at in closing it.