Models·3 min read·MarkTechPost

Ornith-1.0: Open Coding Models That Self-Scaffold

DeepReinforce has open-sourced Ornith-1.0, an MIT-licensed family of coding models (9B to 397B) that learn to write their own agent scaffolds during reinforcement training. The 397B flagship resolves 82.4% of SWE-Bench Verified issues — second only to Claude Opus 4.8 — while the 9B model runs on a single 80GB GPU, putting frontier-class agentic coding within reach of self-hosters.

Ornith-1.0 cracks the open-weights coding race SWE-Bench Verified (% resolved) · MIT license · 9B / 31B / 35B / 397B 87.6 Opus 4.8 82.4 Ornith-1.0 397B 80.8 Opus 4.7 OPEN WEIGHTS BITSMINDS.COM
Share:

A research outfit called DeepReinforce has released Ornith-1.0, a family of open-weight coding models that does something most of its rivals do not: it learns to build its own tooling. Shipped under a permissive MIT license on Hugging Face, the lineup comes in four sizes — a 9B dense model, a 31B dense model, a 35B mixture-of-experts model that activates roughly 3B parameters per token, and a 397B MoE flagship — built atop Gemma 4 and Qwen 3.5 foundations.

The headline idea is “self-scaffolding.” Most agentic coding systems wrap a model in a fixed, hand-designed harness that decides how it plans, calls tools and inspects results. Ornith-1.0 instead treats that scaffold as something to be learned during reinforcement training. Each RL step runs in two stages: conditioned on the task and the scaffold it used most recently, the model first proposes a refined scaffold for that specific problem, then generates a solution using it. Rewards flow back to both stages, so effective orchestration strategies emerge on their own rather than being scripted by engineers.

The numbers suggest the approach works. On SWE-Bench Verified, the 397B flagship resolves 82.4% of issues, placing it second only to Claude Opus 4.8 at 87.6% among the systems DeepReinforce listed and ahead of Opus 4.7’s 80.8%. On Terminal-Bench 2.1 it scores 77.5, beating Opus 4.7 (70.3) while trailing Opus 4.8 (85) and Zhipu’s GLM-5.2-744B (81.0). The smaller models punch above their weight too: the 35B MoE hits 64.2 on Terminal-Bench 2.1, well past Qwen 3.5-397B’s 53.5 despite being a fraction of the size.

Letting a model rewrite its own scaffold raises the obvious risk of reward hacking, and DeepReinforce says it built three defenses against it: an immutable outer trust boundary that limits what the environment can touch, a deterministic monitor that flags prohibited actions, and a frozen LLM judge that acts as a verification veto. The goal is to let orchestration strategies evolve freely while keeping the model from gaming its own training signal.

For teams that want to run the models rather than just read about them, the weights are straightforward to serve through vLLM with OpenAI-compatible endpoints, and the 9B variant fits on a single 80GB GPU in bf16 at roughly 19GB. That accessibility is the real story for BitsMinds readers: an MIT-licensed model with no regional restrictions that lands within striking distance of a frontier proprietary system on real-world coding benchmarks, and that openly publishes the self-scaffolding technique behind it.

Want AI news before everyone else?

The morning's most important AI stories, straight to your inbox. No fluff.

Related Articles