Jun 27, 2026

Ornith-1.0 for Local AI in 2026: Which GPU Runs DeepReinforce's MIT-Licensed Coding Model?

By RunAIHome Team · 11 min read

ornithdeepreinforcelocal-llmcodinggpumoe

TL;DR: Ornith-1.0 is DeepReinforce’s new MIT-licensed coding family — 9B Dense, 31B Dense, 35B MoE, and 397B MoE, post-trained on Gemma 4 and Qwen 3.5. The home-lab pick is the 35B MoE: ~3B active parameters per token make it fast, and the Q4_K_M GGUF is 21.2 GB, so it just fits a single 24 GB card. The catch: 21.2 GB on a 24 GB GPU leaves almost no room for long context.

	9B Dense	35B MoE	397B MoE
Best for	8–12 GB cards	The 24 GB sweet spot	Cloud / API only
Q4 size	~6 GB	21.2 GB (Q4_K_M)	~225 GB+
Active params	9B (dense)	~3B per token	~? per token
Runs on a single consumer GPU?	Yes, easily	Yes, on 24 GB	No
The catch	Weakest of the family	No headroom for 256K context	No card holds it

Honest take: If you have a 24 GB card, grab the 35B MoE Q4_K_M — it’s the rare model that gives you MoE speed and a license you can actually ship a product on. If you’re on 8–16 GB, run the 9B and keep your expectations modest. The 397B is an API model; don’t try to buy hardware for it.

The local-AI release calendar has been relentless this month, but Ornith-1.0 is worth stopping for — not because it’s the biggest, but because it lands the two things home-labbers actually ask for: a permissive license and a variant that runs fast on a card you already own.

What Ornith-1.0 actually is

DeepReinforce released the Ornith-1.0 family on June 25, 2026, under the MIT license with no regional restrictions — every checkpoint, including the GGUF and FP8 builds, ships under that license on Hugging Face. That alone separates it from a lot of “open weight” releases that bolt on usage clauses or research-only terms.

The family spans four checkpoints, all post-trained on Gemma 4 and Qwen 3.5 bases:

Ornith-1.0-9B — dense, edge/resource-constrained target
Ornith-1.0-31B — dense
Ornith-1.0-35B — sparse Mixture-of-Experts
Ornith-1.0-397B — flagship MoE

The headline feature is the training method, not the size. Ornith is a self-scaffolding model: during reinforcement learning it learns to write its own harness — the tool-use loop, the test scaffold — and jointly optimizes that scaffold alongside the code it produces. It’s also reasoning-first: each assistant turn opens with a chain-of-thought block, and the serving stack returns that reasoning in a separate field from the final answer. For a coding agent, that’s the right shape: it plans, then acts.

If you’ve read our piece on why local LLMs got good in 2026, this is the same story playing out — sparse activation plus better post-training, not raw parameter count, is what’s closing the gap.

The benchmarks (and where to be skeptical)

The vendor numbers are strong, and a few are striking enough to flag as vendor-reported until independent runs land:

Ornith-1.0-397B: 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1. DeepReinforce positions this above Claude Opus 4.7; for context, Claude Opus 4.8 sits at 87.6 on SWE-Bench Verified, so the 397B trails only the very top of the closed-model field on that test.
Ornith-1.0-35B MoE: 64.2 on Terminal-Bench 2.1 — above Qwen 3.5-397B’s 53.5, a model with more than ten times the total parameter count. If that holds up under independent testing, it’s the most interesting result in the release.
Ornith-1.0-9B Dense: 43.1 on Terminal-Bench 2.1, essentially matching Gemma 4-31B’s 42.1.

A 35B MoE beating a 397B dense model on an agentic benchmark is exactly the kind of claim that needs third-party confirmation — vendor benchmark suites tend to flatter the home team. Treat these as a reason to try the model, not as settled fact. We’ve taken the same cautious line on every fresh-drop coding model, from Kimi K2.7 to Qwen3-Coder-Next.

Which GPU runs which variant

This is the part that matters for your wallet. Here’s how each variant maps to real hardware.

Ornith-1.0-9B — for 8 GB to 16 GB cards

The 9B dense weights are about 6 GB at Q4 quantization and roughly 19 GB in BF16. At Q4_K_M it runs comfortably on 6–8 GB of VRAM, which means it’s the variant for an RTX 3060 12GB, an RTX 4060 Ti 8GB or 16GB, or even an older 8 GB card. With a 16 GB card you can move up to Q6_K or Q8_0 and still leave plenty of room for context.

Being a dense 9B, it’s bandwidth-bound — generation speed scales with your card’s memory bandwidth, not its FLOPS. On a modern 16 GB card you’ll get interactive speeds, but understand the trade-off: this is the weakest member of the family. It’s a capable local autocomplete and small-task assistant, not a replacement for a frontier agent.

Ornith-1.0-35B MoE — the 24 GB sweet spot

This is the one to care about. The 35B MoE is a sparse model with 256 routed experts, 8 active per token plus a shared expert, across 40 layers, activating roughly 3B parameters per token. That architecture is the whole point: all 35B weights have to sit in VRAM, but only ~3B are read per token, so it generates far faster than a dense model of the same footprint.

The official GGUF sizes:

Quant	Size	Fits
Q4_K_M	21.2 GB	24 GB card (tight)
Q5_K_M	24.7 GB	32 GB card
Q6_K	28.5 GB	32 GB card
Q8_0	36.9 GB	48 GB+ / multi-GPU

The practical read: Q4_K_M at 21.2 GB fits a single 24 GB GPU, but barely. That leaves under 3 GB for the KV cache and runtime overhead. You’ll run it fine at 8K–16K context; the model’s full 256K context window is a cloud-serving figure, not something you’ll reach on 24 GB. If you want real context headroom, you want a 32 GB RTX 5090 and the Q4_K_M, or you accept short context on 24 GB.

What about speed? We don’t have independent tokens/sec measurements for Ornith yet, so we won’t invent one. But we do have a measured reference point on this exact site: Nemotron-Cascade 2 is a ~3B-active MoE (30B-A3B) that hits 187 tok/s on a used RTX 3090 at a comparable quant. Ornith-1.0-35B has near-identical active-parameter math, so expect it to land in the same neighborhood on the same hardware — fast enough for genuinely interactive agentic coding. We’ll update this with real numbers once community benchmarks are out. For more on how 3B-active MoE compares to dense models at the same VRAM, see our Qwen 3.6 35B-A3B guide.

The best-value card for this remains the used RTX 3090: 24 GB, 936 GB/s of bandwidth, and a used average around $1,070 as of June 2026. The RTX 4090 is faster but costs roughly twice as much used for the same 24 GB ceiling.

Ornith-1.0-31B Dense — capable, but awkward

The 31B dense variant sits in an odd spot. As a dense 31B it needs a similar VRAM footprint to the 35B MoE at the same quant (call it ~18–20 GB at Q4), but because it’s dense it activates all 31B parameters per token — so it’ll be meaningfully slower than the 35B MoE while taking up about the same space. Unless you have a specific reason to prefer a dense model’s behavior, the 35B MoE is the better pick on identical hardware. This is the same dense-vs-MoE trade we walked through in the Codestral 2 guide.

Ornith-1.0-397B MoE — rent it, don’t buy for it

The flagship is not a home-lab model. The FP8 checkpoint alone is on the order of 225 GB+, well beyond any single consumer GPU and beyond most multi-GPU home builds. If you want to use the 397B, the sane paths are the hosted API or renting a multi-GPU instance by the hour. A cloud GPU host like RunPod lets you spin up an 8×80GB node to evaluate the flagship for a few dollars rather than spending five figures on hardware you’ll use occasionally — the same rent-vs-buy logic we lay out in RunPod vs Local GPU.

How to run it locally

The 9B and 35B GGUF builds work with the usual local stack — llama.cpp, and Ollama/LM Studio once they pick up the model. Two practical notes:

Pick the largest quant that leaves context room. For the 35B on 24 GB, that’s Q4_K_M and a modest context window. Don’t reach for Q6_K on a 24 GB card — it won’t fit with any usable cache.
Expect a reasoning block. Because Ornith is reasoning-first, raw output includes a chain-of-thought section before the answer. If you’re wiring it into an agent or IDE, make sure your client handles the separate reasoning field — otherwise you’ll see the model “thinking out loud” in your final output.

If you hit a wall loading the 35B on a 24 GB card, it’s almost always the context length pushing you over the VRAM limit. Our CUDA out of memory fix guide covers the exact knobs (context size, KV-cache quantization, flash attention) to claw back the headroom.

For pairing Ornith with an actual coding workflow — editor integration, agent loops — our sister site aicoderscope.com covers the tooling side, and aifoss.dev tracks the open-source self-hosting stack.

Honest take

Ornith-1.0 is a genuinely good release for the local crowd, for two unglamorous reasons. First, the MIT license means you can build and ship commercial work on it without a lawyer — that’s rarer than it should be. Second, the 35B MoE actually fits a card people own: 21.2 GB at Q4_K_M on a 24 GB GPU, with ~3B active params keeping it fast.

The vendor benchmarks are loud — a 35B beating a 397B, a 397B near Opus territory — and you should wait for independent runs before believing the rankings. But you don’t have to believe the leaderboard to get value here: download the 35B Q4_K_M, point it at your codebase, and judge it on your own tasks. If it’s even close to the claims, it’s the best MIT-licensed model your 24 GB card has run all year.

If you’re shopping for the hardware to run it, start with our used RTX 3090 value analysis and the broader open-source LLM shootout.

FAQ

Can I run Ornith-1.0-35B on a 16 GB GPU? Not at a useful quant. The smallest 35B GGUF (Q4_K_M) is 21.2 GB, which exceeds 16 GB. On a 16 GB card, run the 9B variant instead — it’s about 6 GB at Q4 and leaves ample room for context.

Is Ornith-1.0 really MIT licensed? Yes. DeepReinforce released every checkpoint — full precision, FP8, and GGUF — under the MIT license with no regional restrictions, as of the June 25, 2026 launch. That permits commercial use without the research-only or attribution clauses some other open-weight models attach.

Why is the 35B MoE faster than the 9B dense model? The 35B is a Mixture-of-Experts that activates only ~3B parameters per token, while the 9B dense activates all 9B. Generation speed tracks active parameters, so the 35B can actually generate faster than the 9B — while needing far more VRAM to hold all its weights.

What card should I buy for the 35B? A 24 GB card is the floor. A used RTX 3090 (~$1,070, 936 GB/s) is the best value; an RTX 4090 is faster but roughly double the price for the same 24 GB ceiling. For real context headroom, a 32 GB RTX 5090 lets you run Q4_K_M with room to spare.

Can any home setup run the 397B? Realistically, no. The FP8 build is 225 GB+ and needs multi-GPU datacenter-class hardware. Use the hosted API or rent a multi-GPU cloud instance to evaluate it.

Sources

Last updated June 27, 2026. Benchmarks are vendor-reported pending independent verification; prices and specs change — verify current rates before purchasing.

Recommended Gear

RTX 3090 — 24 GB used value king for the 35B MoE
RTX 4090 — faster 24 GB option
RTX 5090 — 32 GB for context headroom
RTX 3060 / RTX 4060 Ti — budget cards for the 9B variant

Was this article helpful?