Codestral 2 for Local AI in 2026: Apache 2.0, 22B Params, 256K Context — Which GPU Runs It Best

codestralmistrallocal-llmcodinggpuollama

TL;DR: Codestral 2 is Mistral’s 22B dense coding model, now Apache 2.0 — fully commercial-use legal as of April 2026. The Q4_K_M GGUF is 13.3 GB, so it fits a 16 GB card with room for short context and runs comfortably on a 24 GB 3090. The catch: it’s a dense 22B, so it’s bandwidth-bound and slower than the MoE models everyone’s switched to.

RTX 4060 Ti 16GBUsed RTX 3090 24GBRTX 4090 24GB
Best forQ4_K_M, tight budgetThe sweet spotSpeed + long context
Price (Jun 2026)~$430 new~$1,070 used avg~$2,000+ used
Memory bandwidth288 GB/s936 GB/s1,008 GB/s
Codestral 2 Q4_K_M speed~18–22 tok/s~40–50 tok/s~60–75 tok/s
The catchBandwidth-starvedBest $/tok, runs hotOverkill for one model

Honest take: If you want Codestral 2 specifically and you’re buying, a used RTX 3090 is the obvious pick — it has the bandwidth to make a dense 22B usable and the headroom to push context past the point a 16 GB card chokes. But before you commit, ask whether you actually need this model or just a good local coding model, because the MoE options are faster.

What changed: the license, not the weights

Codestral’s original 22B release in 2024 shipped under the Mistral Non-Production License — you could play with it, but you could not legally use it inside a commercial product or paid service. That single clause kept it off most real dev stacks.

In April 2026, Mistral relicensed Codestral 2 under Apache 2.0. That removes the non-production restriction entirely: you can run it inside a paid product, ship it in a closed-source tool, fine-tune it and sell the result, no permission needed. For a coding model that’s the whole ballgame — it’s the biggest open-source coding license unlock since Llama 2 went commercial.

The model itself is a 22B dense transformer with a 256K context window — the largest context of any dedicated open coding model — fill-in-the-middle (FIM) support for IDE autocomplete, and coverage of 80+ programming languages. Mistral reports 86.6% on HumanEval. That’s a strong single-file completion score, though HumanEval is a saturated benchmark in 2026 and shouldn’t be read as a ranking against the latest agentic coders.

The number that decides everything: 13.3 GB

The practical question isn’t “how good is it” — it’s “does it fit, and how fast.” Codestral 2 is a dense 22B, which means every token read needs all the active weights pulled from VRAM. There’s no MoE sparsity hiding most of the model. That makes its memory footprint predictable and its speed a straight function of bandwidth.

Here are the real GGUF sizes from the community quants (bartowski’s widely used build), which range from 6.64 GB at the smallest to 23.64 GB at Q8:

QuantFile sizeFits 12 GB?Fits 16 GB?Fits 24 GB?
Q4_K_M13.3 GBNo (with context)Yes (tight)Yes
Q5_K_M~15.7 GBNoYes (very tight)Yes
Q6_K~18.3 GBNoNoYes
Q8_0~23.6 GBNoNoBarely

Q4_K_M is the one almost everyone runs. At 13.3 GB the weights alone leave about 2.7 GB free on a 16 GB card — enough for the KV cache at a few thousand tokens of context, but nowhere near enough to exploit the 256K context window. That context number is a server/API capability; on a 16 GB consumer card you’ll be living at 8K–16K context, and even a 24 GB card runs out of room long before 256K. (If you slam into the wall, our CUDA out of memory fixes walk through the KV-cache and context knobs that buy you headroom.)

Speed: where dense bites you

Decode speed on a local LLM is governed by memory bandwidth, not raw compute — the GPU spends its time waiting on weights, not doing math. For a 13.3 GB model the theoretical ceiling is bandwidth ÷ model size, and real-world throughput lands at roughly half that after KV-cache reads and overhead.

That math plays out cleanly across the three cards worth considering:

  • RTX 4060 Ti 16GB (288 GB/s): This is the bottleneck card. A comparable 24B dense model (Mistral Small 3.2) was independently clocked at about 18.5 tok/s on 16 GB hardware — and Codestral 2 lands in the same ~18–22 tok/s range. Usable for autocomplete and short edits, sluggish for anything that streams a long answer.
  • Used RTX 3090 (936 GB/s): More than 3× the bandwidth of the 4060 Ti, and it shows. Expect roughly 40–50 tok/s at Q4_K_M — comfortably past reading speed (~7–10 tok/s), so generations feel responsive. This is the card the model is happiest on.
  • RTX 4090 (1,008 GB/s): A dense 32B at Q4 lands near 60 tok/s here, and the 4090 runs about 20% faster than a 3090 on 30B-class models, so a 22B comes in around 60–75 tok/s. Fast, but you’re paying roughly double a 3090 for a model that doesn’t need it.

The honest framing: on bandwidth-per-dollar, the used 3090 wins decisively for Codestral 2. The 4060 Ti makes it run; the 3090 makes it pleasant.

Running it: Ollama and llama.cpp

The fastest path is Ollama. Pull the model and point your editor at it:

ollama pull codestral
ollama run codestral "Write a Python function to debounce calls with a configurable delay"

For FIM autocomplete inside your editor, Ollama exposes the completion endpoint on localhost:11434. Pair it with Continue.dev + Ollama for an in-IDE setup that uses Codestral 2 for both chat and tab-completion.

If you want explicit control over quant and context with llama.cpp:

# Grab the Q4_K_M GGUF (13.3 GB), then:
llama-server -m Codestral-22B-v0.1-Q4_K_M.gguf \
  -ngl 99 \
  -c 16384 \
  --host 0.0.0.0 --port 8080

-ngl 99 offloads all layers to the GPU — essential, because partial CPU offload on a dense 22B tanks throughput. -c 16384 sets a realistic 16K context; don’t reach for 256K on consumer VRAM, the KV cache will OOM you instantly.

Codestral 2 vs the models that overtook it

Here’s the part the marketing won’t tell you: in mid-2026, dense models lost the local-coding crown to MoE. A Mixture-of-Experts model with 30B+ total parameters but only 3B active per token reads far less from VRAM per step, so it runs faster than a dense 22B while often coding better.

That’s the real competition for Codestral 2:

  • Qwen3-Coder-Next — Alibaba’s MoE coding agent, faster decode at similar quality, also open-weight.
  • Devstral Small 2 — Mistral’s own agentic coding model, built for multi-file/tool-use workflows Codestral wasn’t designed for.

So why run Codestral 2 at all? Three reasons that still hold:

  1. The license. Apache 2.0 with no usage ceiling is cleaner than some competitors’ terms if you’re shipping a product.
  2. FIM quality. Codestral was built around fill-in-the-middle; its autocomplete inside an editor is excellent and low-latency on a 3090.
  3. Predictability. A dense model’s VRAM and speed are dead simple to reason about — no expert-routing surprises, no “why did my MoE just slow down” debugging.

If you’re picking a local coding stack from scratch, read our best local coding LLM comparison first — Codestral 2 is a strong FIM autocomplete engine, but it’s no longer the default chat/agent pick. For a broader look at how MoE changed the speed math, Qwen3.6 35B-A3B and friends tell the story.

No GPU? Rent before you buy

If you don’t have a 16 GB+ card yet and want to try Codestral 2 before spending $430–$1,070, rent an hour of a 24 GB GPU on RunPod. A 24 GB instance runs a few cents to ~$0.40/hour depending on the card, which is enough to load the Q4_K_M GGUF, wire it into your editor, and judge whether the FIM autocomplete is worth buying hardware for. It’s the cheapest way to settle the “is dense 22B fast enough for me” question without a return-window gamble.

FAQ

Is Codestral 2 actually free for commercial use now? Yes. As of the April 2026 relicense, Codestral 2 is Apache 2.0 — you can use it in paid products, closed-source tools, and fine-tuned derivatives with no separate license from Mistral. The original 2024 Codestral was Non-Production only.

What’s the minimum GPU to run Codestral 2? A 16 GB card (RTX 4060 Ti 16GB, RTX 4070 Ti Super, etc.) runs Q4_K_M at 8K–16K context. Below 16 GB you’re forced into CPU offload, which makes a dense 22B painfully slow. For a usable experience, 24 GB (RTX 3090) is the real floor.

Can I use the full 256K context locally? Not on a consumer card. 256K is a server/API capability; the KV cache for that much context dwarfs the model itself and overflows even a 24 GB GPU. Plan for 8K–32K context locally depending on your card.

Codestral 2 or Devstral Small 2? Codestral 2 for fast FIM autocomplete and single-file completion. Devstral Small 2 for agentic, multi-file, tool-using workflows — it was purpose-built for that and Codestral wasn’t.

How fast is it on a Mac? Apple Silicon is bandwidth-rich, so a dense 22B runs reasonably on an M-series chip with 32 GB+ unified memory. Speed tracks the chip’s memory bandwidth the same way it does on a GPU; an M4 Max (546 GB/s) lands between a 3090 and 4090.

The hardware referenced in this guide:

Sources

Last updated June 20, 2026. Prices and specs change; verify current rates before purchasing.

Was this article helpful?