May 30, 2026

Devstral Small 2 for Local AI in 2026: Which GPU Runs Mistral's Best Open-Source Coding Model?

By RunAIHome Team · 13 min read

local-llmmistralcodinggpudevstralinferenceollama

TL;DR: Devstral Small 2 is a 24B open-source coding model from Mistral that scores 68% on SWE-bench Verified — competitive with models five times its size. At Q4_K_M quantization it needs ~14–15 GB of VRAM, putting it squarely within reach of any 24 GB GPU. The RTX 3090 bought used is the value winner; the RTX 4090 gives you headroom for Q8 and long contexts.

	RTX 3090 24GB (used)	RTX 4090 24GB	RTX 5060 Ti 16GB	Mac Mini M4 Pro 24GB
Best for	Max VRAM per dollar	Fastest NVIDIA option	Tight but functional	Silent, efficient, no driver pain
Q4_K_M fit?	Yes (room to spare)	Yes (room to spare)	Tight (~14–15 GB leaves ~1–2 GB)	Yes
Q8 fit?	Yes (~25–26 GB)	Yes (~25–26 GB)	No	Depends on config
Token speed	~33–44 tok/s	~40–50 tok/s (est.)	~16–21 tok/s (est.)	~20–30 tok/s
Approx. cost	~$800–$1,050 used	~$1,599+ new	~$429 new	$1,399

Honest take: If you already own an RTX 3090 or RTX 4090, Devstral Small 2 runs today with one command. If you’re buying new, the RTX 3090 remains the single smartest 24 GB purchase specifically because this model (and every other 20–24 GB GGUF you’ll run) rewards raw VRAM over everything else.

What Devstral Small 2 actually is

Mistral released Devstral Small 2 on December 9, 2025, alongside the larger 123B Devstral 2 model. The Small 2 variant packs 24 billion parameters into an Apache 2.0 license — meaning you can fine-tune, redistribute, and run it commercially without a royalty or a lawyer.

The headline number: 68.0% on SWE-bench Verified. SWE-bench measures how often a model can resolve real GitHub issues in production codebases — the closest public proxy for “does this thing actually write working code.” A 24B model hitting 68% puts it ahead of many 70B-class models and within striking distance of GPT-4o-era frontier performance.

Three other details matter for local deployment:

256K-token context window. You can feed it an entire large codebase, not just a single file.
Multimodal. The model accepts image inputs, so it can handle screenshots of error messages or UI mockups.
Agentic by design. Devstral Small 2 is explicitly tuned for multi-step agent loops — plan, read file, write patch, test, repeat — rather than single-turn code generation.

If you have been running the original Devstral (22B, released May 2025) from the Ollama library, Small 2 is a 24B successor that improves on the agentic tooling substantially.

VRAM breakdown: Q4 vs Q8

The 24B parameter count translates directly into memory requirements. Here is what the quantization math looks like for Devstral Small 2:

Quantization	Model weights	Typical KV cache (8K ctx)	Minimum VRAM
Q4_K_M	~14.5 GB	~0.5 GB	~15 GB
Q5_K_M	~17.5 GB	~0.5 GB	~18 GB
Q8_0	~25.5 GB	~0.5 GB	~26 GB
FP16 (full)	~48 GB	—	Multi-GPU only

What this means card-by-card:

16 GB cards (RTX 5060 Ti 16GB, RTX 4060 Ti 16GB): Q4_K_M technically loads, but with only 1–2 GB to spare the KV cache for long code files will overflow to system RAM. Short-context tasks (review this function, write a unit test) work. Long-context agent sessions — where the model reads 20 files and plans a multi-step refactor — will stutter or fail.
24 GB cards (RTX 3090, RTX 4090, RTX 3090 Ti): Q4_K_M runs with ~9 GB of headroom, enough for the full 256K context at reasonable lengths. Q8_0 also fits, giving you higher output quality and sharper reasoning on complex code.
Apple Silicon unified memory: Any Mac config with 24 GB+ handles Q4_K_M cleanly. The Mac Mini M4 Pro with 24GB gives you exactly this at $1,399.

For context on the quality difference between Q4 and Q8, our article on Q4 vs Q8 quantization quality loss found real but not catastrophic degradation at Q4 — acceptable for interactive coding tasks, noticeable on multi-hop reasoning chains.

GPU-by-GPU breakdown

RTX 3090 24GB (used) — the value pick

The RTX 3090 runs Devstral Small 2 Q4_K_M at roughly 33–44 tokens per second via Ollama, based on hardware testing reported by hardware-corner.net. That speed is more than comfortable for interactive coding: watching tokens generate at 35 tok/s feels like a fast typist, not a loading bar.

The 3090’s 936 GB/s memory bandwidth is the engine here. LLM inference in the GGUF path is almost entirely memory-bandwidth-limited; every additional GB/s translates nearly linearly to faster tokens.

Used pricing in May 2026 sits at $800–$1,050 on eBay, depending on the blower vs. fan variant and OC factory clocks. The Q8 load (~25–26 GB) also fits on a 3090 with minimal overhead, so you are not sacrificing output quality to save money on the hardware.

We covered the 3090’s broader value proposition in detail in Used RTX 3090 in 2026: Still the AI Value King? — the conclusion still holds for Devstral Small 2.

RTX 4090 24GB — fastest NVIDIA consumer GPU

The RTX 4090 has 1008 GB/s memory bandwidth — about 8% higher than the 3090 — and runs 14B models at ~69 tok/s in direct benchmarks. Scaling proportionally to a 24B model, you can expect roughly 40–50 tok/s on Devstral Small 2 Q4_K_M. (These are estimated figures from memory bandwidth scaling; direct Devstral Small 2 benchmarks on the 4090 were not available at time of writing.)

What the 4090 adds beyond raw speed: it handles Q8_0 just as easily as Q4, and the additional VRAM headroom matters as soon as you point an agent at a large codebase with deep context. If you are running Devstral Small 2 inside a coding agent that reads 30–40 files per session, that headroom is not optional.

New 4090 cards are currently priced above $1,599 at retail. That is a significant premium over a used 3090 for a ~20% speed gain on this model specifically.

RTX 5060 Ti 16GB — technically works, in theory

The RTX 5060 Ti 16GB can load Devstral Small 2 Q4_K_M (~14.5 GB), but the margin is razor thin. We have covered this card extensively — see RTX 5060 Ti 8GB vs 16GB and RTX 5070 vs RTX 5060 Ti — and the pattern is consistent: for any model that truly needs 24 GB to breathe, 16 GB is a painful ceiling.

The 5060 Ti 16GB delivers 448 GB/s memory bandwidth — 48% of the RTX 3090’s 936 GB/s. Scale the 3090’s 33–44 tok/s by that ratio and you land at roughly 16–21 tok/s (estimated from bandwidth; measured Devstral benchmarks on the 5060 Ti were not available at time of writing). Practically, you will hit the KV-cache ceiling before you notice the speed difference.

Verdict: if you already own a 5060 Ti 16GB, run Devstral Small 2 at Q4_K_M for short-context sessions and it will work. If you are buying specifically to run this model, the 5060 Ti is the wrong purchase.

RTX 5070 12GB — skip for this model

The RTX 5070’s 12 GB VRAM cannot load Q4_K_M (~14.5 GB). Full stop. Even with aggressive offloading to system RAM, the resulting inference speed becomes unusable for a coding agent. We already made this argument for Mistral-class 24B models in the RTX 5070 vs RTX 5060 Ti comparison: more bandwidth does not help when the model does not fit.

Apple Silicon: quiet, efficient, capable

Apple Silicon handles Devstral Small 2 well because unified memory eliminates the VRAM ceiling that trips up 16 GB discrete cards.

The Mac Mini M4 Pro with 24 GB runs 24B models at 20–30 tok/s via Ollama using Metal acceleration. Its 273 GB/s memory bandwidth is lower than the RTX 3090’s 936 GB/s, which explains the speed gap — but 20+ tok/s is still usable for interactive coding sessions, and the M4 Pro draws only 30–40 W under AI load, versus the 3090’s 350 W.

The 36 GB M4 Pro configuration ($1,599) loads Q8 comfortably and adds headroom for large-context agent sessions. If you already have a Mac and want to avoid a dedicated GPU rig, Devstral Small 2 is a strong argument for sticking with what you have.

Running it with Ollama

Ollama is the fastest path to a running Devstral Small 2. With Ollama installed:

# Q4_K_M — fits 16 GB cards (tight) and all 24 GB setups
ollama pull devstral-small-2

# Q8 — for 24 GB+ cards when you want better reasoning quality
ollama pull devstral-small-2:24b-instruct-q8_0

# Sanity check
ollama run devstral-small-2 "Write a Python function that retries a flaky network call with exponential backoff."

Check the current model tag at ollama.com/library before pulling — Ollama tags can update between minor releases.

For llama.cpp users, bartowski maintains GGUF builds on Hugging Face at bartowski/mistralai_Devstral-Small-2-24B-Instruct-2512-GGUF. The Q4_K_M file is ~14.7 GB; Q8_0 is ~25.6 GB.

If you want a web UI alongside Ollama, our vLLM vs Ollama comparison walks through when the single-user Ollama path is sufficient versus when vLLM’s concurrent-request handling matters.

Connecting to a VS Code coding agent

The practical value of Devstral Small 2 over a general-purpose model like Llama 3.3 70B is its agentic tuning. It behaves well inside tool-call loops, reads file contents without hallucinating paths, and produces diffs rather than inline explanations.

To use it from VS Code via Continue.dev:

# config.yaml
models:
  - title: Devstral Small 2 (local)
    provider: ollama
    model: devstral-small-2
    apiBase: http://localhost:11434

For Cline, set the model to devstral-small-2 in the OpenAI-compatible provider pointing at http://localhost:11434/v1.

Our full local coding setup walkthrough is at Setting Up a Local AI Coding Stack with Continue.dev + Ollama. For a broader comparison of AI coding tools including Cursor, Windsurf, and Copilot, our sister site aicoderscope.com covers that landscape in detail.

No GPU yet? RunPod lets you spin up an RTX 4090 instance to test Devstral Small 2 before committing to hardware — useful for validating whether the agentic workflow fits your codebase before spending $1,000+ on a 3090.

How it ranks against other coding models

SWE-bench Verified is the clearest public benchmark for real-world coding ability:

Model	Parameters	SWE-bench Verified	Runs locally?
Claude Sonnet 4.6	API-only	~79.6%	No
Gemini 2.5 Pro	API-only	~73.1%	No
Qwen3-Coder-Next	—	~70.6%	Limited (very large)
Devstral Small 2	24B	68.0%	Yes, 24 GB GPU
Qwen 2.5 Coder 32B	32B	~65%	Yes, needs 24 GB+ for Q4
Llama 3.3 70B	70B	~57%	Needs 48+ GB or offload

The takeaway: Devstral Small 2 is the strongest model you can run on a single 24 GB consumer GPU for code-specific tasks. Qwen 2.5 Coder 32B is a close alternative at slightly lower quality and higher VRAM demand; Llama 3.3 70B offers broader general knowledge but lower raw SWE-bench performance at much higher hardware cost.

For a broader view of what models fit what VRAM tier, see Best Local AI Models by VRAM Tier.

What you’d need for Devstral 2 (123B)

The big sibling — Devstral 2 at 123B parameters — is where the benchmark ceiling sits for Mistral’s open-weight lineup. The hardware requirement is genuinely punishing:

Q4 quantization alone requires ~64–72 GB VRAM to load with minimal context.
Q8 full precision needs ~128 GB, putting it in the territory of four RTX 3090s or a workstation with enterprise memory.
A dual RTX 5090 setup (2 × 32 GB = 64 GB) covers Q4 at short context, but the RTX 5090’s current pricing makes this an expensive experiment.

For most home-lab users, Devstral Small 2 is the practical ceiling. The 123B version belongs to cloud GPU providers or researchers with serious multi-GPU rigs.

Frequently Asked Questions

Does Devstral Small 2 work on an 8 GB or 12 GB GPU? No. The Q4_K_M quantization requires ~14–15 GB of VRAM. An RTX 4060 8GB or RTX 5070 12GB cannot load it without heavy CPU offloading, which drops token speed to unusable levels (under 5 tok/s) on a typical DDR5 system.

What is the difference between Devstral and Devstral Small 2? The original Devstral (May 2025, 22B) was Mistral’s first coding-agent model. Devstral Small 2 (December 2025, 24B) is the second-generation small version — higher SWE-bench score, longer 256K context, multimodal inputs, and better agentic tool-call behavior. They use different Ollama tags; check the library page for current identifiers.

Can Devstral Small 2 replace GitHub Copilot for local development? For single-file autocomplete and short chat tasks, it is competitive. For agentic workflows — where the model browses files, proposes diffs, and runs iterative fixes — it is arguably better than Copilot’s basic tier. The caveat is that local inference at 30–44 tok/s feels slower than a cloud autocomplete; you trade latency for privacy and zero API costs.

Is the Apache 2.0 license actually commercial-use friendly? Yes. Apache 2.0 allows commercial use, modification, distribution, and sublicensing without royalty. You can build a product on top of Devstral Small 2, ship it to customers, and charge for it. Verify with Mistral’s model card for any usage restrictions added after the initial release.

How does it compare to Mistral Small 4 (the 119B MoE model)? They are fundamentally different deployment profiles. Mistral Small 4 is a mixture-of-experts 119B model — enormous in total parameters, moderate in active parameters per forward pass — and requires significantly more VRAM than Devstral Small 2. For hardware comparison details see our Mistral Small 4 hardware guide. For pure coding tasks on a single 24 GB GPU, Devstral Small 2’s specialized training wins.

Sources

Last updated May 30, 2026. Prices and specs change; verify current rates before purchasing.

Recommended Gear

RTX 3090 24GB — best value 24 GB GPU for Devstral Small 2
RTX 4090 24GB — fastest NVIDIA consumer option
RTX 5060 Ti 16GB — budget option; tight Q4_K_M fit only
RTX 5090 — only relevant for running the 123B Devstral 2 variant
Mac Mini M4 Pro — silent, efficient, 24 GB unified memory handles Q4_K_M and Q8

Was this article helpful?