Mistral Small 4 for Local AI in 2026: The 119B MoE Hardware Reality
TL;DR: Mistral Small 4 is a 119B MoE model with 6B active parameters per token: GPT-4-class in coding and reasoning, multimodal, and fully open-weight. The problem is that Q4_K_M quantization lands at ~74 GB, so no single consumer GPU can hold it. Your two realistic local paths are three used RTX 4090s (GPU cost alone: ~$3,300–4,500) or a Mac Studio M3 Ultra with 96 GB ($3,999). For most readers, the Mistral API at $0.15/M input tokens removes all of this friction.
| | 3× RTX 4090 | Mac Studio M3 Ultra 96 GB | RunPod H100 PCIe |
|---|---|---|---|
| Best for | Privacy-first, high-volume inference | Silent desk setup, macOS workflow | Occasional bursts without sunk cost |
| Upfront cost | ~$3,300–4,500 (used GPUs only) | $3,999 | $0 |
| Ongoing cost | ~$0.19–0.24/hr electricity | ~$0.04/hr electricity | $1.99/hr |
| Speed at Q4_K_M | 22–32 tok/s | ~8–12 tok/s | 60–80 tok/s |
| The catch | Three-slot motherboard + 2,000 W PSU | M3 Ultra only; M4 Max 64 GB won’t fit | Data leaves your machine |
Honest take: Unless you’re generating hundreds of millions of tokens per month with hard privacy requirements, the API at $0.15/M input tokens is cheaper and faster than any consumer hardware setup for this model. The local path here requires real justification.
What Mistral Small 4 actually is
Released March 2026 (model version: mistral-small-4-119b-2603), Mistral Small 4 is a Mixture-of-Experts (MoE) model from Mistral AI. The name is confusing: “Small” refers to its inference cost, not its parameter count.
The architecture: 119 billion total parameters, 128 experts, 4 active per token. That means the model activates just 6 billion parameters per forward pass (8 billion if you include embedding and output layers). In practice, inference compute roughly matches a 6–8B dense model, not the headline 119B. This is the same efficiency trick DeepSeek V3 and Llama 4 Scout use—you store a large model in memory, but think with a fraction of it.
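To make the "store a lot, compute with a little" point concrete, here is a minimal sketch that turns the figures quoted above into ratios. The constant names are my own; the numbers are the ones stated in this article.

```python
# Figures quoted in this article for Mistral Small 4.
TOTAL_PARAMS_B = 119    # total parameters, in billions
ACTIVE_PARAMS_B = 6     # parameters used per token (8 with embeddings/output)
NUM_EXPERTS = 128       # experts stored in memory
ACTIVE_EXPERTS = 4      # experts routed per token


def share(part: float, whole: float) -> float:
    """Return part/whole as a fraction between 0 and 1."""
    return part / whole


expert_share = share(ACTIVE_EXPERTS, NUM_EXPERTS)     # experts doing work per token
param_share = share(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)  # weights doing work per token

print(f"Experts active per token:    {expert_share:.1%}")
print(f"Parameters active per token: {param_share:.1%}")
```

Memory still has to hold all 119B parameters, which is why the hardware sections below are driven by file size rather than by the active count.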
What Mistral consolidated into this release:
- Magistral (their reasoning model) → Mistral Small 4 matches it on math and multi-step logic
- Devstral (their coding agent model) → Mistral Small 4 has the same code execution profile
- Mistral Small 3.x (their instruct model) → now superseded for general chat
Context window: 262,144 tokens. Modalities: text and image. License: Apache 2.0.
The 40% latency improvement and 3× throughput gain over Mistral Small 3 come entirely from the MoE switch: you’re doing 6B worth of FLOPs per token instead of 24B. The model is larger in memory, but faster when memory bandwidth isn’t the bottleneck (i.e., on professional hardware with NVLink or high-bandwidth interconnects).
The benchmark story
Mistral’s published numbers for Small 4:
| Benchmark | Mistral Small 4 | GPT-OSS 120B | Mistral Small 3.2 24B |
|---|---|---|---|
| MMLU Pro | 78 | ~76 | 71 |
| LiveCodeBench | 64 | 63 | 53 |
MMLU Pro at 78 places it in the same tier as GPT-4-class models for general knowledge. LiveCodeBench at 64, one point ahead of GPT-OSS 120B (itself a mixture-of-experts model, not a dense one) at 63, is the more meaningful number for the home lab audience: this is the model you’d reach for when you want GPT-4o-level coding help running locally, with 262K context and vision input.
For direct comparison, Mistral Small 3.2 (the 24B dense model covered in our Llama 3.3 vs Qwen3 vs Mistral comparison) scored 53 on LiveCodeBench. Mistral Small 4 is not a slight upgrade—it’s a different product tier.
The honest benchmark caveat: these numbers come from Mistral’s release materials. Third-party evaluations on specific tasks vary, and the coding lead over dense models narrows when token budgets are short (MoE excels at reasoning-heavy tasks where many experts contribute).
The hardware math: quantization options
Mistral Small 4’s GGUF quantization file sizes (bartowski builds, as of May 2026):
| Quantization | File size | VRAM needed (weights + overhead, before KV cache) | Quality vs FP16 |
|---|---|---|---|
| Q2_K | ~45 GB | ~46–48 GB | Significant degradation |
| Q3_K_M | ~52 GB | ~54–56 GB | Noticeable degradation on reasoning |
| Q4_K_M | ~74 GB | ~76–78 GB | Recommended minimum |
| Q5_K_M | ~89 GB | ~91–93 GB | Near-lossless |
| FP16 | ~244 GB | ~250+ GB | Reference quality |
The NVFP4 checkpoint (Mistral’s own format) is designed to slot into a single H100 80 GB for cloud deployment. For local GGUF-based inference via llama.cpp or Ollama, Q4_K_M is the realistic floor where quality doesn’t obviously degrade on coding and reasoning tasks.
VRAM needed is weights + ~2–4 GB OS/framework overhead + KV cache. KV cache scales with context length: at 8K context it’s cheap (~2–3 GB for most quantization levels); at the full 262K context window it gets expensive. Plan your headroom accordingly.
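The budget described above (weights + overhead + KV cache) can be sketched as follows. The KV-cache formula is the standard one for grouped-query attention; the layer count, KV-head count, and head dimension are HYPOTHETICAL placeholders, not Mistral Small 4's published configuration, so treat the printed sizes as illustrative and substitute the values from the model's config file.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a single sequence."""
    # Each token stores one key and one value vector per KV head, per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


def vram_needed_gb(weights_gb: float, overhead_gb: float, kv_gb: float) -> float:
    """Total budget: weights + OS/framework overhead + KV cache."""
    return weights_gb + overhead_gb + kv_gb


# HYPOTHETICAL architecture values, for illustration only.
ARCH = dict(n_layers=56, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 32_768, 262_144):
    kv = kv_cache_gb(ctx, **ARCH)
    total = vram_needed_gb(weights_gb=74, overhead_gb=3, kv_gb=kv)
    print(f"{ctx:>7} tokens: KV cache {kv:5.1f} GB, total {total:6.1f} GB")
```

The point of the sketch is the linear scaling: whatever the real architecture values are, going from 8K to 262K context multiplies the KV cache by 32.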
The consumer GPU path
Single RTX 4090 (24 GB): skip it
At 24 GB VRAM, a single RTX 4090 holds only 30–35% of Q4_K_M’s weights in VRAM. The rest, roughly 65%, spills to system RAM. With 64 GB DDR5 as overflow, expect 5–10 tokens/second. Interactive chat at that speed is painful. You’d spend the same money more usefully on cloud inference.
Even an RTX 5090 (32 GB VRAM) misses Q4_K_M by more than 40 GB. The 5090 shines at 32B-and-under models; Mistral Small 4 simply isn’t in its wheelhouse.
Two RTX 4090s (48 GB combined): Q2_K only, usable
Two 4090s running tensor parallelism over PCIe give you 48 GB of combined VRAM. Q2_K at ~45–48 GB just fits, with minimal CPU spillover. On the Q2_K build, expect 14–20 tok/s with 32 GB of DDR5 as buffer memory.
The problem is Q2_K quality. On reasoning and coding benchmarks, 2-bit quantization of a 119B MoE model introduces visible degradation. You’re paying the full hardware premium for a meaningfully worse model. If you can afford two 4090s, read the next section.
The multi-GPU wiring specifics—NVLink vs. PCIe tensor parallelism, which frameworks support it—are covered in our multi-GPU local AI guide.
Three RTX 4090s (72 GB combined): Q4_K_M, the real entry point
Three RTX 4090s give 72 GB of combined VRAM. Q4_K_M at ~74 GB is slightly over—you’ll still have a few GB spilling to RAM—but effective throughput jumps to 22–32 tok/s, which is usable for interactive work and background batch tasks.
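The fit-or-spill reasoning for each GPU configuration is simple subtraction. A minimal sketch, using the ~74 GB Q4_K_M file size from this article and ignoring framework overhead and KV cache (so real spill will be somewhat larger); the function name is my own:

```python
Q4_K_M_GB = 74  # approximate Q4_K_M file size quoted in this article


def offload_split(model_gb: float, vram_gb: float):
    """Return (GB held in VRAM, GB spilled to system RAM, spilled fraction)."""
    on_gpu = min(model_gb, vram_gb)
    spilled = model_gb - on_gpu
    return on_gpu, spilled, spilled / model_gb


configs = {
    "1x RTX 4090": 24,
    "1x RTX 5090": 32,
    "2x RTX 4090": 48,
    "3x RTX 4090": 72,
}

for name, vram in configs.items():
    on_gpu, spilled, frac = offload_split(Q4_K_M_GB, vram)
    print(f"{name}: {on_gpu} GB in VRAM, {spilled} GB spilled ({frac:.0%})")
```

Throughput falls off quickly as the spilled fraction grows, because those layers run at system-RAM speed rather than VRAM speed.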
What this actually costs (May 2026):
- RTX 4090 used (eBay completed listings): ~$1,099 each → 3× = ~$3,300
- RTX 4090 new (Amazon): ~$2,755 each → 3× = ~$8,265 (production ended October 2024; prices reflect remaining inventory)
- Realistic used price for three GPUs: $3,300–4,500
Add to that:
- HEDT or server platform motherboard with 3+ PCIe x16 slots: ~$400–700
- PSU: three 4090s draw ~450 W each at full load; plan for 1,800 W total system draw, which means a 2,000 W PSU minimum: ~$350–500
- CPU, RAM (64 GB minimum), NVMe: ~$600–900
Total realistic build: $5,000–7,000 all-in for a three-4090 Mistral Small 4 rig. Not cheap, and you’re dealing with hardware from a discontinued GPU generation that’s no longer under warranty when purchased used.
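As a sanity check on the total, the itemized ranges above can simply be summed. This sketch uses only the line items listed in this section; case, cooling, risers, and cabling are not itemized in the article and are left out, so the sum is a floor rather than a quote.

```python
# (low, high) price ranges in USD, taken from the line items in this section.
components = {
    "3x used RTX 4090": (3300, 4500),
    "Motherboard (3+ PCIe x16 slots)": (400, 700),
    "PSU": (350, 500),
    "CPU, RAM, NVMe": (600, 900),
}

low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())

print(f"Itemized parts: ${low:,} to ${high:,}")
```

Swapping in your own quotes per component is usually more informative than the headline range, since used GPU prices move week to week.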
Power draw reality: at 1,600 W continuous draw (80% load), you’re adding ~$0.19–0.24/hour in electricity at US average rates ($0.12–0.15/kWh). Running 8 hours per day, that’s ~$46–58/month just in power.
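The electricity estimate is power draw times price times hours. A minimal sketch using the inputs quoted above (1,600 W, $0.12–0.15/kWh, 8 hours per day); the 30-day month is my assumption.

```python
def electricity_cost(power_watts: float, rate_per_kwh: float, hours: float) -> float:
    """Cost in USD of drawing `power_watts` for `hours` at `rate_per_kwh`."""
    kilowatts = power_watts / 1000
    return kilowatts * rate_per_kwh * hours


RIG_WATTS = 1600        # continuous draw quoted in this article
HOURS_PER_DAY = 8
DAYS_PER_MONTH = 30     # assumption: a 30-day month

for rate in (0.12, 0.15):
    hourly = electricity_cost(RIG_WATTS, rate, 1)
    monthly = electricity_cost(RIG_WATTS, rate, HOURS_PER_DAY * DAYS_PER_MONTH)
    print(f"${rate:.2f}/kWh: ${hourly:.2f}/hour, ${monthly:.2f}/month")
```

Replace the rate with the figure on your own utility bill; residential prices vary widely by region.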
The Apple Silicon path
Apple Silicon’s unified memory makes it the most straightforward consumer hardware for oversized MoE models. The memory bandwidth is lower than multi-4090 setups, but you don’t need PCIe topology negotiations, and the thermal situation is manageable in a desktop form factor.
Mac Studio M4 Max with 96 GB (configure-to-order): slower but fits
The Mac Studio M4 Max can be configured with up to 96 GB of unified memory as a configure-to-order option (check current pricing at apple.com/shop/buy-mac/mac-studio). At ~74 GB for model weights plus OS overhead, you have ~18–20 GB for KV cache—comfortable for 8K–32K context lengths, tight at 128K+.
The bandwidth limitation matters here. The top-tier M4 Max (16-core CPU, 40-core GPU) delivers 546 GB/s of unified memory bandwidth; the base M4 Max (14-core) is 410 GB/s. A conservative rule of thumb divides bandwidth by the full model size, as if every weight were read for every token: 546 / 74 ≈ 7.4 tok/s, or ~5.5 tok/s on the 410 GB/s base variant. Because an MoE reads only the routed experts’ weights for each token, real throughput can land somewhat above that figure; real-world numbers via MLX are around 6–9 tok/s on the top-tier chip.
Use MLX, not llama.cpp, for Apple Silicon. It’s purpose-built for Metal and consistently outperforms llama.cpp for inference throughput on these chips.
Mac Studio M3 Ultra with 96 GB ($3,999): the better Apple path
The Mac Studio M3 Ultra base configuration ($3,999, 96 GB, 1 TB SSD) offers 819 GB/s of unified memory bandwidth, 50% more than the top-tier M4 Max. The same bandwidth-divided-by-model-size estimate for Mistral Small 4 Q4_K_M gives 819 / 74 ≈ 11.1 tok/s. Practical MLX results fall around 8–12 tok/s.
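The bandwidth arithmetic used for both Macs is a single division. A minimal sketch, assuming (as the estimates above do) that the full ~74 GB of weights is streamed from memory for every generated token; the function name is my own:

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Tokens/second if decoding is limited purely by memory bandwidth."""
    return bandwidth_gb_s / gb_read_per_token


MODEL_GB = 74  # Q4_K_M size quoted in this article

chips = {
    "M4 Max (14-core)": 410,
    "M4 Max (16-core)": 546,
    "M3 Ultra": 819,
}

for name, bandwidth in chips.items():
    estimate = bandwidth_bound_tok_s(bandwidth, MODEL_GB)
    print(f"{name}: {estimate:.1f} tok/s")
```

Passing a smaller `gb_read_per_token` models the MoE case where only the routed experts are read, which is one reason measured numbers do not track this estimate exactly.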
That’s still not exciting compared to a properly configured multi-4090 NVIDIA build, but it’s a clean single-machine setup with no PCIe slot drama, no 2,000 W PSU, and no multi-GPU interconnect research (the RTX 4090 has no NVLink, so everything runs over PCIe). The M3 Ultra also runs significantly quieter under sustained load.
If you need longer contexts (64K+), step up to the M3 Ultra 256 GB configuration (~$5,999 as of May 2026 after Apple’s price increase on the upgrade). The 256 GB variant gives Mistral Small 4 room for multi-document workflows at the full 262K context window.
For a deeper look at how Mac Studio stacks up against dual-RTX-4090 builds for models that actually do fit on consumer GPUs, see our Mac Studio M3 Ultra vs dual RTX 4090 comparison.
The cloud alternative: RunPod and the API
For most readers, neither a three-4090 build nor a $3,999 Mac Studio is the right call for this specific model.
RunPod H100 PCIe at $1.99/hr
A single H100 PCIe (80 GB VRAM) on RunPod runs Mistral Small 4’s NVFP4 checkpoint natively—the format Mistral AI designed specifically for H100 80 GB. Expected throughput: 60–80 tok/s, far beyond any consumer setup.
At $1.99/hr for spot instances, the math for occasional use is straightforward: 100 hours of inference per month costs $199, with zero upfront investment. At that rate it takes nearly 3 years of rental fees to match the price of a $7,000 three-4090 build, and that’s before accounting for electricity, depreciation, and the reality that a newer, cheaper GPU will exist in 18 months.
For sustained-use operations (running a team inference server 8+ hours per day), the economics shift. A dedicated H100 rented around the clock at $1.99/hr runs ~$1,432/month, while a purchased three-4090 build starts looking cheaper over 18–24 months if your workload justifies it. See our RunPod vs local GPU breakdown for the full cost model.
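The rent-versus-buy comparison reduces to dividing the build cost by the monthly saving. A minimal sketch using the $7,000 build and $1.99/hr rate from this article; the function name and the optional running-cost parameter are my own, and the model ignores depreciation, resale value, and the throughput gap between the two setups.

```python
def breakeven_months(build_cost: float, monthly_rental: float,
                     monthly_running_cost: float = 0.0) -> float:
    """Months of rental spending needed to equal the up-front build cost."""
    saving = monthly_rental - monthly_running_cost
    if saving <= 0:
        raise ValueError("Owning never breaks even if it costs more to run than to rent.")
    return build_cost / saving


BUILD_COST = 7000    # upper end of the all-in build estimate
HOURLY_RATE = 1.99   # RunPod H100 PCIe rate quoted in this article

occasional = breakeven_months(BUILD_COST, HOURLY_RATE * 100)  # 100 rented hours/month
print(f"100 hrs/month: {occasional:.1f} months to break even")
```

Pass your own electricity bill as `monthly_running_cost` to see how much power costs push the break-even point out.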
The Mistral API at $0.15/$0.60 per M tokens
Mistral’s hosted API for Small 4 is priced at $0.15 per million input tokens and $0.60 per million output tokens. For comparison:
| Monthly token volume | API cost | RunPod H100 (100 hrs) | Triple-4090 build (amortized over 24 mo) |
|---|---|---|---|
| 5M tokens | $0.75–3 | $199 | ~$292 |
| 50M tokens | $7.50–30 | $199 | ~$292 |
| 100M tokens | $15–60 | $199 | ~$292 |
| 500M tokens | $75–300 | ~$600 (300 hrs) | ~$292 + electricity |
| 1B tokens | $150–600 | ~$1,200 (600 hrs) | ~$292 + electricity |
API cost ranges run from all-input ($0.15/M) to all-output ($0.60/M) at the prices above. The API wins decisively for anything under ~500M tokens/month. Above that, a dedicated cloud GPU or owned hardware starts making sense for output-heavy workloads, and only if your workload is steady rather than bursty.
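Because input and output tokens are priced differently, the monthly API bill depends on the mix. A minimal sketch using the $0.15/$0.60 per-million prices quoted above; the 80/20 input/output split in the example is a HYPOTHETICAL workload, not a measured one.

```python
INPUT_PRICE_PER_M = 0.15    # USD per million input tokens (quoted above)
OUTPUT_PRICE_PER_M = 0.60   # USD per million output tokens (quoted above)


def api_cost(input_tokens: int, output_tokens: int) -> float:
    """Monthly API cost in USD for the given token counts."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M)


# Hypothetical workload: 50M tokens/month, 80% input and 20% output.
total_tokens = 50_000_000
cost = api_cost(int(total_tokens * 0.8), int(total_tokens * 0.2))
print(f"50M tokens at an 80/20 split: ${cost:.2f}/month")
```

Coding assistants that send large contexts and receive short completions sit near the input-heavy end; long-form generation sits near the output-heavy end.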
Who should actually run Mistral Small 4 locally
Good local-run candidates:
- Privacy-first organizations where medical, legal, or personal data can’t leave the building
- Development teams with high, predictable token volume (1B+ tokens/month) and a budget for dedicated GPU infra
- Researchers who need the full 262K context window with image inputs for multi-document analysis
- Anyone who already has a Mac Studio M3 Ultra 96 GB for other work and wants to try it at no marginal cost
Bad local-run candidates:
- Solo developers doing occasional coding help → use the API, spend $5/month
- Home lab enthusiasts who want to “run the biggest model” → you’ll get 8 tok/s on an expensive machine; Qwen3 72B on a dual 3090 is faster and more practical for daily use
- Anyone without a multi-slot HEDT motherboard and a 2,000 W PSU → the GPU costs are the smaller problem
If you’re looking for a model that actually makes sense on a $1,000–2,000 single-GPU home lab build, Mistral Small 3.2 24B (Q4_K_M: ~13 GB, fits on an RTX 4060 Ti 16 GB) delivers strong coding results and runs at 70–90 tok/s on a 4090. That’s the sweet spot for most home lab users.
Frequently Asked Questions
Can I run Mistral Small 4 on a single RTX 4090? Technically yes, but most of the model will sit in system RAM. With Q4_K_M (~74 GB), roughly 65% of the model spills from the 4090’s 24 GB VRAM to DDR5. Realistic inference speed: 5–10 tok/s. For interactive use that’s unusable; for overnight batch jobs it might be acceptable if you have 96+ GB of fast DDR5.
Is Mistral Small 4 actually “small”? The name refers to its inference cost, not its stored size. With 128 experts and only 4 active per token, each forward pass computes roughly 6–8B parameters—equivalent to a small dense model. The 119B total live in memory but most are idle on any given token.
Does Mistral Small 4 work with Ollama? As of May 2026, Ollama supports Mistral Small 4 via the GGUF format. Pull it with ollama pull mistral-small4:latest or specify a quantization level. Note that Ollama won’t automatically use tensor parallelism across multiple GPUs; for multi-GPU inference you’ll need llama.cpp with --n-gpu-layers and --tensor-split tuning, or vLLM.
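For the llama.cpp route, here is a minimal sketch that assembles a multi-GPU llama-server command line. The flags (--n-gpu-layers, --split-mode, --tensor-split, --ctx-size) are llama.cpp options as I understand them; confirm with llama-server --help on your build. The GGUF filename is a HYPOTHETICAL placeholder, and the helper function is my own, not part of any library.

```python
import shlex


def llama_server_command(model_path: str, n_gpus: int, ctx_size: int = 8192) -> list[str]:
    """Build a llama-server argument list that spreads a model evenly across GPUs."""
    even_split = ",".join(["1"] * n_gpus)   # equal share of the model per GPU
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "999",            # offload as many layers as will fit
        "--split-mode", "layer",            # distribute whole layers across GPUs
        "--tensor-split", even_split,
        "--ctx-size", str(ctx_size),
    ]


# Hypothetical local filename; use the path of the GGUF you downloaded.
cmd = llama_server_command("Mistral-Small-4-119B-2603-Q4_K_M.gguf", n_gpus=3)
print(shlex.join(cmd))
```

The function only prints the command; pass `cmd` to `subprocess.run` once the path and flags match your setup. If the model does not fully fit, lowering --n-gpu-layers keeps the remaining layers on the CPU.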
What’s the minimum Apple Silicon Mac that can run this model? The Mac Studio M4 Max with 96 GB is the lowest-cost Apple Silicon option that fits Q4_K_M without CPU offload. The Mac Mini M4 Pro (max 64 GB) and Mac Studio M4 Max at 64 GB both fall short—you’d be stuck at Q2_K, which degrades reasoning quality noticeably.
How does Mistral Small 4 compare to Llama 4 Scout for local inference? Llama 4 Scout (109B MoE, 17B active parameters) has a similar memory footprint and fits slightly more easily at lower quantization levels, but it does roughly three times the compute per token, so Mistral Small 4’s 6B active parameters make it the lighter model to run. The choice depends on your specific workload: Scout handles coding well with a slightly lower memory floor; Small 4 has a slight edge in instruction-following fidelity and vision quality.
Sources
- Introducing Mistral Small 4 — Mistral AI
- Mistral Small 4 VRAM Requirements — Will It Run AI Blog
- Mistral Small 4 Local Setup: The 119B MoE Hardware Reality — CraftRigs
- Mistral Small 4 Is Free — But Running It Locally Will Cost You $10,000 — CraftRigs
- bartowski/mistralai_Mistral-Small-4-119B-2603-GGUF — Hugging Face
- Mistral Small 4 — API Pricing & Benchmarks — OpenRouter
- Mac Studio 2025 — Technical Specifications — Apple Support
- M4 Max Specifications — TinyMacs
- Apple Pulls $4,000 512 GB Mac Studio Upgrade Option — Tom’s Hardware
- NVIDIA GeForce RTX 4090 Price — May 2026 — GPU Poet
- Mistral AI API Pricing — AI Pricing Guru
- RunPod Pricing 2026 — CheckThat.ai
- Mistral Small 4: One Open-Source Model to Replace Three — Ten Invent Blog
- Mistral Small 4 Complete Guide and Benchmarks — Emelia.io
Last updated May 30, 2026. Prices and availability change frequently; verify current rates before purchasing.