Mistral Small 4 for Local AI in 2026: The 119B MoE Hardware Reality


TL;DR: Mistral Small 4 is a 119B MoE model with 6B active parameters per token: GPT-4-class in coding and reasoning, multimodal, and fully open-weight. The problem is that Q4_K_M quantization lands at ~74 GB, so no single consumer GPU gets you there. Your two realistic local paths are three RTX 4090s (GPU cost alone: ~$3,300 used, ~$8,265 new) or a Mac Studio M3 Ultra with 96 GB ($3,999). For most readers, the Mistral API at $0.15/M input tokens removes all of this friction.

| | 3× RTX 4090 | Mac Studio M3 Ultra 96 GB | RunPod H100 PCIe |
| --- | --- | --- | --- |
| Best for | Privacy-first, high-volume inference | Silent desk setup, macOS workflow | Occasional bursts without sunk cost |
| Upfront cost | ~$3,300–4,500 (used GPUs only) | $3,999 | $0 |
| Ongoing cost | ~$0.19–0.24/hr electricity | ~$0.04/hr electricity | $1.99/hr |
| Speed at Q4_K_M | 22–32 tok/s | ~8–12 tok/s | 60–80 tok/s |
| The catch | Three-slot motherboard + 2,000 W PSU | M3 Ultra only; M4 Max 64 GB won’t fit | Data leaves your machine |

Honest take: Unless you’re generating 10M+ tokens per month with hard privacy requirements, the API at $0.15/M input tokens is cheaper and faster than any consumer hardware setup for this model. The local path here requires real justification.


What Mistral Small 4 actually is

Released March 2026 (model version: mistral-small-4-119b-2603), Mistral Small 4 is the first Mixture-of-Experts model in Mistral AI’s Small line (the earlier Mixtral 8x7B and 8x22B releases were also MoE). The name is confusing: “Small” refers to its inference cost, not its parameter count.

The architecture: 119 billion total parameters, 128 experts, 4 active per token. That means the model activates just 6 billion parameters per forward pass (8 billion if you include embedding and output layers). In practice, inference compute roughly matches a 6–8B dense model, not the headline 119B. This is the same efficiency trick DeepSeek V3 and Llama 4 Scout use—you store a large model in memory, but think with a fraction of it.
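To make "128 experts, 4 active per token" concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not Mistral's router: the random scores stand in for a learned gating layer, and normalizing with a softmax over only the selected experts is one common variant, assumed here for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # expert count quoted in the article
TOP_K = 4          # experts active per token, quoted in the article


def route_token(router_logits, k=TOP_K):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
# Stand-in router scores; a real model computes these from the token's hidden state.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits)

print(f"{len(chosen)} of {NUM_EXPERTS} experts run for this token "
      f"({len(chosen) / NUM_EXPERTS:.1%} of the expert pool)")
for idx, weight in chosen:
    print(f"  expert {idx:3d}  weight {weight:.3f}")
```

All 128 experts still have to sit in memory because the selection changes from token to token; only the compute is limited to the chosen four.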

What Mistral consolidated into this release:

  • Magistral (their reasoning model) → Mistral Small 4 matches it on math and multi-step logic
  • Devstral (their coding agent model) → Mistral Small 4 has the same code execution profile
  • Mistral Small 3.x (their instruct model) → now superseded for general chat

Context window: 262,144 tokens. Modalities: text and image. License: Apache 2.0.

The 40% latency improvement and 3× throughput gain over Mistral Small 3 come from the MoE switch: you’re doing roughly 6B parameters’ worth of compute per token instead of 24B. The model is larger in memory, but faster when memory bandwidth isn’t the bottleneck (i.e., on professional hardware with NVLink or other high-bandwidth interconnects).


The benchmark story

Mistral’s published numbers for Small 4:

| Benchmark | Mistral Small 4 | GPT-OSS 120B | Mistral Small 3.2 24B |
| --- | --- | --- | --- |
| MMLU Pro | 78 | ~76 | 71 |
| LiveCodeBench | 64 | 63 | 53 |

MMLU Pro at 78 places it in the same tier as GPT-4-class models for general knowledge. LiveCodeBench at 64, one point ahead of GPT-OSS 120B at 63, is the more meaningful number for the home lab audience: this is the model you’d reach for when you want GPT-4o-level coding help running locally, with 262K context and vision input.

For direct comparison, Mistral Small 3.2 (the 24B dense model covered in our Llama 3.3 vs Qwen3 vs Mistral comparison) scored 53 on LiveCodeBench. Mistral Small 4 is not a slight upgrade—it’s a different product tier.

The honest benchmark caveat: these numbers come from Mistral’s release materials. Third-party evaluations on specific tasks vary, and the coding lead over dense models narrows when token budgets are short (MoE excels at reasoning-heavy tasks where many experts contribute).


The hardware math: quantization options

Mistral Small 4’s GGUF quantization file sizes (bartowski builds, as of May 2026):

| Quantization | File size | VRAM needed (weights alone) | Quality vs FP16 |
| --- | --- | --- | --- |
| Q2_K | ~45 GB | ~46–48 GB | Significant degradation |
| Q3_K_M | ~52 GB | ~54–56 GB | Noticeable degradation on reasoning |
| Q4_K_M | ~74 GB | ~76–78 GB | Recommended minimum |
| Q5_K_M | ~89 GB | ~91–93 GB | Near-lossless |
| FP16 | ~244 GB | ~250+ GB | Reference quality |

The NVFP4 checkpoint (Mistral’s own format) is designed to slot into a single H100 80 GB for cloud deployment. For local GGUF-based inference via llama.cpp or Ollama, Q4_K_M is the realistic floor where quality doesn’t obviously degrade on coding and reasoning tasks.

Total VRAM needed is weights + ~2–4 GB of OS/framework overhead + KV cache. KV cache scales with context length: at 8K context it’s cheap (~2–3 GB for most quantization levels); the full 262K context window is 32× as many tokens, so the cache grows by roughly the same factor. Plan your headroom accordingly.
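The weights + overhead + KV cache sum can be sketched as a small calculator. The KV formula below is the standard one for attention with grouped KV heads and a 16-bit cache. The layer count, KV head count, and head dimension are hypothetical placeholders, since the article does not list Mistral Small 4's real values; swap in figures from the model's config before trusting the output.

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in GB (decimal) for one sequence."""
    # Factor of 2: each layer stores one key and one value vector per token.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1e9


def total_vram_gb(weights_gb, context_tokens, overhead_gb, arch):
    """Weights + fixed overhead + KV cache, all in GB."""
    return weights_gb + overhead_gb + kv_cache_gb(context_tokens, **arch)


# HYPOTHETICAL architecture values, chosen only to make the sketch runnable.
ARCH = {"n_layers": 64, "n_kv_heads": 8, "head_dim": 128}

for ctx in (8_192, 32_768, 262_144):
    need = total_vram_gb(weights_gb=74, context_tokens=ctx, overhead_gb=3, arch=ARCH)
    print(f"{ctx:>7} tokens: KV ~{kv_cache_gb(ctx, **ARCH):.1f} GB, total ~{need:.1f} GB")
```

The useful part is the shape of the result rather than the exact numbers: the weights and overhead stay fixed while the cache term grows linearly with context.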


The consumer GPU path

Single RTX 4090 (24 GB): skip it

At 24 GB VRAM, a single RTX 4090 holds only 30–35% of Q4_K_M’s transformer layers in VRAM. The rest, roughly 65%, spills to system RAM. With 64 GB DDR5 as overflow, expect 5–10 tokens/second. Interactive chat at that speed is painful. You’d spend the same money more usefully on cloud inference.

Even an RTX 5090 (32 GB VRAM) misses Q4_K_M by more than 40 GB. The 5090 shines at 32B-and-under models; Mistral Small 4 simply isn’t in its wheelhouse.

Two RTX 4090s (48 GB combined): Q2_K only, usable

Two 4090s connected via PCIe tensor parallelism give you 48 GB of combined VRAM. Q2_K at ~45–48 GB just barely fits, with minimal CPU spillover. On the Q2_K build, expect 14–20 tok/s with 32 GB of DDR5 as buffer memory.

The problem is Q2_K quality. On reasoning and coding benchmarks, 2-bit quantization of a 119B MoE model introduces visible degradation. You’re paying the full hardware premium for a meaningfully worse model. If you can afford two 4090s, read the next section.

The multi-GPU wiring specifics—NVLink vs. PCIe tensor parallelism, which frameworks support it—are covered in our multi-GPU local AI guide.

Three RTX 4090s (72 GB combined): Q4_K_M, the real entry point

Three RTX 4090s give 72 GB of combined VRAM. Q4_K_M at ~74 GB is slightly over—you’ll still have a few GB spilling to RAM—but effective throughput jumps to 22–32 tok/s, which is usable for interactive work and background batch tasks.
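The fit arithmetic running through this section (how much of the ~74 GB Q4_K_M file each configuration can hold) is easy to reproduce. This sketch counts weights only and ignores framework overhead and KV cache, so the real spill is somewhat larger than what it prints.

```python
Q4_K_M_GB = 74  # approximate Q4_K_M file size quoted in this article


def fit_report(vram_gb, model_gb=Q4_K_M_GB):
    """Return (fraction of weights held in VRAM, GB spilling to system RAM)."""
    in_vram = min(vram_gb, model_gb)
    return in_vram / model_gb, model_gb - in_vram


CONFIGS = [
    ("1x RTX 4090", 24),
    ("1x RTX 5090", 32),
    ("2x RTX 4090", 48),
    ("3x RTX 4090", 72),
]

for name, vram in CONFIGS:
    fraction, spill = fit_report(vram)
    print(f"{name}: {fraction:.0%} of weights in VRAM, ~{spill} GB spills to system RAM")
```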

What this actually costs (May 2026):

  • RTX 4090 used (eBay completed listings): ~$1,099 each → 3× = ~$3,300
  • RTX 4090 new (Amazon): ~$2,755 each → 3× = ~$8,265 (production ended October 2024; prices reflect remaining inventory)
  • Realistic used build with 3× GPUs: $3,300–4,500 for the GPUs

Add to that:

  • HEDT or server platform motherboard with 3+ PCIe x16 slots: ~$400–700
  • PSU: three 4090s draw ~450 W each at full load; plan for 1,800 W total system draw, which means a 2,000 W PSU minimum: ~$350–500
  • CPU, RAM (64 GB minimum), NVMe: ~$600–900

Total realistic build: $5,000–7,000 all-in for a three-4090 Mistral Small 4 rig. Not cheap, and you’re dealing with hardware from a discontinued GPU generation that’s no longer under warranty when purchased used.
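As a sanity check, the itemized ranges above can be summed directly. The component figures are the article's; the observation that the itemized sum comes in slightly under the quoted total, presumably leaving room for a case, cooling, and cabling, is my assumption rather than something the article states.

```python
# (low, high) price ranges in USD, taken from the lists above.
COMPONENTS = {
    "3x RTX 4090 (used)": (3300, 4500),
    "Motherboard (3+ PCIe x16 slots)": (400, 700),
    "PSU": (350, 500),
    "CPU, RAM, NVMe": (600, 900),
}

low_total = sum(low for low, _ in COMPONENTS.values())
high_total = sum(high for _, high in COMPONENTS.values())

for part, (low, high) in COMPONENTS.items():
    print(f"{part:<34} ${low:>5,} to ${high:>5,}")
print(f"{'Itemized total':<34} ${low_total:>5,} to ${high_total:>5,}")
```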

Power draw reality: at 1,600 W continuous draw (80% load), you’re adding ~$0.19–0.24/hour in electricity at US average rates ($0.12–0.15/kWh). Running 8 hours per day, that’s ~$46–58/month just in power.


The Apple Silicon path

Apple Silicon’s unified memory makes it the most straightforward consumer hardware for oversized MoE models. The memory bandwidth is lower than multi-4090 setups, but you don’t need PCIe topology negotiations, and the thermal situation is manageable in a desktop form factor.

Mac Studio M4 Max with 96 GB (configure-to-order): slower but fits

The Mac Studio M4 Max can be configured with up to 96 GB of unified memory as a configure-to-order option (check current pricing at apple.com/shop/buy-mac/mac-studio). At ~74 GB for model weights plus OS overhead, you have ~18–20 GB for KV cache—comfortable for 8K–32K context lengths, tight at 128K+.

The bandwidth limitation matters here. The top-tier M4 Max (16-core CPU, 40-core GPU) delivers 546 GB/s of unified memory bandwidth; the base M4 Max (14-core) is 410 GB/s. A conservative estimate that treats every token as a full pass over the 74 GB of weights gives 546 / 74 ≈ 7.4 tok/s, or ~5.5 tok/s on the 410 GB/s base variant. Because an MoE model reads only the active experts for each token, that figure is a planning number rather than a hard ceiling, which is why real-world numbers via MLX land around 6–9 tok/s on the top-tier chip.

Use MLX, not llama.cpp, for Apple Silicon. It’s purpose-built for Metal and consistently outperforms llama.cpp for inference throughput on these chips.

Mac Studio M3 Ultra with 96 GB ($3,999): the better Apple path

The Mac Studio M3 Ultra base configuration ($3,999, 96 GB, 1 TB SSD) offers 819 GB/s of unified memory bandwidth, 50% more than the top M4 Max. Rough estimate for Mistral Small 4 Q4_K_M: 819 / 74 ≈ 11.1 tok/s. Practical MLX results fall around 8–12 tok/s.
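The bandwidth-over-weights division used for both Mac Studio chips can be reproduced in a few lines. It treats every generated token as one full read of the 74 GB weight file, which is the article's simplification; an MoE model that touches only its active experts per token does not have to obey it strictly.

```python
WEIGHTS_GB = 74  # Q4_K_M size quoted in this article

# Unified memory bandwidth in GB/s, as quoted in this section.
CHIPS = {
    "M4 Max (14-core)": 410,
    "M4 Max (16-core)": 546,
    "M3 Ultra": 819,
}


def bandwidth_bound_tok_s(bandwidth_gb_s, gb_read_per_token=WEIGHTS_GB):
    """Tokens per second if each token requires reading gb_read_per_token from memory."""
    return bandwidth_gb_s / gb_read_per_token


for chip, bandwidth in CHIPS.items():
    print(f"{chip}: ~{bandwidth_bound_tok_s(bandwidth):.1f} tok/s")
```

Passing a smaller `gb_read_per_token` shows how quickly the estimate rises if fewer bytes are read per token, which is the mechanism behind measured speeds landing above the simple figure.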

That’s still not exciting compared to a properly configured multi-4090 NVIDIA build, but it’s a clean single-machine setup with no PCIe slot drama, no 2,000 W PSU, and no NVLink compatibility research. The M3 Ultra also runs significantly quieter under sustained load.

If you need longer contexts (64K+), step up to the M3 Ultra 256 GB configuration (~$5,999 as of May 2026 after Apple’s price increase on the upgrade). The 256 GB variant gives Mistral Small 4 room for multi-document workflows at the full 262K context window.

For a deeper look at how Mac Studio stacks up against dual-RTX-4090 builds for models that actually do fit on consumer GPUs, see our Mac Studio M3 Ultra vs dual RTX 4090 comparison.


The cloud alternative: RunPod and the API

For most readers, neither a three-4090 build nor a $3,999 Mac Studio is the right call for this specific model.

RunPod H100 PCIe at $1.99/hr

A single H100 PCIe (80 GB VRAM) on RunPod runs Mistral Small 4’s NVFP4 checkpoint natively—the format Mistral AI designed specifically for H100 80 GB. Expected throughput: 60–80 tok/s, far beyond any consumer setup.

At $1.99/hr for spot instances, the math for occasional use is straightforward: 100 hours of inference per month costs $199, with zero upfront investment. Put that same $199/month toward a $7,000 three-4090 build and it takes nearly 3 years (about 35 months) to break even, and that’s before accounting for electricity, depreciation, and the reality that a newer, cheaper GPU will exist in 18 months.
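The break-even comparison is a single division, sketched here with an optional running-cost term so electricity can be included. The $7,000 build cost and $1.99/hr rate come from the article; the `monthly_running_cost` parameter and any value passed to it are illustrative assumptions.

```python
def breakeven_months(build_cost, monthly_cloud_cost, monthly_running_cost=0.0):
    """Months until an owned build has cost less than renting.

    Returns infinity when owning never catches up, i.e. when the build's
    running cost is at least as high as the cloud bill.
    """
    monthly_saving = monthly_cloud_cost - monthly_running_cost
    if monthly_saving <= 0:
        return float("inf")
    return build_cost / monthly_saving


HOURLY_RATE = 1.99  # RunPod H100 PCIe rate quoted in this article
BUILD_COST = 7000   # upper end of the three-4090 build estimate

for hours_per_month in (100, 240, 720):
    cloud_bill = HOURLY_RATE * hours_per_month
    months = breakeven_months(BUILD_COST, cloud_bill)
    print(f"{hours_per_month:>3} hrs/month: cloud ${cloud_bill:,.0f}/month, "
          f"break-even after {months:.1f} months")
```

Break-even time is inversely proportional to usage, which is why the occasional-use case looks so poor for owned hardware.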

For sustained-use operations (running a team inference server 8+ hours per day), the economics shift. A dedicated H100 rented around the clock at current rates runs ~$1,432/month ($1.99 × 720 hours), while a purchased three-4090 build starts looking cheaper over 18–24 months if your workload justifies it. See our RunPod vs local GPU breakdown for the full cost model.

The Mistral API at $0.15/$0.60 per M tokens

Mistral’s hosted API for Small 4 is priced at $0.15 per million input tokens and $0.60 per million output tokens. For comparison:

| Monthly token volume | API cost | RunPod H100 (100 hrs) | Triple-4090 build (amortized over 24 mo) |
| --- | --- | --- | --- |
| 5M tokens | $3–4.50 | $199 | ~$292 |
| 50M tokens | $30–45 | $199 | ~$292 |
| 100M tokens | $60–90 | $199 | ~$292 |
| 500M tokens | $300–450 | ~$600 (300 hrs) | ~$292 + electricity |
| 1B tokens | $600–900 | ~$1,200 (600 hrs) | ~$292 + electricity |

The API wins decisively for anything under ~300M tokens/month. Above that, a dedicated cloud GPU or owned hardware starts making sense—and only if your workload is steady rather than bursty.
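To estimate your own API bill, the two per-token prices are all you need. The sketch below uses the rates quoted above; the input/output splits in the example are hypothetical, since the table does not say how its token volumes divide between prompt and completion.

```python
INPUT_USD_PER_M = 0.15   # price per million input tokens, as quoted above
OUTPUT_USD_PER_M = 0.60  # price per million output tokens, as quoted above


def api_cost(input_tokens, output_tokens):
    """Monthly API cost in USD for the given token counts."""
    return (input_tokens / 1e6) * INPUT_USD_PER_M + (output_tokens / 1e6) * OUTPUT_USD_PER_M


# Hypothetical workloads: (label, input tokens, output tokens).
WORKLOADS = [
    ("output-only", 0, 5_000_000),
    ("2:1 input to output", 10_000_000, 5_000_000),
    ("prompt-heavy", 40_000_000, 5_000_000),
]

for label, tokens_in, tokens_out in WORKLOADS:
    print(f"{label:<20} ${api_cost(tokens_in, tokens_out):.2f}")
```

Because output tokens cost four times as much as input tokens, a generation-heavy workload reaches the crossover point sooner than a prompt-heavy one at the same total volume.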


Who should actually run Mistral Small 4 locally

Good local-run candidates:

  • Privacy-first organizations where medical, legal, or personal data can’t leave the building
  • Development teams with high, predictable token volume (1B+ tokens/month) and a budget for dedicated GPU infra
  • Researchers who need the full 262K context window with image inputs for multi-document analysis
  • Anyone who already has a Mac Studio M3 Ultra 96 GB for other work and wants to try it at no marginal cost

Bad local-run candidates:

  • Solo developers doing occasional coding help → use the API, spend $5/month
  • Home lab enthusiasts who want to “run the biggest model” → you’ll get 8 tok/s on an expensive machine; Qwen3 72B on a dual 3090 is faster and more practical for daily use
  • Anyone without a multi-slot HEDT motherboard and a 2,000 W PSU → the GPU costs are the smaller problem

If you’re looking for a model that actually makes sense on a $1,000–2,000 single-GPU home lab build, Mistral Small 3.2 24B (Q4_K_M: ~13 GB, fits on an RTX 4060 Ti 16 GB) delivers strong coding results and runs at 70–90 tok/s on a 4090. That’s the sweet spot for most home lab users.


Frequently Asked Questions

Can I run Mistral Small 4 on a single RTX 4090? Technically yes, but most of the model will sit in system RAM. With Q4_K_M (~74 GB), roughly 65% of the model spills from the 4090’s 24 GB VRAM to DDR5. Realistic inference speed: 5–10 tok/s. For interactive use that’s unusable; for overnight batch jobs it might be acceptable if you have 96+ GB of fast DDR5.

Is Mistral Small 4 actually “small”? The name refers to its inference cost, not its stored size. With 128 experts and only 4 active per token, each forward pass computes roughly 6–8B parameters—equivalent to a small dense model. The 119B total live in memory but most are idle on any given token.

Does Mistral Small 4 work with Ollama? As of May 2026, Ollama supports Mistral Small 4 via the GGUF format. Pull it with ollama pull mistral-small4:latest or specify a quantization level. Note that Ollama won’t automatically use tensor parallelism across multiple GPUs; for multi-GPU inference you’ll need llama.cpp with --n-gpu-layers tuning or vLLM.
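Once a model is pulled, Ollama serves it over a local HTTP API. The sketch below builds a non-streaming request to the /api/generate endpoint on Ollama's default port using only the standard library. The model tag is the one given in this answer and should be checked against what `ollama list` shows on your machine; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "mistral-small4:latest"  # tag as given in this article; verify locally


def build_request(prompt, model=MODEL_TAG):
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=MODEL_TAG, timeout=600):
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires a running Ollama server with the model pulled):
# print(generate("Write a function that reverses a linked list."))
```

The long timeout is deliberate: at single-digit tokens per second, a few hundred tokens of output can take a minute or more.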

What’s the minimum Apple Silicon Mac that can run this model? The Mac Studio M4 Max with 96 GB is the lowest-cost Apple Silicon option that fits Q4_K_M without CPU offload. The Mac Mini M4 Pro (max 64 GB) and Mac Studio M4 Max at 64 GB both fall short—you’d be stuck at Q2_K, which degrades reasoning quality noticeably.

How does Mistral Small 4 compare to Llama 4 Scout for local inference? Llama 4 Scout (109B MoE, 17B active parameters) has a slightly smaller total footprint, so it fits a little more easily in memory at a given quantization level, but it activates more parameters per token than Mistral Small 4’s 6B and is therefore heavier on inference compute. The choice depends on your specific workload: Scout handles coding well with a lower memory floor; Small 4 has a slight edge in instruction-following fidelity and vision quality.


Last updated May 30, 2026. Prices and availability change frequently; verify current rates before purchasing.

