Llama 4 Scout for Local AI in 2026: What "17B Active Parameters" Actually Means for Your GPU
When Meta released Llama 4 Scout in April 2025, the headline was irresistible: a 17B active-parameter model with a 10 million token context window. Home lab forums lit up. People assumed Scout would run on their existing RTX 4090, since 17B models at Q4 need roughly 10–12 GB of VRAM.
Then they typed ollama pull llama4:scout and watched 67 GB download.
The “17B active parameters” claim is technically accurate — Scout only activates 17 billion weights per token during inference. But the model has 109 billion total parameters spread across 16 expert networks, and all 109 billion must be loaded into memory before a single token generates. The active-vs-total distinction matters for compute throughput, not for how much VRAM or RAM you need.
This is the part most Scout guides skip. Here’s the actual hardware math.
How Scout’s MoE Architecture Works (and Why It Matters for VRAM)
Scout is a Mixture-of-Experts (MoE) model. Unlike a dense model where every weight fires on every token, MoE divides the feedforward layers into 16 specialized “expert” networks. A router decides which 1–2 experts handle each token. The attention layers are dense (always active); the FFN experts are sparse (mostly idle).
The math behind 17B active parameters:
- Attention layers shared across all tokens: ~7B parameters
- Each of the 16 expert FFN networks: ~6.4B parameters each
- Active at inference: attention (7B) + ~1–2 experts (~6.4–12.8B) ≈ 13–20B, averaging ~17B
The catch: all 16 experts must be in memory so the router can dispatch tokens to whichever expert it selects. You can’t pre-load only the “active” experts — you don’t know which one until runtime.
This is why Scout’s memory footprint is similar to a 109B dense model, not a 17B one. The inference compute resembles a 17B model (fast generation once loaded). The loading requirement resembles a 109B model (slow setup, large memory).
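The arithmetic above can be sketched in a few lines of Python. This is only a rough accounting that reuses the approximate figures from the list (7B shared, 6.4B per expert, 16 experts); the function names are illustrative and not part of any library.

```python
# Rough parameter accounting using the approximate figures from the list above.
ATTENTION_B = 7.0    # shared, always-active parameters (billions)
EXPERT_B = 6.4       # parameters per expert FFN (billions)
NUM_EXPERTS = 16


def total_params_b() -> float:
    """Parameters that must be resident in memory: shared layers plus every expert."""
    return ATTENTION_B + NUM_EXPERTS * EXPERT_B


def active_params_b(experts_per_token: int) -> float:
    """Parameters actually used to compute a single token."""
    return ATTENTION_B + experts_per_token * EXPERT_B


print(f"resident in memory: {total_params_b():.1f}B")
for k in (1, 2):
    print(f"active with {k} expert(s) per token: {active_params_b(k):.1f}B")
```

The memory figure depends on `NUM_EXPERTS`; the compute figure depends only on how many experts the router picks per token. That gap is the whole story behind the 67 GB download.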
The VRAM Reality: Scout’s File Sizes by Quantization
The authoritative numbers come from Unsloth’s GGUF releases and the official Ollama library:
| Quantization | Bits/weight | File size | Minimum hardware |
|---|---|---|---|
| BF16 (unquantized) | 16 | 216 GB | 4× A100 80GB |
| Q8_0 | 8 | 115 GB | Mac Studio M4 Ultra 192GB |
| Q4_K_M (Ollama default) | ~4.5 | 67 GB | 3× RTX 4090 or Mac 128GB |
| UD-Q4_K_XL (Unsloth) | 4.5 | 65.6 GB | Same as above |
| UD-Q2_K_XL (Unsloth, recommended) | 2.71 | 42.2 GB | 2× RTX 3090 |
| UD-IQ1_S (Unsloth, smallest) | 1.78 | 33.8 GB | RTX 5090 32GB (partial offload) |
The Unsloth dynamic quantization (UD) variants use a smart approach: they quantize the large MoE expert layers more aggressively, while leaving attention and embedding layers at 4–6 bit. This preserves quality better than applying uniform low-bit quantization across the entire model.
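One way to see the mixed-precision effect is to back out the average bits per weight from each file size. The sketch below assumes decimal gigabytes, 109B total parameters, and that the file is essentially all weights; `effective_bits_per_weight` is an illustrative helper, not an Unsloth or llama.cpp function.

```python
TOTAL_PARAMS = 109e9  # total parameters, from the article

# (label, nominal bits/weight, file size in GB), copied from the table above
QUANTS = [
    ("BF16", 16.0, 216.0),
    ("Q8_0", 8.0, 115.0),
    ("Q4_K_M", 4.5, 67.0),
    ("UD-Q4_K_XL", 4.5, 65.6),
    ("UD-Q2_K_XL", 2.71, 42.2),
    ("UD-IQ1_S", 1.78, 33.8),
]


def effective_bits_per_weight(file_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average bits stored per parameter, implied by the file size alone."""
    return file_gb * 1e9 * 8 / params


for label, nominal, file_gb in QUANTS:
    eff = effective_bits_per_weight(file_gb)
    print(f"{label:<12} nominal {nominal:>5.2f}  effective {eff:5.2f}")
```

For the low-bit UD variants the effective average comes out noticeably higher than the nominal label, which is what you would expect if some layers are kept at higher precision than the experts.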
The practical takeaway:
- ollama pull llama4:scout gives you the 67 GB Q4_K_M version
- You need 70+ GB of combined memory (VRAM or unified) to run it without CPU offloading
- For single-GPU users, CPU offloading is the only path — with significant speed penalties
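As a quick sanity check before downloading anything, you can compare your combined memory against the file sizes in the table. This is a minimal sketch: `best_fit` and the 3 GB `headroom_gb` default are assumptions for illustration (real headroom depends on context length and runtime overhead), and the sizes are the ones quoted above.

```python
from typing import Optional

# File sizes in GB, from the quantization table above.
QUANT_FILE_GB = {
    "BF16": 216.0,
    "Q8_0": 115.0,
    "Q4_K_M": 67.0,
    "UD-Q4_K_XL": 65.6,
    "UD-Q2_K_XL": 42.2,
    "UD-IQ1_S": 33.8,
}


def best_fit(memory_gb: float, headroom_gb: float = 3.0) -> Optional[str]:
    """Largest quantization whose file fits in memory after reserving headroom.

    Returns None when nothing fits, which means CPU offloading is required.
    """
    usable = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_FILE_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for mem in (24, 32, 48, 72):
    print(f"{mem} GB combined memory -> {best_fit(mem)}")
```

A `None` result does not mean Scout cannot run, only that some layers will spill to system RAM.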
GPU Decision Matrix: What Configuration Runs Scout How Fast
| Hardware | Combined VRAM | Best quant that fits | Approx. tok/s | Assessment |
|---|---|---|---|---|
| Single RTX 4090 (24GB) | 24 GB | UD-IQ1_S with ~10GB CPU offload | ~10–15 | Possible, slow |
| Single RTX 5090 (32GB) | 32 GB | UD-IQ1_S (~34GB, minor offload) | ~20–28 | Borderline, usable |
| 2× RTX 3090 (48GB) | 48 GB | UD-Q2_K_XL (42.2GB, fits) | ~28–38 | Good value path |
| 2× RTX 4090 (48GB) | 48 GB | UD-Q2_K_XL (42.2GB, fits) | ~35–48 | High-end consumer |
| 3× RTX 4090 (72GB) | 72 GB | Q4_K_M (67GB, fits) | ~40–55 | Top-tier home setup |
| Mac Studio M4 Max (128GB) | 128 GB unified | Q4_K_M (67GB, fits easily) | ~18–25 | Best single-device option |
| Mac Studio M4 Ultra (192GB) | 192 GB unified | Q8_0 (115GB, fits) | ~12–18 | Full quality, lower throughput |
Notes:
- Tok/s estimates are for Scout’s active-parameter profile (~17B active), which generates tokens faster than a dense 70B model but slower than a dense 17B model (overhead from expert routing)
- With CPU offloading, tokens per second drop sharply as more layers land in system RAM: PCIe 4.0 x16 bandwidth (~32 GB/s) versus GDDR6X (1008 GB/s on the RTX 4090) is the bottleneck
- For multi-GPU tensor parallelism with llama.cpp, you need a recent build with true tensor parallel support; NVLink is gone from consumer cards after Ampere (the RTX 3090 family was the last GeForce generation with an NVLink connector), so RTX 40- and 50-series setups communicate over PCIe
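The bandwidth gap in the notes above translates into a rough ceiling on generation speed. The sketch below is a back-of-the-envelope estimate that assumes token generation is purely memory-bandwidth bound and that each token reads the ~17B active weights once at ~4.5 bits each; it ignores KV cache reads, routing overhead, and compute, so real numbers will be lower. `max_tokens_per_second` is an illustrative name.

```python
def max_tokens_per_second(
    bandwidth_gb_s: float,
    active_params_b: float = 17.0,   # active parameters per token (billions)
    bits_per_weight: float = 4.5,    # approximate Q4 storage cost
) -> float:
    """Upper bound on tok/s if every token must stream the active weights once."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures quoted in the notes above.
for name, bw in (("RTX 4090 GDDR6X", 1008.0), ("PCIe 4.0 x16", 32.0)):
    print(f"{name:<16} ceiling ~ {max_tokens_per_second(bw):.1f} tok/s")
```

The absolute values matter less than the ratio: any weights that have to be read over the slower link pull the whole pipeline toward the lower ceiling.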
The clear winner for a single home device with no compromise: Mac Studio M4 Max with 128GB. Its unified memory holds Scout at full Q4 quality with room to spare. Note that the $1,999 entry price applies to the base memory configuration, and the 128GB option costs more, so check current pricing. See the Mac Studio vs dual RTX 4090 breakdown for full context.
For NVIDIA setups, the 2× RTX 3090 path running UD-Q2_K_XL (42.2 GB, fits in 48 GB combined) hits a reasonable price-to-performance point. Used RTX 3090s run $700–$1,000 on eBay — the value analysis is here.
Don’t want to buy hardware to evaluate Scout? Rent an A100 80GB on RunPod for $1.19/hr and run Q4_K_M (67 GB fits in 80 GB; Q8_0 at 115 GB would need two cards) to see if it fits your workload before committing to hardware.
Scout vs Llama 3.3 70B: The Quality Comparison
The benchmark picture is more nuanced than headlines suggest.
| Benchmark | Scout (109B MoE) | Llama 3.3 70B | Winner |
|---|---|---|---|
| MMLU (general knowledge) | 79.6% | 86.0% | Llama 3.3 70B |
| MMLU-Pro (harder benchmark) | 74.3% | 68.9% | Scout |
| MGSM (multilingual math) | 90.6% | 91.1% | Llama 3.3 70B (marginal) |
| GPQA Diamond (PhD-level science) | 57.2% | not reported | Scout (no Llama 3.3 70B score listed) |
| DocVQA (document understanding) | 94.4% | N/A (text only) | Scout (unique capability) |
| ChartQA | 88.8% | N/A (text only) | Scout (unique capability) |
| HumanEval (coding) | not reported | 88.4% | Llama 3.3 70B (no Scout score listed) |
The headline MMLU number (standard test) goes to Llama 3.3 70B by 6.4 percentage points. On MMLU-Pro, a separate and harder benchmark with more answer choices per question that is designed to weed out pattern-matching, Scout wins by 5.4 points. The GPQA Diamond score (57.2%) is notably strong, suggesting Scout reasons well on difficult scientific problems despite losing the general-knowledge headline.
The more important comparison for local AI use: Scout is a multimodal model; Llama 3.3 70B is text-only. If your use case includes reading PDFs, analyzing charts, or processing images alongside text, Scout has no head-to-head competitor at this open-weights quality level.
For pure text tasks — coding, writing, summarization — Llama 3.3 70B remains the stronger choice, and it fits better in commodity hardware at 40 GB for Q4_K_M versus Scout’s 67 GB. For reference, the quantization quality tradeoffs for either model are covered here.
The 10 Million Token Context Window (and Why You Probably Won’t Use It)
Meta’s official context limit for Scout is 10 million tokens — roughly 7.5 million words, or a small library of documents. The actual Ollama library entry shows 128K as the functional context window, and this is intentional: a 10M context requires extraordinary memory.
At Q4 (67 GB model), processing a 10M token context in a single pass would require additional KV cache memory proportional to context length. At 10M tokens with Scout’s architecture, the KV cache alone would dwarf the model weights — running into hundreds of gigabytes of additional memory. The 128K Ollama default is practical.
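To put a rough number on that, the standard KV cache estimate is two tensors (K and V) per layer, per KV head, per token. The sketch below uses architecture values that are assumptions, not figures from this article (48 layers, 8 KV heads, head dimension 128, 16-bit cache); check the model's config before relying on them. It also treats every layer as full attention, so read the result as an upper bound: layers with restricted attention spans and a quantized KV cache both shrink it.

```python
def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 48,        # assumed layer count
    n_kv_heads: int = 8,       # assumed number of KV heads (grouped-query attention)
    head_dim: int = 128,       # assumed per-head dimension
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> float:
    """Approximate KV cache size in decimal GB for a single sequence."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    return context_tokens * bytes_per_token / 1e9


for label, tokens in (("128K", 128 * 1024), ("1M", 1_000_000), ("10M", 10_000_000)):
    print(f"{label:>4} context -> ~{kv_cache_gb(tokens):,.0f} GB of KV cache")
```

Under these assumptions the 128K case is a manageable addition on top of the weights, while the 10M case lands far beyond anything a home machine can hold, whichever way the details shake out.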
The comparison at 128K is about cost, not length: Llama 3.3 70B’s maximum is 128K as well, but Scout reaches that limit while activating far fewer parameters per token. For long-document processing, that throughput advantage (faster token generation due to fewer active parameters) compounds over a long context.
For most home AI use cases — coding assistance, document Q&A, image analysis — the 128K default context is sufficient. If you genuinely need million-token context, that workload belongs on cloud infrastructure regardless.
How to Run Scout on Ollama
Ollama 0.22.0 (released April 2026) ships native Windows binaries with full Llama 4 support. The standard install path:
```bash
# Pull the default Q4_K_M variant (~67 GB download)
ollama pull llama4:scout

# Run with text input
ollama run llama4:scout

# Run with image input (native multimodal)
ollama run llama4:scout
>>> Describe what you see in this diagram: /path/to/image.png

# Check GPU/CPU split (important: confirm GPU is actually being used)
ollama ps
```
For memory-constrained setups, the Unsloth dynamic quantization variants are not yet in the official Ollama registry as of this writing — you would pull them via llama.cpp or LM Studio directly from Hugging Face:
```bash
# Smaller Unsloth variant via llama.cpp (42.2 GB, fits in 2x 3090)
./llama-cli -hf unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF:UD-Q2_K_XL
```
Run ollama ps after loading the model and check the PROCESSOR column. 100% GPU means the model sits entirely in VRAM; anything less means some layers are running on CPU, and tok/s falls faster than the split alone would suggest. A 50% GPU offload on a 67 GB model typically drops throughput to 30–40% of full-VRAM speed.
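If you want to script that check, a small parser for the processor column is enough. This sketch assumes the column uses the forms "100% GPU", "100% CPU", or a split such as "48%/52% CPU/GPU"; verify against the output of your Ollama version, since the format is not guaranteed to stay stable. `gpu_fraction` is an illustrative helper, and the sample strings are made up for the example.

```python
import re


def gpu_fraction(processor: str) -> float:
    """Share of the model resident on GPU (0.0 to 1.0), parsed from a PROCESSOR cell."""
    text = processor.strip()

    # Single-device form, e.g. "100% GPU" or "100% CPU".
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", text)
    if single:
        pct = int(single.group(1)) / 100
        return pct if single.group(2) == "GPU" else 1 - pct

    # Split form, e.g. "48%/52% CPU/GPU": the second percentage is the GPU share.
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2)) / 100

    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


# Illustrative sample values, not captured from a real run.
for cell in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    share = gpu_fraction(cell)
    note = "fully on GPU" if share == 1.0 else "expect reduced tok/s"
    print(f"{cell:<18} -> GPU share {share:.0%} ({note})")
```

Anything below a full GPU share is a cue to try a smaller quantization before blaming the model for slow generation.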
For concurrent users, see the full vLLM vs Ollama comparison, and note that Scout’s MoE architecture benefits more from vLLM’s continuous batching at scale than a dense model would: the compute cost per token stays at roughly 17B active parameters no matter how many requests you serve simultaneously.
Who Should Switch to Scout
Switch to Scout if:
- You process images, charts, PDFs, or screenshots alongside text — Llama 3.3 70B can’t do this
- You have a Mac Studio with 128GB+ unified memory already
- You’re running a 2× RTX 3090 or 2× RTX 4090 setup and want the best open-weights model that fits
- Your workload is GPQA-style reasoning rather than coding or general knowledge retrieval
Stick with Llama 3.3 70B if:
- You have a single 24GB GPU and care about generation speed — Llama 3.3 at Q4 with moderate CPU offload still runs faster
- Coding is your primary use case (HumanEval 88.4% vs Scout’s unreported coding score)
- You want the simplest hardware path — Llama 3.3 70B Q4 fits in 2× RTX 3090 with room to spare, and the RTX 3090 is currently available used at $700–$1,000
- MMLU matters to you — Llama 3.3 wins by 6.4 points on the standard benchmark
The close-call case: single RTX 4090 users. Scout at UD-IQ1_S (33.8 GB) with 10 GB CPU offload will run, but at 10–15 tok/s — usable for multimodal tasks where you’re not waiting for a real-time conversation. For pure text inference, Llama 3.3 70B at that same hardware level is more practical.
Scout’s multimodal native capability — not an afterthought, not via a separate vision model — is genuinely differentiated. For home AI setups where image and document processing is part of the workflow, it’s the right model to run despite the harder hardware requirements. For pure text workloads on commodity hardware, Llama 3.3 70B remains the easier path.
See the full open-weights model comparison here for how Scout fits against Qwen3 and other current alternatives.
Sources
- The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation — Meta AI
- unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF — Hugging Face
- llama4:scout model page — Ollama Library
- Llama 3.3 70B Instruct vs Llama 4 Scout Comparison — llm-stats.com
- Llama 4 GPU System Requirements — apxml.com
- Llama 4 Hardware Guide — Scout 12GB, Maverick 48GB+ — Compute Market
- Llama 4 Guide: Running Scout and Maverick Locally — InsiderLLM
- How to Run Llama 4 on Ollama — Serverman
- Llama 4: How to Run & Fine-tune — Unsloth Documentation
Last updated May 23, 2026. Hardware prices and model availability change; verify current listings before purchasing.