How Much VRAM Do You Need to Run Llama Models in 2026
VRAM is the single hardest constraint on running language models locally. A CPU can be upgraded, RAM is cheap, storage is cheap — but if a model’s weights and KV cache do not fit in your GPU’s video memory, inference speed collapses by an order of magnitude. This guide gives you the numbers for every Llama variant you would actually consider running on consumer or prosumer hardware in 2026.
If you only have 90 seconds, jump to the “Recommended GPU by model size” table at the end. Otherwise, read on for the math and the caveats.
The simple weight-only math
A model’s weight VRAM cost is essentially:
VRAM = (number of parameters) × (bits per weight) / 8 bits per byte
So a 7B model in FP16 (16 bits per weight) is 7e9 × 16 / 8 = 14e9 bytes ≈ 14 GB. The same model in 4-bit quantization is 7e9 × 4 / 8 = 3.5e9 bytes ≈ 3.5 GB — a 75% reduction.
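If you want to run this arithmetic for any model yourself, here is a minimal Python sketch of the weights-only formula. It uses decimal gigabytes (the model-card convention) and ignores runner overhead and the KV cache, which we get to below:

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only VRAM in decimal GB: params x bits / 8, no runner overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: {weight_vram_gb(7, bits):.1f} GB")
# 7B at 16-bit: 14.0 GB
# 7B at 8-bit: 7.0 GB
# 7B at 4-bit: 3.5 GB
```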
This is the headline number you see in model cards. It is also incomplete. The KV cache, which grows linearly with context length, can match or exceed the weight cost on long contexts. Plan accordingly.
For background on what quantization actually does to model quality, see our quantization explainer.
Llama 3.2 (small models)
Llama 3.2 introduced 1B and 3B text-only models alongside 11B and 90B vision models.
| Model | FP16 weights | Q8 | Q4 (sweet spot) | Q4 + 8K context | Best minimum GPU |
|---|---|---|---|---|---|
| Llama 3.2 1B | ~2 GB | ~1.1 GB | ~0.7 GB | ~1.5 GB | Any modern GPU; runs fine on CPU |
| Llama 3.2 3B | ~6 GB | ~3.3 GB | ~2 GB | ~3 GB | 6 GB+ (RTX 3050, M1 Mac) |
| Llama 3.2 11B Vision | ~22 GB | ~12 GB | ~7 GB | ~9 GB | 12 GB+ (RTX 3060 12GB, 4070) |
| Llama 3.2 90B Vision | ~180 GB | ~95 GB | ~52 GB | ~60 GB | 2× 24 GB (3090/4090) or 1× 80 GB |
Llama 3.2 1B and 3B are the practical choices when you want fast responses on modest hardware, or want to run a model on-device on a phone or laptop. Quality trails the 8B, but for classification, simple QA, and code completion the gap is smaller than you might expect.
Llama 3.1 (the 8B and 70B everyone runs)
The Llama 3.1 family is what most “run an LLM at home” tutorials are written about, and the 8B and 70B are the canonical reference points for “what fits on consumer hardware.”
| Model | FP16 weights | Q8 | Q4_K_M | Q4 + 16K context | Best minimum GPU |
|---|---|---|---|---|---|
| Llama 3.1 8B | ~16 GB | ~9 GB | ~5 GB | ~7 GB | 8 GB+ (RTX 3060, 4060) — 12 GB ideal |
| Llama 3.1 70B | ~140 GB | ~75 GB | ~40 GB | ~46 GB | 48 GB (A6000) or 2× 24 GB |
| Llama 3.1 405B | ~810 GB | ~430 GB | ~230 GB | ~245 GB | 8× H100, multi-node, or cloud only |
The 8B is the model most people actually run day-to-day. At Q4_K_M it fits comfortably in 8 GB of VRAM, leaving room for a 4–8 K context. With 12 GB you can push the context to 32 K.
The 70B is the inflection point. With aggressive quantization (Q3 or IQ2), people fit 70B into 24 GB — but quality starts to slip noticeably. The honest minimum for “70B at full quality” is two 3090s or 4090s in a single machine, or an A6000 / RTX 6000 Ada with 48 GB.
The 405B is, frankly, not a consumer model. It exists as a research artifact and is run on multi-GPU enterprise rigs. Most people who want 405B-class quality use it through a hosted API.
Llama 4 (Scout, Maverick, Behemoth — and the MoE twist)
Llama 4 is a mixture-of-experts (MoE) architecture, which complicates the VRAM math. Active parameters drive compute speed; total parameters drive VRAM cost. You have to load the whole model into memory; you just do not run all of it on every token.
| Model | Active params | Total params | FP16 weights | Q4 weights | Realistic minimum GPU |
|---|---|---|---|---|---|
| Llama 4 Scout | ~17 B | ~109 B | ~218 GB | ~62 GB | 80 GB (H100, A100 80) or 2× 48 GB |
| Llama 4 Maverick | ~17 B | ~400 B | ~800 GB | ~225 GB | Multi-node or cloud only |
| Llama 4 Behemoth | ~290 B | ~2 T | not feasible to run locally | — | Cloud / training labs only |
Scout is the only Llama 4 variant that meaningfully shows up on consumer-adjacent hardware. With heavy quantization (Q3_K) and partial CPU offload through GGUF, people have run Scout on 64 GB of system RAM plus a 24 GB GPU at a few tokens per second. It is technically possible, not practically pleasant.
The trick MoE makes possible: even though you load all ~109 B parameters, only ~17 B are “hot” on any given token, so your effective compute is closer to that of a 17B dense model. This is why MoE serving stacks are popular — high quality at moderate compute cost — but the VRAM footprint remains the full model size, and that is the hard constraint at home.
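To make the total-versus-active distinction concrete, here is the same weight arithmetic applied to Scout. This is a sketch of the raw math; the ~62 GB Q4 figure in the table above is higher because GGUF files keep some tensors at higher precision and carry metadata:

```python
def params_gb(params_billion: float, bits: float) -> float:
    """Decimal GB for a given parameter count at a given bits-per-weight."""
    return params_billion * 1e9 * bits / 8 / 1e9

total_b, active_b = 109, 17  # Llama 4 Scout, figures from the table above

print(f"Must load (Q4):     ~{params_gb(total_b, 4):.1f} GB")   # ~54.5 GB
print(f"Hot per token (Q4): ~{params_gb(active_b, 4):.1f} GB")  # ~8.5 GB
```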
Context length matters more than people expect
The KV (key/value) cache stores the attention keys and values for every token in your context window. Its size scales as:
KV cache ≈ 2 × layers × kv_heads × head_dim × tokens × bytes_per_element
For models without grouped-query attention, kv_heads × head_dim is simply the hidden size; Llama 3 and later use GQA with 8 KV heads, which shrinks the cache several-fold.
For Llama 3.1 8B (32 layers, 8 KV heads, head dim 128), this works out to roughly 0.13 GB per 1K tokens at FP16. For the 70B (80 layers, same KV geometry), roughly 0.33 GB per 1K tokens. So:
- 8B at 32 K context: ~4 GB just for KV cache, growing to ~17 GB at the full 128 K window (more than triple the Q4 weights!)
- 70B at 32 K context: ~11 GB just for KV cache
Modern runners support KV cache quantization (commonly Q8 or Q4 cache), which roughly halves or quarters this overhead at minor quality cost. If you plan to use long contexts, enable KV cache quantization in your runner — it is usually a one-line config change.
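Here is that formula as a small Python sketch. The layer shapes below are the published Llama 3.1 configurations, and cache_bits lets you model KV cache quantization (16 for FP16, 8 or 4 for a quantized cache):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, cache_bits: int = 16) -> float:
    """Decimal GB of KV cache: one K and one V vector per layer, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * (cache_bits / 8) / 1e9

llama_31_8b = dict(layers=32, kv_heads=8, head_dim=128)    # published 8B config
llama_31_70b = dict(layers=80, kv_heads=8, head_dim=128)   # published 70B config

for ctx in (8_192, 32_768, 131_072):
    fp16 = kv_cache_gb(**llama_31_8b, tokens=ctx)
    q4 = kv_cache_gb(**llama_31_8b, tokens=ctx, cache_bits=4)
    print(f"8B @ {ctx:,} tokens: {fp16:.1f} GB FP16 cache, {q4:.1f} GB Q4 cache")
# 8B @ 8,192 tokens: 1.1 GB FP16 cache, 0.3 GB Q4 cache
# 8B @ 32,768 tokens: 4.3 GB FP16 cache, 1.1 GB Q4 cache
# 8B @ 131,072 tokens: 17.2 GB FP16 cache, 4.3 GB Q4 cache

print(f"70B @ 32,768 tokens: "
      f"{kv_cache_gb(**llama_31_70b, tokens=32_768):.1f} GB FP16 cache")
# 70B @ 32,768 tokens: 10.7 GB FP16 cache
```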
Recommended GPU by model size
This is the cheat sheet most people are actually looking for:
| You want to run… | Minimum GPU (Q4, short context) | Comfortable GPU (Q4, 16K context) | Headroom GPU (Q5+, 32K context) |
|---|---|---|---|
| Llama 3.2 1B / 3B | Any 4 GB+ card | 8 GB | 12 GB |
| Llama 3.1 8B | 8 GB (RTX 3060, 4060) | 12 GB (3060 12GB, 4070) | 16 GB (4060 Ti 16GB, 5060 Ti 16GB) |
| Llama 3.2 11B Vision | 12 GB | 16 GB | 24 GB (3090, 4090, 5090) |
| Llama 3.1 70B | 24 GB at Q3 (compromised) | 48 GB (A6000) | 2× 24 GB or 80 GB |
| Llama 4 Scout | 80 GB | 96 GB+ | 128 GB+ |
The 16 GB consumer card is the 2026 sweet spot. For most home users, an RTX 5060 Ti 16GB or 4060 Ti 16GB hits a clean point: full Q4_K_M of any 8B model with room for a generous context window, comfortable headroom for image generation in parallel, and fully interactive speeds, all at a fraction of the price of a 4090. Going beyond 16 GB only makes sense if you specifically want 70B-class models.
What if your model does not fit?
Four options, in order of preference:
- Pick a smaller model or a tighter quant. Honestly, Llama 3.1 8B at Q4_K_M outperforms most fine-tuned 7B models, and a clean Q4 8B beats a struggling Q2 70B for almost every real task.
- Use CPU offload through GGUF. llama.cpp lets you push some of the layers to system RAM. You will lose 5–20× in speed, but it is the difference between running and not running. Best when your CPU and system memory are fast; DDR4 systems suffer here. (See the sketch after this list.)
- Rent a GPU by the hour. RunPod gives you an RTX 4090 for around $0.34/hour on Community Cloud, or an A100 80GB for $1–2/hour. Spin one up, pull the model, run your inference, and shut it down. Total cost for a few hours of 70B-class testing is often under $5.
- Use a hosted endpoint. Together AI, Replicate, OpenRouter, and Hugging Face Inference serve every Llama variant per-token. For occasional 70B-class quality without managing the hardware, this is the simplest answer.
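As an illustration of partial offload, here is a minimal sketch using the llama-cpp-python bindings. The model path is hypothetical, and n_gpu_layers is the knob that decides how many layers live in VRAM versus system RAM:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=40,  # layers kept in VRAM; the rest run from system RAM
    n_ctx=8192,       # context window to allocate KV cache for
)

out = llm("Q: Why is the sky blue? A:", max_tokens=32)
print(out["choices"][0]["text"])
```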
Final notes
VRAM math is approximate, not exact. Different runners (Ollama, LM Studio, vLLM, exllamav2) have slightly different overhead profiles. Different quantization implementations (GGUF Q4_K_M vs AWQ 4-bit) sometimes differ by 5–10% in actual size. Driver version, CUDA version, and whatever else is using the GPU (the desktop compositor, a hardware-accelerated browser) all claim their own slice of VRAM. Always leave headroom.
The numbers in this guide come from published model card sizes for the specific releases noted, with KV cache estimates from llama.cpp’s own profiling. They will be very close to what you see in practice, though rarely exactly identical.
Once you know what you can run, the next question is which runner to load it in — Ollama, LM Studio, llama.cpp, or Jan.ai? See our side-by-side comparison for that decision.