Q4 vs Q5 vs Q6 vs Q8 Quantization: Real Quality Loss Numbers for Local LLMs (2026)
“Q4_K_M is good enough for most use cases” is probably the most repeated piece of local AI advice. It’s also the least useful, because it never comes with the actual quality numbers. How much quality are you actually giving up going from Q8 to Q4? When does it show up in coding tasks? When can you not fit Q5 in VRAM even if you want to? Those are the questions this article answers — with verified benchmarks, not vibes.
The actual perplexity numbers
Perplexity measures how “surprised” a language model is by a test corpus — lower is better. The data below comes from the llama.cpp GitHub discussion tracking quantization quality on a 7B model, where the authors measured perplexity increase over the F16 baseline:
| Quantization | Perplexity delta vs F16 | Quality verdict |
|---|---|---|
| Q8_0 | +0.0004 | Essentially lossless |
| Q6_K | +0.0044 | Imperceptible in practice |
| Q5_K_M | +0.0142 | Barely measurable |
| Q5_K_S | +0.0353 | Tiny |
| Q4_K_M | +0.0535 | Small but real |
| Q3_K_M | +0.2437 | Noticeable — avoid unless VRAM-constrained |
The F16 baseline for a 7B model sits around 5.96 perplexity. Q4_K_M raises that to ~6.01, and Q8_0 raises it to about 5.9604. From Q4 through Q8, the quality difference is measured in hundredths of a perplexity point, not the catastrophic degradation people often imagine.
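To make the table concrete, here is a minimal Python sketch that turns each delta into an absolute perplexity and a relative increase. It assumes the approximate 5.96 F16 baseline quoted above. The `perplexity` helper shows the standard definition (the exponential of the mean negative log-likelihood per token); the log-probabilities passed to it at the end are made-up illustration values, not measurements.

```python
import math

F16_BASELINE = 5.96  # approximate F16 perplexity quoted in the text

# Perplexity increase over F16, copied from the table above.
DELTAS = {
    "Q8_0": 0.0004,
    "Q6_K": 0.0044,
    "Q5_K_M": 0.0142,
    "Q5_K_S": 0.0353,
    "Q4_K_M": 0.0535,
    "Q3_K_M": 0.2437,
}


def perplexity(token_logprobs):
    """Standard definition: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def absolute_and_relative(delta, baseline=F16_BASELINE):
    """Return (absolute perplexity, percent increase over the baseline)."""
    return baseline + delta, 100.0 * delta / baseline


if __name__ == "__main__":
    for name, delta in DELTAS.items():
        ppl, pct = absolute_and_relative(delta)
        print(f"{name:7s} ppl={ppl:.4f}  (+{pct:.2f}% vs F16)")
    # Hypothetical natural-log token probabilities, for illustration only.
    print(perplexity([-1.2, -0.4, -2.3, -0.9]))
```

Under these assumptions Q4_K_M lands a little under 1% above F16, while Q3_K_M is roughly 4% above it.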
That said, aggregate perplexity is not the whole story. Quantization error is nonuniform: it affects certain reasoning chains and vocabulary distributions more than the average suggests. The next sections break that out.
File sizes: 8B vs 70B
Numbers from the llama.cpp official quantize README for Llama-3.1-8B, and from bartowski’s Meta-Llama-3.1-70B-Instruct-GGUF on Hugging Face:
| Quantization | 8B model | 70B model | Bits per weight (8B) |
|---|---|---|---|
| Q4_K_M | 4.92 GB | 42.5 GB | 4.89 |
| Q5_K_M | 5.73 GB | 49.9 GB | 5.70 |
| Q6_K | 6.60 GB | 57.9 GB | 6.56 |
| Q8_0 | 8.54 GB | 75.0 GB | 8.50 |
| F16 (reference) | ~16.1 GB | ~140 GB | 16.0 |
The 70B column is where hardware reality hits hardest. Q4_K_M at 42.5 GB fits across two 24 GB consumer GPUs (two RTX 3090s or two RTX 4090s) with a few gigabytes left for the KV cache. Q5_K_M at 49.9 GB does not — it overflows by about 2 GB before you add any context. Q8_0 at 75 GB needs approximately four 24 GB cards or a Mac Studio with 192 GB unified memory. For anyone running 70B models on consumer hardware, Q4_K_M isn’t a choice; it’s the ceiling.
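The fit-or-overflow arithmetic above can be sketched in a few lines of Python. This is a weights-only estimate: it ignores the KV cache, per-GPU runtime overhead, and how evenly layers split across cards, so treat a small positive headroom as "maybe" rather than "yes". The function names are illustrative.

```python
import math

GPU_VRAM_GB = 24.0  # per-card capacity discussed in the text

# 70B file sizes in GB, copied from the table above.
SIZES_70B_GB = {"Q4_K_M": 42.5, "Q5_K_M": 49.9, "Q6_K": 57.9, "Q8_0": 75.0}


def headroom_gb(model_gb, num_gpus, vram_per_gpu_gb=GPU_VRAM_GB):
    """VRAM left after loading the weights; negative means they do not fit."""
    return num_gpus * vram_per_gpu_gb - model_gb


def min_gpus(model_gb, vram_per_gpu_gb=GPU_VRAM_GB):
    """Smallest number of cards whose combined VRAM holds the weights alone."""
    return math.ceil(model_gb / vram_per_gpu_gb)


if __name__ == "__main__":
    for name, size in SIZES_70B_GB.items():
        print(
            f"{name:7s} headroom on 2 cards: {headroom_gb(size, 2):+.1f} GB, "
            f"minimum cards: {min_gpus(size)}"
        )
```

On two 24 GB cards this gives Q4_K_M 5.5 GB of headroom and puts Q5_K_M 1.9 GB over budget, matching the figures in the paragraph above.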
Speed: how much slower is Q8 than Q4?
LLM text generation is memory-bandwidth-bound: the GPU spends most of its time reading weights from VRAM, not doing multiply-adds. Bigger quantization = more bytes to read per token generated = slower output. The llama.cpp README publishes benchmark numbers for Llama-3.1-8B text generation:
| Quantization | Text gen (tok/s) | Slowdown vs Q4_K_M |
|---|---|---|
| Q4_K_S | 76.7 | +7% (faster than Q4_K_M) |
| Q4_K_M | 71.9 | baseline |
| Q5_K_S | 69.5 | −3% |
| Q5_K_M | 67.2 | −7% |
| Q6_K | 58.7 | −18% |
| Q8_0 | 50.9 | −29% |
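Q8_0 generates tokens 29% more slowly than Q4_K_M. Q5_K_M is only 7% slower. The steepest drops come above Q5_K_M: moving from Q5_K_M to Q6_K costs 8.5 tok/s, and moving from Q6_K to Q8_0 costs another 7.8 tok/s, as bits per weight climb from 5.70 to 6.56 to 8.50 and every generated token has to read correspondingly more bytes.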
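The percentages in the table follow directly from the tok/s column. A small Python sketch that recomputes them from the published throughput numbers, plus the absolute tok/s lost at each step up the ladder:

```python
# Text-generation throughput (tok/s) for Llama-3.1-8B, copied from the table above.
TOKENS_PER_SEC = {
    "Q4_K_S": 76.7,
    "Q4_K_M": 71.9,
    "Q5_K_S": 69.5,
    "Q5_K_M": 67.2,
    "Q6_K": 58.7,
    "Q8_0": 50.9,
}


def relative_change_pct(quant, baseline="Q4_K_M"):
    """Throughput change versus the baseline, in percent (negative = slower)."""
    return 100.0 * (TOKENS_PER_SEC[quant] / TOKENS_PER_SEC[baseline] - 1.0)


def step_drops(order):
    """Absolute tok/s lost at each step up the given ladder of quant names."""
    return {
        f"{a} -> {b}": TOKENS_PER_SEC[a] - TOKENS_PER_SEC[b]
        for a, b in zip(order, order[1:])
    }


if __name__ == "__main__":
    for name in TOKENS_PER_SEC:
        print(f"{name:7s} {relative_change_pct(name):+.1f}%")
    print(step_drops(["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"]))
```

Swapping in your own measured tok/s values is the quickest way to check whether the same ratios hold on your hardware.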
These ratios should hold approximately across hardware types because the bottleneck is the same everywhere: bytes read from VRAM or RAM per generated token. Whether you’re on an RTX 4090 doing 130 tok/s or an M3 Max doing 50 tok/s, expect Q8_0 to run roughly 29% slower than Q4_K_M for generation.
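The bandwidth-bound argument can be turned into a rough ceiling: if every generated token has to read all the weights once, throughput cannot exceed memory bandwidth divided by model size. The sketch below assumes that simplified model and uses 1008 GB/s as the RTX 4090's spec-sheet bandwidth, which is my assumption rather than a figure from the benchmark source; substitute your own card's number.

```python
def max_tokens_per_sec(bandwidth_gb_s, model_gb):
    """Upper bound if each generated token reads all weights exactly once."""
    return bandwidth_gb_s / model_gb


def predicted_slowdown_pct(model_gb, baseline_gb):
    """Slowdown implied purely by the extra bytes read per token."""
    return 100.0 * (1.0 - baseline_gb / model_gb)


if __name__ == "__main__":
    bandwidth = 1008.0  # GB/s, assumed spec-sheet value for an RTX 4090
    q4_gb, q8_gb = 4.92, 8.54  # 8B file sizes from the table earlier
    print(f"Q4_K_M ceiling: {max_tokens_per_sec(bandwidth, q4_gb):.0f} tok/s")
    print(f"Q8_0 ceiling:   {max_tokens_per_sec(bandwidth, q8_gb):.0f} tok/s")
    print(f"Size-only slowdown: {predicted_slowdown_pct(q8_gb, q4_gb):.0f}%")
```

The size-only model predicts a slowdown of about 42% for Q8_0, larger than the 29% measured, so it is best read as a ceiling on throughput rather than a forecast: real runs also spend time on dequantization, attention, and kernel overhead that do not scale with file size.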
When the quality gap actually shows up
Casual chat and assistant use: The perplexity delta between Q4_K_M and Q8_0 is 0.0531 points — below the threshold that shows up as perceptibly different responses in normal conversation. Experienced users running back-to-back A/B tests on the same prompt can sometimes spot it; typical use doesn’t.
Coding: Here’s a specific number from the JarvisLabs 2026 benchmark using Qwen2.5-32B-Instruct on HumanEval Pass@1: Q4_K_M, AWQ, and bitsandbytes all scored 51.8%, while GPTQ 4-bit trailed at 46%. The takeaway: GGUF Q4_K_M preserves coding ability at the same level as AWQ (which uses a more sophisticated quantization algorithm). Going up to Q5_K_M or Q6_K won’t unlock meaningfully better code generation on standard benchmarks.
Math and multi-step reasoning: This is where the gap actually appears. Quantization error accumulates across long reasoning chains — a small token probability shift early in a chain-of-thought sequence can route the model to a wrong step that wouldn’t happen at higher precision. Unsloth’s Qwen3.5 GGUF benchmarks show Q4_K_M perplexity at 6.6097 versus Q5_K_M at 6.5828 — a 0.027-point difference that, while small in absolute terms, corresponds to measurably lower accuracy on math reasoning tasks like MATH and GSM8K. If you’re using a model primarily for multi-step math or long-form structured reasoning, Q5_K_M is the better choice.
Creative writing and long-form generation: Higher precision helps keep narrative coherence over long token runs. Q5_K_M or Q6_K is worth the extra VRAM for work that demands consistent style across thousands of tokens.
Fine-tuned models (instruction, chat): Instruction-tuned models are more robust to quantization than base models because the fine-tuning itself tends to put heavy probability mass on the “right” answer tokens, reducing the sensitivity to small probability shifts. Q4_K_M is nearly always fine here.
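The point about math and multi-step reasoning, that small per-step shifts compound over a long chain, can be illustrated with a toy model. The sketch below assumes every step succeeds independently with the same probability, which real models do not satisfy, and the accuracies used are hypothetical numbers chosen for illustration, not benchmark results.

```python
def chain_success(per_step_accuracy, steps):
    """Probability that every step in an independent chain is correct."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Hypothetical per-step accuracies for a higher- and lower-precision quant.
    for label, acc in [("higher precision", 0.995), ("lower precision", 0.990)]:
        for steps in (1, 5, 20):
            print(f"{label:16s} steps={steps:2d} success={chain_success(acc, steps):.3f}")
```

In this toy setup a half-point gap per step is barely visible on a single step but grows to several points of end-to-end accuracy by twenty steps, which is the shape of the effect described above.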
The KV cache factor people miss
The model weights are only part of your VRAM budget. Long-context inference also fills the KV cache, which grows linearly with context length and is larger for bigger models. For a 70B model at long contexts (32K+ tokens), it adds several gigabytes on top of the 42.5 GB of Q4_K_M weights; the exact amount depends on the layer count, the number of KV heads, the head dimension, and the cache data type.
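A rough KV cache estimate is two tensors (keys and values) per layer, each holding one vector per KV head per token. The Python sketch below assumes that standard layout. The Llama-3.1-70B architecture values (80 layers, 8 KV heads, head dimension 128) are my assumption from the published model configuration, so check the `config.json` of the model you actually run.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """Size of the KV cache: keys + values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed architecture values for Llama-3.1-70B (verify against config.json).
LLAMA_31_70B = {"n_layers": 80, "n_kv_heads": 8, "head_dim": 128}


if __name__ == "__main__":
    for ctx in (8192, 32768, 131072):
        size = kv_cache_bytes(context_tokens=ctx, **LLAMA_31_70B)
        print(f"context {ctx:6d}: {size / 2**30:.1f} GiB at 16-bit")
```

With these assumed values, a 32K context at 16-bit works out to 10 GiB, which is more than the headroom left by Q4_K_M weights on two 24 GB cards.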
Choosing Q5_K_M over Q4_K_M on a 24 GB card running an 8B model costs you 0.81 GB in model size (~5.73 vs ~4.92 GB). That’s worthwhile. The same choice on a dual-24 GB setup running 70B costs you 7.4 GB (49.9 vs 42.5 GB) and blows past your VRAM headroom.
One practical option: keep Q4_K_M weights but quantize the KV cache separately. llama.cpp supports --cache-type-k q8_0 --cache-type-v q8_0 flags to store the KV cache at 8-bit. This cuts cache VRAM by roughly 50% with minimal quality impact, a useful trade-off when you want long context on a memory-constrained machine.
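The "roughly 50%" figure can be checked from the Q8_0 storage format. The sketch below assumes the ggml Q8_0 block layout of 32 int8 values plus one 16-bit scale (34 bytes per block), which is also where the 8.50 bits per weight in the file-size table comes from.

```python
Q8_0_BLOCK_VALUES = 32      # values stored per block
Q8_0_BLOCK_BYTES = 32 + 2   # 32 int8 values + one 16-bit scale (assumed layout)
F16_BITS = 16.0


def q8_0_bits_per_value():
    """Effective storage cost of one cached value in Q8_0."""
    return 8.0 * Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES


def cache_saving_fraction():
    """Fraction of KV cache memory saved versus a 16-bit cache."""
    return 1.0 - q8_0_bits_per_value() / F16_BITS


if __name__ == "__main__":
    print(f"Q8_0: {q8_0_bits_per_value():.2f} bits per value")
    print(f"Saving vs 16-bit cache: {cache_saving_fraction():.1%}")
```

Under that layout the saving is just under half, because each block carries its own scale in addition to the 8-bit values.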
IQ-quants: the smaller Q4 that still fits
The llama.cpp I-quant family uses importance matrices to allocate precision unevenly across weights, assigning more bits to the parameters that influence output the most. IQ4_XS uses approximately 4.46 bits per weight versus Q4_K_M’s 4.89 bpw, resulting in about 4.17 GiB for an 8B model versus Q4_K_M’s 4.58 GiB (the same file listed as 4.92 GB in the size table, expressed in binary rather than decimal units). That difference of roughly 400 MB matters for edge cases where Q4_K_M only just fails to fit.
The trade-off: I-quants decompress more slowly on CPUs; K-quants typically give higher tokens per second on consumer hardware where the decompression happens inline with CUDA. If your model runs fully on GPU, Q4_K_M or Q5_K_M is usually faster. If you need to squeeze one more gigabyte out of a tight VRAM budget, IQ4_XS achieves near-identical perplexity at smaller size.
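The bits-per-weight figures above map to file sizes with simple arithmetic: parameters times bits per weight, divided by eight. The Python sketch below assumes about 8.03 billion parameters for Llama-3.1-8B (my assumption, not stated in the source) and ignores GGUF metadata, so results land within a few tens of megabytes of the published sizes rather than matching exactly.

```python
N_PARAMS_8B = 8.03e9  # assumed parameter count for Llama-3.1-8B


def file_size_bytes(bits_per_weight, n_params=N_PARAMS_8B):
    """Approximate weight storage: parameters x bits per weight / 8."""
    return n_params * bits_per_weight / 8.0


def to_gb(num_bytes):
    """Decimal gigabytes (10^9 bytes), as used in the file-size table."""
    return num_bytes / 1e9


def to_gib(num_bytes):
    """Binary gibibytes (2^30 bytes), as used in the I-quant comparison."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for name, bpw in [("IQ4_XS", 4.46), ("Q4_K_M", 4.89), ("Q5_K_M", 5.70)]:
        size = file_size_bytes(bpw)
        print(f"{name:7s} {bpw:.2f} bpw -> {to_gb(size):.2f} GB / {to_gib(size):.2f} GiB")
```

Keeping GB and GiB separate matters here: the same Q4_K_M file is about 4.9 in one unit and about 4.6 in the other, which is easy to mistake for a real size difference.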
The use-case decision table
| Use case | Recommended | Why |
|---|---|---|
| General chat / assistant | Q4_K_M | Perplexity delta imperceptible; saves VRAM |
| AI coding assistant (via Continue.dev + Ollama) | Q4_K_M | HumanEval Pass@1 identical to higher quants |
| Multi-step math / reasoning | Q5_K_M | Measurably lower perplexity; fits most 12 GB+ GPUs |
| Creative writing, long-form | Q5_K_M or Q6_K | Coherence over long context |
| 70B on dual 24 GB GPUs | Q4_K_M only | Q5_K_M doesn’t fit |
| 8B on 8 GB VRAM | Q4_K_M | Q5_K_M at 5.73 GB leaves little KV headroom |
| Validation / testing accuracy | Q8_0 | Near-lossless; use to establish a quality baseline |
| Limited VRAM, Q4 too big | IQ4_XS | ~400 MB smaller, similar quality, slower on CPU |
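The VRAM-driven rows of the table can be sketched as a small helper that picks the largest quant whose weights fit after reserving headroom. It only encodes the memory constraint, not the use-case rows, and it assumes a larger file means higher precision. The `reserve_gb` parameter is a hypothetical allowance for KV cache and runtime overhead that you would tune for your own context length.

```python
# File sizes in GB, copied from the size table earlier in the article.
SIZES_8B_GB = {"Q4_K_M": 4.92, "Q5_K_M": 5.73, "Q6_K": 6.60, "Q8_0": 8.54}
SIZES_70B_GB = {"Q4_K_M": 42.5, "Q5_K_M": 49.9, "Q6_K": 57.9, "Q8_0": 75.0}


def largest_fitting_quant(sizes_gb, vram_gb, reserve_gb):
    """Largest quant whose weights fit in vram_gb minus the reserved headroom."""
    budget = vram_gb - reserve_gb
    fitting = [(size, name) for name, size in sizes_gb.items() if size <= budget]
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    # reserve_gb values below are illustrative guesses, not measurements.
    print(largest_fitting_quant(SIZES_8B_GB, vram_gb=8, reserve_gb=2.5))
    print(largest_fitting_quant(SIZES_70B_GB, vram_gb=48, reserve_gb=4.0))
    print(largest_fitting_quant(SIZES_70B_GB, vram_gb=24, reserve_gb=4.0))
```

A `None` result means nothing in the list fits, which is the cue to look at a smaller format such as IQ4_XS or a smaller model.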
How the K and S designations affect quality
The quantization name encodes both precision level and which tensors get the extra bits. Q5_K_M means 5-bit K-quant, medium mix: it allocates higher precision to the most quality-sensitive tensors, averaging 5.70 bpw across the full model. Q5_K_S (small mix) keeps everything at 5 bits, averaging 5.57 bpw, and is about 120 MB smaller with slightly higher perplexity. The M variant almost always outperforms the S variant in quality-per-byte; the S variant is useful only if you’re genuinely constrained to the smaller size.
For the Q4 tier: Q4_K_M averages 4.89 bpw by storing some tensors at 6 bits. Q4_K_S averages 4.67 bpw and is about 240 MB smaller with marginally worse quality. The perplexity difference between Q4_K_S and Q4_K_M is tiny, which is why Q4_K_S sometimes appears in automated model serving pipelines where the bandwidth savings matter more than the quality edge.
Honest take
If you’re running an 8B model on a 12 GB card and spending any mental energy debating Q4 vs Q5, spend it elsewhere. Pick Q5_K_M if you have the headroom; Q4_K_M if you’re tight. The quality difference won’t be the bottleneck in your workflow.
Where quantization actually matters is the 70B case. The entire conversation changes when you’re working with a model that takes up 42.5 GB at Q4_K_M — at that point Q5 simply doesn’t fit on common consumer hardware, and the decision is made for you by arithmetic.
The one case where going to Q8_0 makes unambiguous sense: when you’re establishing a quality baseline for a fine-tuned model you’re evaluating. Q8_0 is as close to F16 behavior as you’ll get without running F16, and the 29% speed penalty is irrelevant for evaluation workloads. Run Q8_0 once to establish ground truth, then compare Q4_K_M against it — if outputs match, you’re done.
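One simple way to run that comparison is an exact-match agreement rate between the two sets of outputs. The sketch below assumes you have already generated answers for the same prompts with both quants using greedy decoding (temperature 0), since sampled outputs will differ even at identical precision. The example strings are hypothetical.

```python
def agreement_rate(baseline_outputs, candidate_outputs):
    """Fraction of prompts where both runs produced the same text (whitespace-trimmed)."""
    if len(baseline_outputs) != len(candidate_outputs):
        raise ValueError("both runs must cover the same prompts")
    if not baseline_outputs:
        raise ValueError("need at least one output to compare")
    matches = sum(
        a.strip() == b.strip() for a, b in zip(baseline_outputs, candidate_outputs)
    )
    return matches / len(baseline_outputs)


if __name__ == "__main__":
    # Hypothetical outputs from a Q8_0 run and a Q4_K_M run on the same prompts.
    q8_outputs = ["42", "def add(a, b): return a + b", "Paris"]
    q4_outputs = ["42", "def add(a, b): return a + b", "Paris, France"]
    print(f"agreement: {agreement_rate(q8_outputs, q4_outputs):.0%}")
```

Exact match is a strict criterion; for free-form text you may prefer to score both runs against your task's own metric and compare the scores instead.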
For inference serving at any scale, see the vLLM vs Ollama comparison: the quantization choice interacts with which server you use. vLLM has had native support for GGUF Q4_K_M since early 2026, and selecting the right model tier for your VRAM is upstream of the quantization decision. If you’re still learning how these formats differ fundamentally, Local LLM Quantization Explained covers GGUF, GPTQ, AWQ, and bitsandbytes from scratch.
Sources
- Quantization methods comparison — llama.cpp GitHub Discussion #2094
- Quantizing Models — llama.cpp official README with Llama-3.1-8B benchmark table
- bartowski/Meta-Llama-3.1-70B-Instruct-GGUF — Hugging Face (70B file sizes)
- Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct — arXiv 2601.14277
- Perplexity (Quality of Generation) Scores — llama.cpp GitHub Discussion #406
- The Complete Guide to LLM Quantization with vLLM — JarvisLabs (HumanEval 51.8% benchmark)
- Qwen3.5 GGUF Benchmarks — Unsloth Documentation
- Choosing a GGUF Model: K-Quants, I-Quants, and Legacy Formats — The Kaitchup
Last updated May 21, 2026. Model file sizes vary by architecture; verify against the specific model’s Hugging Face page before purchasing hardware.