May 21, 2026

Q4 vs Q5 vs Q6 vs Q8 Quantization: Real Quality Loss Numbers for Local LLMs (2026)

By RunAIHome Team · 10 min read


“Q4_K_M is good enough for most use cases” is probably the most repeated piece of local AI advice. It’s also the least useful, because it never comes with the actual quality numbers. How much quality are you actually giving up going from Q8 to Q4? When does it show up in coding tasks? When can you not fit Q5 in VRAM even if you want to? Those are the questions this article answers — with verified benchmarks, not vibes.

The actual perplexity numbers

Perplexity measures how “surprised” a language model is by a test corpus — lower is better. The data below comes from the llama.cpp GitHub discussion tracking quantization quality on a 7B model, where the authors measured perplexity increase over the F16 baseline:

Quantization	Perplexity delta vs F16	Quality verdict
Q8_0	+0.0004	Essentially lossless
Q6_K	+0.0044	Imperceptible in practice
Q5_K_M	+0.0142	Barely measurable
Q5_K_S	+0.0353	Tiny
Q4_K_M	+0.0535	Small but real
Q3_K_M	+0.2437	Noticeable — avoid unless VRAM-constrained

The F16 baseline for a 7B model sits around 5.96 perplexity. Q4_K_M raises that to ~6.01; Q8_0 raises it to ~5.9604. In the range of Q4 through Q8, the quality difference is measured in hundredths of a perplexity point, not the catastrophic degradation people often imagine.
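To make the arithmetic concrete, here is a minimal Python sketch that turns the deltas from the table into absolute perplexity and a percentage increase. The 5.96 baseline is the approximate figure quoted above, and the function names are illustrative.

```python
# Perplexity deltas vs F16 for a 7B model, copied from the table above.
F16_BASELINE = 5.96  # approximate baseline quoted in the article

DELTAS = {
    "Q8_0": 0.0004,
    "Q6_K": 0.0044,
    "Q5_K_M": 0.0142,
    "Q5_K_S": 0.0353,
    "Q4_K_M": 0.0535,
    "Q3_K_M": 0.2437,
}


def absolute_perplexity(delta, baseline=F16_BASELINE):
    """Perplexity after quantization: baseline plus the measured delta."""
    return baseline + delta


def relative_increase_pct(delta, baseline=F16_BASELINE):
    """Perplexity increase as a percentage of the F16 baseline."""
    return 100.0 * delta / baseline


for name, delta in DELTAS.items():
    ppl = absolute_perplexity(delta)
    pct = relative_increase_pct(delta)
    print(f"{name:7s} ppl={ppl:.4f} (+{pct:.2f}%)")
```

Under these numbers Q4_K_M lands less than 1% above the baseline, while Q3_K_M is roughly 4% above it.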

That said, aggregate perplexity is not the whole story. Quantization error is nonuniform: it affects certain reasoning chains and vocabulary distributions more than the average suggests. The next sections break that out.

File sizes: 8B vs 70B

Numbers from the llama.cpp official quantize README for Llama-3.1-8B, and from bartowski’s Meta-Llama-3.1-70B-Instruct-GGUF on Hugging Face:

Quantization	8B model	70B model	Bits per weight (8B)
Q4_K_M	4.92 GB	42.5 GB	4.89
Q5_K_M	5.73 GB	49.9 GB	5.70
Q6_K	6.60 GB	57.9 GB	6.56
Q8_0	8.54 GB	75.0 GB	8.50
F16 (reference)	~16.1 GB	~140 GB	16.0

The 70B column is where hardware reality hits hardest. Q4_K_M at 42.5 GB fits across two 24 GB consumer GPUs (two RTX 3090s or two RTX 4090s) with a few gigabytes left for the KV cache. Q5_K_M at 49.9 GB does not — it overflows by about 2 GB before you add any context. Q8_0 at 75 GB needs approximately four 24 GB cards or a Mac Studio with 192 GB unified memory. For anyone running 70B models on consumer hardware, Q4_K_M isn’t a choice; it’s the ceiling.
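The fit check behind that paragraph is simple subtraction, sketched below in Python. It treats every advertised gigabyte as usable and ignores the KV cache and per-GPU overhead, so real headroom will be smaller; the helper names are illustrative.

```python
import math

# 70B file sizes in GB, copied from the table above.
SIZES_70B_GB = {"Q4_K_M": 42.5, "Q5_K_M": 49.9, "Q6_K": 57.9, "Q8_0": 75.0}


def headroom_gb(model_gb, gpu_gb, n_gpus):
    """VRAM left after loading the weights; negative means they do not fit."""
    return gpu_gb * n_gpus - model_gb


def min_cards(model_gb, gpu_gb):
    """Smallest number of identical cards whose combined VRAM holds the weights."""
    return math.ceil(model_gb / gpu_gb)


for name, size in SIZES_70B_GB.items():
    left = headroom_gb(size, gpu_gb=24, n_gpus=2)
    status = "fits" if left >= 0 else "does not fit"
    print(f"{name:7s} {size:5.1f} GB on 2x24 GB: {status} ({left:+.1f} GB), "
          f"needs at least {min_cards(size, 24)} cards")
```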

Speed: how much slower is Q8 than Q4?

LLM text generation is memory-bandwidth-bound: the GPU spends most of its time reading weights from VRAM, not doing multiply-adds. Bigger quantization = more bytes to read per token generated = slower output. The llama.cpp README publishes benchmark numbers for Llama-3.1-8B text generation:

Quantization	Text gen (tok/s)	Slowdown vs Q4_K_M
Q4_K_S	76.7	+7% (faster than Q4_K_M)
Q4_K_M	71.9	baseline
Q5_K_S	69.5	−3%
Q5_K_M	67.2	−7%
Q6_K	58.7	−18%
Q8_0	50.9	−29%

Q8_0 generates tokens 29% more slowly than Q4_K_M. Q5_K_M is only 7% slower. The two largest steps are Q5_K_M to Q6_K (67.2 to 58.7 tok/s) and Q6_K to Q8_0 (58.7 to 50.9 tok/s), each costing roughly 13% of throughput; the second step pays for the jump from 6.56 to 8.50 bits per weight.
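The percentages can be reproduced from the tok/s column with a few lines of Python. This sketch only restates the table; the function name is illustrative.

```python
# Text-generation throughput in tok/s for Llama-3.1-8B, copied from the table above.
TOK_PER_S = {
    "Q4_K_S": 76.7,
    "Q4_K_M": 71.9,
    "Q5_K_S": 69.5,
    "Q5_K_M": 67.2,
    "Q6_K": 58.7,
    "Q8_0": 50.9,
}


def change_vs_baseline_pct(tok_s, baseline_tok_s):
    """Throughput change relative to the baseline; negative means slower."""
    return 100.0 * (tok_s / baseline_tok_s - 1.0)


baseline = TOK_PER_S["Q4_K_M"]
for name, tok_s in TOK_PER_S.items():
    print(f"{name:7s} {tok_s:5.1f} tok/s  {change_vs_baseline_pct(tok_s, baseline):+.1f}%")
```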

These ratios hold across hardware types because the bottleneck is the same everywhere: bytes read from VRAM or RAM per generated token. Whether you’re on an RTX 4090 doing 130 tok/s or an M3 Max doing 50 tok/s, Q8_0 will run roughly 29% slower than Q4_K_M for generation.
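A rough way to see why the ratios are hardware-independent is the idealized bandwidth-bound model sketched below, which assumes every generated token requires reading all the weights exactly once and nothing else. The 1000 GB/s and 400 GB/s bandwidth figures are hypothetical placeholders, not measurements from the benchmark.

```python
def ideal_tok_per_s(bandwidth_gb_s, weights_gb):
    """Upper bound on tok/s if each token costs one full read of the weights."""
    return bandwidth_gb_s / weights_gb


def ideal_slowdown_pct(small_gb, large_gb):
    """Slowdown of the larger file relative to the smaller one.

    Memory bandwidth cancels out, so the result depends only on file sizes.
    """
    return 100.0 * (1.0 - small_gb / large_gb)


# 8B file sizes in GB from the file-size table.
q4_k_m_gb, q8_0_gb = 4.92, 8.54

for bandwidth in (1000.0, 400.0):  # hypothetical GB/s values
    q4 = ideal_tok_per_s(bandwidth, q4_k_m_gb)
    q8 = ideal_tok_per_s(bandwidth, q8_0_gb)
    print(f"{bandwidth:6.0f} GB/s: Q4_K_M <= {q4:.0f} tok/s, Q8_0 <= {q8:.0f} tok/s")

print(f"ideal Q8_0 slowdown vs Q4_K_M: {ideal_slowdown_pct(q4_k_m_gb, q8_0_gb):.1f}%")
```

The idealized slowdown comes out near 42%, larger than the 29% in the published table, which suggests that per-token costs other than weight reads are not negligible in practice. Treat the model as a bound, not a forecast.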

When the quality gap actually shows up

Casual chat and assistant use: The perplexity delta between Q4_K_M and Q8_0 is 0.0531 points — below the threshold that shows up as perceptibly different responses in normal conversation. Experienced users running back-to-back A/B tests on the same prompt can sometimes spot it; typical use doesn’t.

Coding: Here’s a specific number from the JarvisLabs 2026 benchmark using Qwen2.5-32B-Instruct on HumanEval Pass@1: Q4_K_M, AWQ, and bitsandbytes all scored 51.8%. GPTQ 4-bit trailed at 46%. The takeaway: GGUF Q4_K_M preserves coding ability at the same level as AWQ (which uses a more sophisticated quantization algorithm). That benchmark compares 4-bit methods and does not test Q5_K_M or Q6_K directly, but together with the small perplexity deltas above it suggests that going up a tier won’t unlock meaningfully better code generation on standard benchmarks.

Math and multi-step reasoning: This is where the gap actually appears. Quantization error accumulates across long reasoning chains — a small token probability shift early in a chain-of-thought sequence can route the model to a wrong step that wouldn’t happen at higher precision. Unsloth’s Qwen3.5 GGUF benchmarks show Q4_K_M perplexity at 6.6097 versus Q5_K_M at 6.5828 — a 0.027-point difference that, while small in absolute terms, corresponds to measurably lower accuracy on math reasoning tasks like MATH and GSM8K. If you’re using a model primarily for multi-step math or long-form structured reasoning, Q5_K_M is the better choice.

Creative writing and long-form generation: Higher precision helps keep narrative coherence over long token runs. Q5_K_M or Q6_K is worth the extra VRAM for work that demands consistent style across thousands of tokens.

Fine-tuned models (instruction, chat): Instruction-tuned models are more robust to quantization than base models because the fine-tuning itself tends to put heavy probability mass on the “right” answer tokens, reducing the sensitivity to small probability shifts. Q4_K_M is nearly always fine here.

The KV cache factor people miss

The model weights are only part of your VRAM budget. Long-context inference also fills the KV cache, which grows linearly with context length and scales with the model’s layer count and number of KV heads. For a 70B model at long contexts (32K+ tokens), the KV cache adds several gigabytes on top of the 42.5 GB of Q4_K_M weights; the exact amount depends on model architecture, number of heads, and context length.
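For a sense of scale, the usual back-of-envelope KV cache formula is sketched below. The architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions based on the published Llama-3.1-70B configuration, not figures from this article's sources, and the formula ignores runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Approximate KV cache size for a single sequence.

    The leading 2 accounts for one K tensor and one V tensor per layer.
    bytes_per_elem defaults to 2.0, i.e. a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed shape for Llama-3.1-70B: 80 layers, 8 KV heads (GQA), head_dim 128.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=32768)
print(f"32K context, 16-bit cache: {size / 2**30:.1f} GiB")
```

Under those assumptions a 32K context costs about 10 GiB, which is more than the headroom left after loading 42.5 GB of weights onto two 24 GB cards.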

Choosing Q5_K_M over Q4_K_M on a 24 GB card running an 8B model costs you 0.81 GB in model size (~5.73 vs ~4.92 GB). That’s worthwhile. The same choice on a dual-24 GB setup running 70B costs you 7.4 GB (49.9 vs 42.5 GB) and blows past your VRAM headroom.

One practical option: keep Q4_K_M weights but quantize the KV cache separately. llama.cpp supports --cache-type-k q8_0 --cache-type-v q8_0 flags to store the KV cache at 8-bit. This cuts cache VRAM by roughly half relative to the default 16-bit cache with minimal quality impact, a useful trade-off when you want long context on a memory-constrained machine.
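As a sketch of how that looks in practice, the Python below estimates the saving and assembles a llama-server command line using the two flags named above. The model path is a hypothetical placeholder, and the 8.5 bits per element for q8_0 assumes the usual block layout of 32 int8 values plus one 16-bit scale. Some llama.cpp builds have required flash attention to be enabled before the V cache can be quantized, so check `--help` for your version.

```python
F16_BITS = 16.0
Q8_0_BITS = 8.5  # assumed: 32 int8 values + one 16-bit scale = 34 bytes per 32 elements


def cache_saving_pct(quant_bits=Q8_0_BITS, base_bits=F16_BITS):
    """Percentage of KV cache memory saved relative to the 16-bit cache."""
    return 100.0 * (1.0 - quant_bits / base_bits)


def llama_server_args(model_path, n_ctx):
    """Build an argument list for llama-server with an 8-bit KV cache."""
    return [
        "llama-server",
        "-m", model_path,          # path to the GGUF file (hypothetical here)
        "-c", str(n_ctx),          # context length in tokens
        "--cache-type-k", "q8_0",  # store K at 8-bit
        "--cache-type-v", "q8_0",  # store V at 8-bit
    ]


print(f"estimated cache saving: {cache_saving_pct():.1f}%")
print(" ".join(llama_server_args("models/example-70b-q4_k_m.gguf", 32768)))
```

The list form can be passed directly to `subprocess.run` if you want to launch the server from a script.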

IQ-quants: the smaller Q4 that still fits

The llama.cpp I-quant family uses importance matrices to allocate precision unevenly across weights, assigning more bits to the parameters that influence output the most. IQ4_XS uses approximately 4.46 bits per weight versus Q4_K_M’s 4.89 bpw, resulting in about 4.17 GiB for an 8B model versus Q4_K_M’s 4.58 GiB (the same file listed as 4.92 GB in the size table, which uses decimal gigabytes). That roughly 400 MB difference matters for edge cases where Q4_K_M barely doesn’t fit.
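The link between bits per weight and file size can be checked with a short calculation. The sketch below assumes roughly 8.03 billion parameters for the 8B model, a figure not stated in this article, and ignores GGUF metadata overhead.

```python
N_PARAMS_8B = 8.03e9  # assumed parameter count for Llama-3.1-8B


def gguf_size_gib(n_params, bits_per_weight):
    """Approximate weight size in GiB (2**30 bytes) from average bits per weight."""
    return n_params * bits_per_weight / 8 / 2**30


iq4_xs = gguf_size_gib(N_PARAMS_8B, 4.46)
q4_k_m = gguf_size_gib(N_PARAMS_8B, 4.89)
print(f"IQ4_XS: {iq4_xs:.2f} GiB")
print(f"Q4_K_M: {q4_k_m:.2f} GiB")
print(f"difference: {(q4_k_m - iq4_xs) * 2**30 / 1e6:.0f} MB")
```

The results land within about 0.01 GiB of the quoted sizes, which is a useful sanity check when a model page lists bpw but not file size.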

The trade-off: I-quants decompress more slowly on CPUs; K-quants typically give higher tokens per second on consumer hardware where the decompression happens inline with CUDA. If your model runs fully on GPU, Q4_K_M or Q5_K_M is usually faster. If you need to squeeze one more gigabyte out of a tight VRAM budget, IQ4_XS achieves near-identical perplexity at smaller size.

The use-case decision table

Use case	Recommended	Why
General chat / assistant	Q4_K_M	Perplexity delta imperceptible; saves VRAM
AI coding assistant (via Continue.dev + Ollama)	Q4_K_M	HumanEval Pass@1 on par with AWQ and bitsandbytes 4-bit
Multi-step math / reasoning	Q5_K_M	Measurably lower perplexity; fits most 12 GB+ GPUs
Creative writing, long-form	Q5_K_M or Q6_K	Coherence over long context
70B on dual 24 GB GPUs	Q4_K_M only	Q5_K_M doesn’t fit
8B on 8 GB VRAM	Q4_K_M	Q5_K_M at 5.73 GB leaves little KV headroom
Validation / testing accuracy	Q8_0	Near-lossless; use to establish a quality baseline
Limited VRAM, Q4 too big	IQ4_XS	~400 MB smaller, similar quality, slower on CPU
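The VRAM side of this table can be expressed as a small helper that returns the highest-precision quant whose weights fit alongside a reserve for the KV cache and runtime overhead. This is a sketch for the 8B sizes only; the `reserve_gb` values are hypothetical, and the function says nothing about speed or use case, which the table also weighs.

```python
# 8B file sizes in GB from the file-size table, highest precision first.
SIZES_8B_GB = [("Q8_0", 8.54), ("Q6_K", 6.60), ("Q5_K_M", 5.73), ("Q4_K_M", 4.92)]


def largest_quant_that_fits(vram_gb, reserve_gb, sizes=SIZES_8B_GB):
    """Return the first (highest-precision) quant that fits, or None.

    reserve_gb is VRAM set aside for KV cache and overhead (an assumption
    you choose yourself).
    """
    for name, size_gb in sizes:
        if size_gb + reserve_gb <= vram_gb:
            return name
    return None


print(largest_quant_that_fits(vram_gb=8, reserve_gb=2.5))
print(largest_quant_that_fits(vram_gb=12, reserve_gb=2.5))
print(largest_quant_that_fits(vram_gb=4, reserve_gb=2.5))
```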

How the K and S designations affect quality

The quantization name encodes both precision level and which tensors get the extra bits. Q5_K_M means 5-bit K-quant, medium mix: it allocates higher precision to the most quality-sensitive tensors, averaging 5.70 bpw across the full model. Q5_K_S (small mix) keeps everything at 5 bits, averaging 5.57 bpw, and is about 120 MB smaller with slightly higher perplexity. The M variant almost always outperforms the S variant in quality-per-byte; the S variant is useful only if you’re genuinely constrained to the smaller size.

For the Q4 tier: Q4_K_M averages 4.89 bpw by storing some tensors at 6 bits. Q4_K_S averages 4.67 bpw and is about 240 MB smaller with marginally worse quality. The perplexity difference between Q4_K_S and Q4_K_M is tiny, which is why Q4_K_S sometimes appears in automated model serving pipelines where the bandwidth savings matter more than the quality edge.

Honest take

If you’re running an 8B model on a 12 GB card and spending any mental energy debating Q4 vs Q5, spend it elsewhere. Pick Q5_K_M if you have the headroom; Q4_K_M if you’re tight. The quality difference won’t be the bottleneck in your workflow.

Where quantization actually matters is the 70B case. The entire conversation changes when you’re working with a model that takes up 42.5 GB at Q4_K_M — at that point Q5 simply doesn’t fit on common consumer hardware, and the decision is made for you by arithmetic.

The one case where going to Q8_0 makes unambiguous sense: when you’re establishing a quality baseline for a fine-tuned model you’re evaluating. Q8_0 is as close to F16 behavior as you’ll get without running F16, and the 29% speed penalty is irrelevant for evaluation workloads. Run Q8_0 once to establish ground truth, then compare Q4_K_M against it — if outputs match, you’re done.
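One simple way to define "outputs match" is an exact-match rate over a fixed prompt set, sketched below. It assumes you have already collected the two lists of generations with deterministic decoding (for example, temperature 0), since sampled outputs differ from run to run regardless of quantization. The function name and the sample strings are illustrative.

```python
def agreement_rate(baseline_outputs, candidate_outputs):
    """Fraction of prompts where the candidate output equals the baseline.

    Outputs are compared after stripping surrounding whitespace.
    """
    if len(baseline_outputs) != len(candidate_outputs):
        raise ValueError("output lists must have the same length")
    if not baseline_outputs:
        raise ValueError("need at least one output to compare")
    matches = sum(
        b.strip() == c.strip()
        for b, c in zip(baseline_outputs, candidate_outputs)
    )
    return matches / len(baseline_outputs)


# Hypothetical generations for two prompts.
q8_outputs = ["4", "The capital is Paris."]
q4_outputs = ["4", "The capital is Paris. "]
print(agreement_rate(q8_outputs, q4_outputs))
```

Exact match is a strict criterion; for free-form text you may prefer to score both sets with your task metric and compare the scores instead.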

For inference serving at any scale, see the vLLM vs Ollama comparison: the quantization choice interacts with which server you use. vLLM has had native support for GGUF Q4_K_M since early 2026. Selecting the right model tier for your VRAM is upstream of the quantization decision. If you’re still learning how these formats differ fundamentally, Local LLM Quantization Explained covers GGUF, GPTQ, AWQ, and bitsandbytes from scratch.

Last updated May 21, 2026. Model file sizes vary by architecture; verify against the specific model’s Hugging Face page before purchasing hardware.
