Local LLM Quantization Explained: GGUF, GPTQ, AWQ, and Bitsandbytes Compared

Tags: local-llm · quantization · gguf · beginner

If you have ever opened the Files tab of a model on Hugging Face and wondered why there are fifteen different .gguf files with cryptic names like Q4_K_M, Q5_0, or IQ2_XS, this guide is for you. Quantization is the single most important concept for anyone running large language models on a consumer GPU — and the difference between picking the right format and the wrong one is often the difference between a model that runs at 30 tokens per second on your machine and one that does not run at all.

This is a practitioner-level walkthrough. No transformer math, no PyTorch deep dives — just what you need to choose a quant, predict its VRAM cost, and avoid the common mistakes.

What quantization actually is

A model’s “weights” are the billions of floating-point numbers it learned during training. By default these are stored as 16-bit floats (FP16) or sometimes 32-bit (FP32). A 7-billion parameter model in FP16 takes roughly 14 GB of VRAM just for the weights — before you add the KV cache for context, the activations, or anything else.

Quantization reduces the precision of those numbers. Instead of 16 bits per weight, you store each one in 8, 6, 5, 4, 3, or even 2 bits. The model gets dramatically smaller and faster, at the cost of a little accuracy.

The simple rule of thumb: VRAM needed ≈ parameters × bits / 8 bits per byte.

| Model size (params) | FP16 VRAM | 8-bit VRAM | 4-bit VRAM | 3-bit VRAM |
| --- | --- | --- | --- | --- |
| 1B | ~2 GB | ~1 GB | ~0.6 GB | ~0.5 GB |
| 3B | ~6 GB | ~3 GB | ~1.8 GB | ~1.4 GB |
| 7B | ~14 GB | ~7 GB | ~4 GB | ~3 GB |
| 8B | ~16 GB | ~8 GB | ~5 GB | ~3.5 GB |
| 13B | ~26 GB | ~13 GB | ~7.5 GB | ~5.5 GB |
| 30B | ~60 GB | ~30 GB | ~17 GB | ~13 GB |
| 70B | ~140 GB | ~70 GB | ~40 GB | ~30 GB |

These are weight-only numbers; in practice you need another 2–6 GB on top for the KV cache and activations during inference (more if you use long context).
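To make the rule of thumb concrete, here is a minimal Python sketch of the estimate. The 25% overhead default is a rough assumption, and note that real quants deviate a little from their nominal bit width (a Q4_K_M, for instance, averages closer to ~4.8 bits per weight), which is why the table's 4-bit numbers run slightly high of the naive formula:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.25) -> float:
    """Weight VRAM ~ params x bits / 8, plus a flat overhead factor for
    the KV cache and activations (the 25% default is a rough assumption)."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead)

# Q4_K_M averages ~4.8 bits per weight, not exactly 4:
print(f"{estimate_vram_gb(7, 4.8):.1f} GB")   # ~5.2 GB for a 7B Q4_K_M
print(f"{estimate_vram_gb(70, 4.8):.1f} GB")  # ~52.5 GB for a 70B Q4_K_M
```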

The four main formats

There are dozens of quantization schemes in research papers, but four cover almost every model you will actually download in 2026.

GGUF — the de facto standard for consumer hardware

GGUF (GPT-Generated Unified Format) is the format used by llama.cpp and everything built on top of it: Ollama, LM Studio, Jan.ai, KoboldCpp. If you are running models on a desktop, laptop, or even a phone, you are almost certainly using GGUF whether you know it or not.

What makes GGUF dominant:

  • Runs on CPU, GPU, or both — you can put part of the model on the GPU and let the rest spill onto system RAM, which is how you fit a 70B model on a 24 GB card without dropping to extreme 2-bit quants.
  • Apple Silicon support — Metal acceleration on M-series Macs is excellent.
  • Wide quantization support — every bit width from 2 to 8, plus mixed-precision “K-quants” and the newer “I-quants” (importance-based, smaller-but-smarter).
  • Single-file — the model, tokenizer, and metadata all live in one .gguf file.

The naming scheme looks intimidating but is actually consistent:

| Name | Meaning | When to use |
| --- | --- | --- |
| Q8_0 | 8-bit, basic | Quality very close to FP16; biggest GGUF you would normally download |
| Q6_K | 6-bit K-quant | Excellent quality; good for capable hardware |
| Q5_K_M | 5-bit K-quant, mixed | Strong quality–size balance |
| Q4_K_M | 4-bit K-quant, mixed | The sweet spot for most users |
| Q4_K_S | 4-bit K-quant, small | Slightly smaller, slightly worse than Q4_K_M |
| Q3_K_M | 3-bit K-quant, mixed | Noticeable quality drop, but viable on tight VRAM |
| IQ4_XS | 4-bit i-quant, extra small | Smaller and often higher quality than Q4_K_S |
| IQ2_XS | 2-bit i-quant | Last resort; expect quality issues |

If you only remember one thing: Q4_K_M is the default choice. It runs almost any model in roughly a third of the VRAM of FP16 with quality losses that are hard to spot in normal use.
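As a sketch of what using one looks like, here is a Q4_K_M loaded through llama-cpp-python, the Python bindings for llama.cpp (the file path is a placeholder; any GGUF works the same way):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU (set 0 for CPU-only)
    n_ctx=8192,       # context window; larger values grow the KV cache
)

out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```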

GPTQ — older, GPU-only, still around

GPTQ (Generative Pre-trained Transformer Quantization) is the format that popularized 4-bit LLMs back in 2023. It is post-training quantization, GPU-only, and shipped natively in the Hugging Face Transformers library through the auto-gptq and optimum packages.

GPTQ is not dead, but for inference it has been largely superseded:

  • Pro: Tight integration with the Python ML stack — load with transformers, fine-tune with PEFT, ship to vLLM or TGI for serving.
  • Pro: Often faster than GGUF on a single GPU for batched serving.
  • Con: GPU-only — no CPU offload, no Apple Silicon.
  • Con: Quality is generally a half-step behind AWQ and on par with GGUF Q4.
  • Con: Less actively maintained than GGUF or AWQ; many newer models do not get GPTQ conversions at all.

You will mostly see GPTQ when an engineer is integrating a model into a Python service. For running things on your desktop, GGUF is usually a better choice.
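That integration is the one real draw. A minimal sketch of loading a GPTQ checkpoint through Transformers, assuming a hypothetical repo id (GPTQ conversions usually carry a "-GPTQ" suffix):

```python
# pip install transformers optimum gptqmodel
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "someuser/Llama-3-8B-Instruct-GPTQ"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(repo)
# Transformers reads the quantization config stored in the repo and
# dispatches to the GPTQ kernels automatically.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```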

AWQ — activation-aware, generally higher quality

AWQ (Activation-aware Weight Quantization) is a smarter post-training approach: instead of quantizing every weight equally, it identifies the small subset of weights that matter most for the model’s actual outputs and protects them with higher precision.

What this means in practice:

  • Higher quality at the same bit width — AWQ 4-bit usually beats GPTQ 4-bit and is competitive with GGUF Q5_K_M on benchmarks.
  • GPU-only — no CPU fallback.
  • Best fit for serving — works with vLLM, SGLang, and TensorRT-LLM. If you are deploying an inference server with multiple users and want maximum throughput per dollar, AWQ is usually the format to convert to.

For a single user on a single GPU, the quality advantage over GGUF Q4_K_M is real but small. For a serving stack, AWQ is often worth the extra setup.
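Here is a minimal sketch of serving an AWQ model with vLLM (the repo id is hypothetical; AWQ conversions usually carry an "-AWQ" suffix):

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Hypothetical repo id; vLLM also auto-detects AWQ from the model config.
llm = LLM(model="someuser/Llama-3-8B-Instruct-AWQ", quantization="awq")

outputs = llm.generate(
    ["Explain activation-aware quantization in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```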

Bitsandbytes — easy mode for Transformers

bitsandbytes is the library that made 4-bit and 8-bit quantization a near one-line change in Hugging Face Transformers. You load the model with a 4-bit quantization config (historically just the load_in_4bit=True flag) and bitsandbytes quantizes the weights on the fly at load time, in memory.

  • Pro: Trivial to use. No conversion step, no separate file format.
  • Pro: Standard for QLoRA-style fine-tuning — most LoRA training pipelines run on bitsandbytes-quantized base models.
  • Con: Slower at inference than GGUF, GPTQ, or AWQ — bitsandbytes was built for training, not throughput.
  • Con: Higher memory overhead than dedicated formats.

If you are training (LoRA fine-tunes, QLoRA), you want bitsandbytes. If you are serving, you want one of the other three.
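A minimal sketch of the QLoRA-style load, using the NF4 settings most fine-tuning pipelines default to (the model id is just an example; any FP16 checkpoint can be loaded this way):

```python
# pip install transformers bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type QLoRA uses
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # example id; any FP16 checkpoint works
    quantization_config=bnb_config,
    device_map="auto",
)
```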

Picking a format: a five-second flowchart

What are you doing with the model?

  Running locally on consumer hardware (CPU, GPU, or Mac)
    → GGUF (use Q4_K_M unless you have a reason not to)

  Building a Python service or notebook with HF Transformers
    → AWQ if quality matters, GPTQ if AWQ is unavailable

  Fine-tuning with QLoRA
    → bitsandbytes (4-bit NF4)

  Serving production traffic with vLLM / TGI / SGLang
    → AWQ for most cases, FP8 if your GPU supports it natively

That covers maybe 95% of real workflows.

Common mistakes

Downloading a quant that does not fit your VRAM. The model card on Hugging Face usually lists exact file sizes. Add 20–30% on top for activation overhead and you will rarely be surprised. If you have 16 GB of VRAM, do not try to load a Q8_0 of a 13B model (~14 GB of weights before overhead) — pick Q5_K_M or smaller.

Using too aggressive a quant when you do not have to. Q2 and IQ2 quants exist for a reason — fitting a 70B model into 24 GB — but at those bit widths the model genuinely starts to lose track of facts and instructions. If a smaller, higher-quality model fits comfortably, use that instead.

Ignoring the K-quant suffix. Q4_0 (legacy) and Q4_K_M (modern K-quant) are both 4-bit, but the K-quant version is meaningfully better. There is almost never a reason to download a _0 quant in 2026.

Forgetting context. A model’s KV cache scales roughly as 2 × layer count × context length × KV hidden size × bytes per element (the factor of 2 covers keys and values; models with grouped-query attention keep a KV hidden size much smaller than the full hidden size). At 32 K context on a 70B model, the KV cache alone can run to 10 GB or more even with grouped-query attention, and far more without it. If you plan to use long contexts, leave headroom.
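A back-of-the-envelope sketch, assuming Llama-2-70B-like dimensions (80 layers, 64 attention heads of dimension 128, 8 KV heads) and an FP16 cache:

```python
def kv_cache_gb(layers: int, ctx: int, n_kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """2 (keys and values) x layers x tokens x KV hidden x element size."""
    return 2 * layers * ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

print(kv_cache_gb(80, 32_768, 8, 128))   # ~10.7 GB with grouped-query attention
print(kv_cache_gb(80, 32_768, 64, 128))  # ~85.9 GB if every head kept its own KV
```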

Where to find quants

  • Hugging Face: Search the model name plus “GGUF” or “AWQ” — community contributors like bartowski, mradermacher, and TheBloke (historically) maintain GGUF conversions of almost every popular model within hours of release.
  • Ollama Library: Built-in catalog at ollama.com/library — every model there is pre-quantized GGUF, mostly Q4_K_M by default.
  • LM Studio: Has a built-in browser that pulls from Hugging Face and shows compatibility badges based on your hardware.

Bottom line

For 90% of people running LLMs on a home GPU, the answer is: download the GGUF Q4_K_M of whatever model you want to run, in the largest size that fits in your VRAM with 3 GB of headroom for context. That single rule will get you good quality, fast inference, and broad compatibility with every tool worth using.

If you want to estimate what fits before you download, see our companion guide on how much VRAM you need for Llama models. And once you have a quantized model in hand, the next question is which runner to load it in — that is covered in our comparison of Ollama, LM Studio, llama.cpp, and Jan.ai.