Jun 14, 2026

Gemma 4 QAT for Local AI in 2026: How Google's June 5 Checkpoints Put the 26B in 15GB

By RunAIHome Team · 11 min read

gemma · google · qat · quantization · local-ai · llm · vram · ollama

On June 5, 2026, Google released Quantization-Aware Training (QAT) checkpoints for every Gemma 4 size. The practical result: the 26B-A4B model that needed roughly 17 GB of VRAM at standard Q4 — over the limit of a 16 GB card — now runs in about 15 GB with near-original quality. The headline figure Google quotes is a ~72% VRAM cut versus the BF16 baseline.

That changes the GPU recommendation we published on May 26. Back then, the Gemma 4 GPU guide had to warn 16 GB owners that the 26B MoE technically loaded but spilled KV cache to system RAM past ~1,500 tokens. The QAT checkpoints largely close that gap. They also introduce a trap: if you convert the checkpoints to GGUF yourself the wrong way, you lose most of the quality QAT was supposed to preserve.

TL;DR

QAT shrinks every Gemma 4 model by about 72% relative to BF16 with almost no quality loss, so the 26B-A4B finally fits a 16 GB card and the 31B fits 24 GB comfortably. The catch: don’t hand-convert the checkpoints. Use Unsloth’s pre-converted UD-Q4_K_XL GGUFs or Ollama’s -it-qat tags instead. vLLM users get compressed-tensors checkpoints for every size except the 26B MoE.

	16 GB GPU (e.g. 5060 Ti 16GB)	24 GB GPU (e.g. used RTX 3090)	Apple Silicon / unified
Best QAT fit	26B-A4B (~15 GB) or 12B (~7 GB)	31B (~18 GB) with room for context	26B/31B via MLX
What you gain	26B now stays on-GPU at full speed	31B at full 256K context headroom	Largest models on one box
The catch	Tight margin; cap context if doing long docs	Was already fine pre-QAT; now has slack	E4B has a known Triton speed bug

Honest take: If you own a 16 GB card and skipped the 26B MoE because of the VRAM overflow, the QAT checkpoint is the update that makes it actually usable — pull gemma4:26b-a4b-it-qat and stop fighting your KV cache.

What QAT actually does (and why it’s not just another Q4)

Standard quantization — Post-Training Quantization (PTQ) — takes a fully trained BF16 model and compresses the weights to 4-bit afterward. The model never “knew” it would be quantized, so rounding error accumulates and quality drops, sometimes by several points on reasoning benchmarks.

Quantization-Aware Training simulates the 4-bit rounding during training. The model learns to place its weights where quantization hurts least, so the final int4 checkpoint lands much closer to the BF16 original. Google applied its QAT recipe to the Q4_0 format for every Gemma 4 size, and the released checkpoints hold near-full-precision quality at a Q4 footprint.
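To make "simulating the rounding" concrete, here is a minimal sketch of the fake-quantization step that QAT-style training typically applies to weights in the forward pass. It assumes a simple symmetric absmax grid per block; Google's actual recipe and llama.cpp's exact Q4_0 layout are not reproduced here, and the function name is illustrative.

```python
def fake_quantize_block(weights, bits=4):
    """Snap a block of weights to a signed integer grid, then map back to floats.

    Assumption: symmetric absmax scaling, one scale per block. This is a
    simplification, not Google's QAT recipe or llama.cpp's Q4_0 kernel.
    """
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)        # nothing to quantize
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                       # value the model "sees"
    return out


block = [0.10, -0.32, 0.07, 0.35]
rounded = fake_quantize_block(block)
worst = max(abs(a - b) for a, b in zip(block, rounded))
print("rounded weights:", rounded)
print("worst rounding error:", worst)
```

During QAT the loss is computed on the rounded values, so training pressure moves weights toward positions where that rounding error matters less. PTQ applies the same rounding once, after training, with no chance to adapt.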

This is the same playbook Google ran for Gemma 3 in 2025, where QAT int4 dropped the 27B from 54 GB (BF16) to 14.1 GB, the 12B from 24 GB to 6.6 GB, and the 4B from 8 GB to 2.6 GB — all while staying within a few Elo points of the BF16 versions. Gemma 4 QAT extends that to the full current lineup, including the new 12B and the 26B-A4B MoE. If the general idea of quantization levels is new to you, our quantization explainer and the Q4 vs Q5 vs Q6 vs Q8 quality breakdown cover the fundamentals.
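As a quick sanity check on those Gemma 3 figures, the reduction for each size can be worked out directly from the BF16 and QAT numbers quoted above. The dictionary and function names below are just for illustration.

```python
# (BF16 GB, QAT int4 GB) for Gemma 3, as quoted in the paragraph above.
GEMMA3_SIZES_GB = {
    "27B": (54.0, 14.1),
    "12B": (24.0, 6.6),
    "4B": (8.0, 2.6),
}


def reduction_percent(bf16_gb, qat_gb):
    """Percentage of memory saved when going from BF16 to the QAT checkpoint."""
    return 100.0 * (1.0 - qat_gb / bf16_gb)


for name, (bf16_gb, qat_gb) in GEMMA3_SIZES_GB.items():
    print(f"{name}: {reduction_percent(bf16_gb, qat_gb):.1f}% smaller than BF16")
```

The three sizes come out at roughly 74%, 72.5%, and 67.5%, which brackets the ~72% figure quoted for Gemma 4.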

The new memory map

Here’s what each Gemma 4 QAT variant needs to run, per Google’s release notes and Unsloth’s GGUF sizing:

Model	Active params/token	QAT memory to run	Pre-QAT Q4 (for comparison)
E2B	~2.3B	~3 GB	~3 GB
E4B	~4.5B	~5 GB	~5 GB
12B Dense	12B	~7 GB	~10–12 GB
26B-A4B MoE	~4B (of 26B)	~15 GB	~15–17 GB
31B Dense	31B	~18 GB	~18–20 GB

Two things stand out. First, the E2B QAT checkpoint is about 1 GB on disk and runs in roughly 3 GB — small enough that Google is positioning it for phones and laptops with integrated graphics. Second, the 26B-A4B at ~15 GB is the entry that matters most for desktop home labs: it crosses back under the 16 GB line.

That’s the whole story for RTX 5060 Ti 16GB owners. The May guide measured the standard 26B at ~17 GB of real demand, 1 GB over the card’s capacity, which forced KV-cache overflow into system RAM as conversations grew. The QAT checkpoint lands at ~15 GB, leaving a small but workable margin for context on a 16 GB card. For long-document or multi-file code-review sessions you’ll still want to watch context length, but the cliff that hit the standard build at ~1,500 tokens is gone for normal chat and coding use.

For 24 GB cards, QAT is pure upside. The 31B Dense at ~18 GB leaves 6 GB for context and runtime buffers on a used RTX 3090 — the 256K context window becomes genuinely usable instead of theoretical. Used 3090 pricing has climbed this year on the GDDR7 shortage; our RTX 3090 value analysis and the 5060 Ti 16GB vs 3090 total-cost piece have the current numbers if you’re deciding between the two tiers.

The conversion trap: don’t roll your own GGUF

This is the part that’s tripped up early adopters, and it’s worth stating plainly: do not convert the Gemma 4 QAT Hugging Face checkpoints to GGUF yourself with a naive llama.cpp pass.

The reason is technical but concrete. The QAT checkpoints ship in BF16 with BF16 scales, while llama.cpp’s Q4_0 format stores F16 scales. Converting QAT BF16 to llama.cpp Q4_0 is therefore not lossless: the scale-format mismatch reintroduces exactly the kind of accuracy drop QAT was trained to avoid. People who ran the straightforward conversion reported measurable quality regressions, and the resulting file was larger than the optimized version.
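For readers who want to see where the two 16-bit formats actually differ, here is a small illustrative sketch. It is not llama.cpp's converter and does not measure the Gemma 4 checkpoints; it only shows that BF16 keeps a float32-sized exponent range while F16 does not, so some BF16 values cannot be stored exactly in F16. Which scale values real checkpoints contain is an assumption this sketch does not test, and the helper names are made up for the example.

```python
import struct


def to_bf16(x):
    """Truncate a float to BF16 precision (8 exponent bits, 7 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_f16(x):
    """Round a float to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


# Hypothetical scale values chosen to land in two different F16 regimes.
typical = to_bf16(0.0078125)  # 2**-7, inside F16's normal range
tiny = to_bf16(1e-6)          # below F16's smallest normal value (~6.1e-5)

for name, scale in (("typical", typical), ("tiny", tiny)):
    as_f16 = to_f16(scale)
    rel_err = abs(as_f16 - scale) / scale
    print(f"{name}: bf16={scale:.6e} f16={as_f16:.6e} rel_err={rel_err:.3%}")
```

Values inside F16's normal range survive the cast unchanged, while very small ones pick up rounding error. A converter that also re-derives scales from the weights, rather than reusing the trained ones, would be a separate source of drift that this sketch does not model.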

The fix is to use Unsloth’s pre-converted dynamic GGUFs. Unsloth ships a single recommended quant per model — UD-Q4_K_XL — that is both smaller and more accurate than a hand-rolled Q4_0 of the same checkpoint. The 26B-A4B UD-Q4_K_XL file is about 17 GB on disk and runs in ~15 GB. If you pull through Ollama, you skip the question entirely:

# QAT variants published in the Ollama library
ollama pull gemma4:e2b-it-qat
ollama pull gemma4:e4b-it-qat
ollama pull gemma4:12b-it-qat
ollama pull gemma4:26b-a4b-it-qat
ollama pull gemma4:31b-it-qat

Ollama handles the Unsloth-style quantization automatically for the -it-qat tags, so you don’t manage GGUF conversion at all. You do need an Ollama build with native Gemma 4 (gemma4) support — if ollama pull errors on an unknown model, update Ollama first. If you hit VRAM errors regardless of QAT, our CUDA out of memory fix guide covers the num_ctx and KV-cache settings that reclaim the most headroom.
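If you drive Ollama from a script rather than the CLI, the context cap can be set per request. The sketch below builds a call to Ollama's local /api/generate endpoint with a num_ctx option; the 8192 default is an arbitrary placeholder rather than a recommendation from Google or Ollama, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="gemma4:26b-a4b-it-qat", num_ctx=8192):
    """Build a non-streaming generate request with an explicit context cap.

    num_ctx=8192 is a placeholder; pick a value that fits your VRAM margin.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running and the model pulled, generate("Summarize QAT in one sentence.", num_ctx=4096) would return the reply as a string. Lowering num_ctx is the first lever to try when a 16 GB card is close to its limit.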

vLLM: compressed-tensors, with one gap

If you’re serving batched, multi-user inference rather than running a single chat session, vLLM is the better target — and Google released QAT in compressed-tensors format for it. These checkpoints use 4-bit integer weights with 16-bit activations (W4A16, group_size=32), tagged -w4a16-ct in Google’s Hugging Face namespace.
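The W4A16, group_size=32 layout also explains where a figure like ~72% can come from. As a rough sketch, assume each group of 32 weights stores 4 bits per weight plus one 16-bit scale, with no zero-point and no unquantized tensors (embeddings, norms) counted. Those are simplifying assumptions, not details from Google's release.

```python
def bits_per_weight(weight_bits=4, scale_bits=16, group_size=32):
    """Effective storage per weight once the per-group scale is amortized.

    Assumes one scale per group and no zero-point (a simplification).
    """
    return weight_bits + scale_bits / group_size


def reduction_vs_bf16(bpw, baseline_bits=16):
    """Fraction of memory saved relative to a 16-bit baseline."""
    return 1.0 - bpw / baseline_bits


bpw = bits_per_weight()
print(f"{bpw} bits/weight, {reduction_vs_bf16(bpw):.1%} smaller than BF16")
```

Under those assumptions the layout costs 4.5 bits per weight, a reduction of about 71.9% from 16-bit weights, which lines up with the headline number. Real checkpoints will deviate slightly because some tensors stay in higher precision.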

The one gap: the 26B-A4B MoE is not in the W4A16 QAT set. Its expert dimension (704) is small enough that 4-bit quantization causes excessive quality loss, so Google shipped compressed-tensors checkpoints for E2B, E4B, 12B, and 31B only. For the 26B MoE on vLLM you fall back to a higher-precision format or run the GGUF path through llama.cpp/Ollama instead. If you’re weighing the two serving engines, our vLLM vs Ollama breakdown explains when each one wins.

Speed: QAT doesn’t make it faster, it makes it fit

A common misconception is that QAT boosts tokens/sec. It mostly doesn’t — decode speed on these models is bound by memory bandwidth, and the active-weight footprint per token is similar to standard Q4. What QAT buys you is fit: keeping the model entirely on-GPU instead of spilling to system RAM, which is where the real slowdowns came from.

Measured numbers are still early and hardware-specific. On a 96 GB Blackwell GPU running the 26B-A4B under vLLM, output landed around 38 tok/s at small context, dropping to ~25 tok/s near 229K context. There’s also a known bug worth flagging: the E4B currently runs at only ~9 tok/s on an RTX 4090 because of a forced TRITON_ATTN attention fallback, versus 124 tok/s on Blackwell — so if you’re on a 40-series card and the E4B feels inexplicably slow, that’s the cause, not your setup. Expect that to be patched in a future vLLM/attention-backend release.

For single-GPU GGUF inference, the 26B MoE’s per-token compute still tracks its ~4B active parameters, so the strong tokens/sec the original Gemma 4 guide reported (~149 tok/s on a 4090 for the 26B MoE) remains the right ballpark — QAT just makes that achievable on a 16 GB card instead of requiring 24 GB.
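The bandwidth argument can be put into numbers with a back-of-the-envelope ceiling: tokens per second cannot exceed memory bandwidth divided by the weight bytes read per token. The sketch below assumes 4.5 bits per weight and the RTX 4090's 1,008 GB/s spec bandwidth, and it ignores KV-cache reads, routing, and compute, so it is an upper bound rather than a prediction.

```python
def decode_ceiling_tok_s(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every active weight is read once per token.

    Ignores KV-cache traffic, MoE routing, and compute (assumptions).
    """
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


RTX_4090_BANDWIDTH_GB_S = 1008.0  # vendor spec figure
ASSUMED_BITS_PER_WEIGHT = 4.5     # assumption: 4-bit weights plus per-group scales

moe_ceiling = decode_ceiling_tok_s(4.0, ASSUMED_BITS_PER_WEIGHT, RTX_4090_BANDWIDTH_GB_S)
dense_ceiling = decode_ceiling_tok_s(31.0, ASSUMED_BITS_PER_WEIGHT, RTX_4090_BANDWIDTH_GB_S)
print(f"26B-A4B MoE ceiling: {moe_ceiling:.0f} tok/s")
print(f"31B Dense ceiling:   {dense_ceiling:.0f} tok/s")
```

The inputs that set this ceiling are the same for a QAT checkpoint and a standard Q4 build of the same model, which is why QAT changes whether the model fits and not how fast it decodes once it does.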

Quality: what you keep

The point of QAT is that the compression is nearly free. Gemma 4’s quality numbers — 31B Dense at 87.1% MMLU and an Elo of 1452, 26B MoE at 82.7% MMLU and Elo 1441, both above Qwen 3.5 27B at 1403 — are the BF16 figures, and the QAT Q4 checkpoints hold near-full-precision quality against them. That’s the whole pitch: you’re not trading reasoning accuracy for the VRAM savings the way a naive Q4 or Q3 quant would force you to.

Which GPU gets which Gemma 4 QAT model

VRAM	Best QAT match	Notes
4 GB	E2B QAT (~3 GB)	Runs on integrated graphics; E2B disk image ~1 GB
6–8 GB	E4B (~5 GB) or 12B (~7 GB)	8 GB cards run the 12B QAT with short context
12 GB	12B Dense (~7 GB)	Comfortable; room for long context
16 GB	26B-A4B (~15 GB)	The headline upgrade — fits now; cap context for long docs
24 GB	31B Dense (~18 GB)	Full 256K context headroom; 26B if you prefer speed
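The table reduces to a simple lookup. Here is a minimal sketch that picks the largest QAT model for a given card using the approximate figures above; the 1 GB reserve for context and runtime buffers is a hypothetical default, not a measured requirement.

```python
# (model, approximate QAT memory in GB) from the table above, largest first.
QAT_MODELS_GB = [
    ("31B Dense", 18),
    ("26B-A4B", 15),
    ("12B Dense", 7),
    ("E4B", 5),
    ("E2B", 3),
]


def best_fit(vram_gb, reserve_gb=1):
    """Return (model, leftover GB) for the largest model that leaves reserve_gb free.

    reserve_gb is an assumed safety margin; returns None if nothing fits.
    """
    for name, need_gb in QAT_MODELS_GB:
        if need_gb + reserve_gb <= vram_gb:
            return name, vram_gb - need_gb
    return None


for card_gb in (4, 8, 12, 16, 24):
    print(card_gb, "GB ->", best_fit(card_gb))
```

Raising reserve_gb is a simple way to model long-context workloads: with reserve_gb=4, a 16 GB card drops back to the 12B.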

If you want to test the 31B before committing to a 24 GB card, an RTX 4090 or 5090 pod on RunPod runs the QAT GGUF for well under a dollar an hour. It’s the cheapest way to confirm whether the 31B’s quality edge over the 26B MoE justifies the hardware step for your workload before you buy. For the full decision math between renting and buying, see our RunPod vs local GPU breakdown.

FAQ

Does Gemma 4 QAT replace the regular Gemma 4 checkpoints?

No, both are available. The QAT checkpoints (the -it-qat Ollama tags and Unsloth’s -qat-GGUF repos) are the ones to use if you’re running 4-bit, because they preserve more quality at the same size. The standard BF16/FP16 weights still exist for fine-tuning or full-precision serving.

Why does the 26B fit a 16 GB card now when it didn’t before?

The standard Q4 build needed ~17 GB of real VRAM, just over a 16 GB card’s capacity, which forced KV-cache overflow to system RAM. The QAT UD-Q4_K_XL checkpoint runs in ~15 GB, leaving a workable margin so the model stays on-GPU for normal chat and coding sessions.

Can I just convert the Hugging Face QAT checkpoint to GGUF myself?

You can, but don’t use a naive llama.cpp Q4_0 pass: QAT uses BF16 scales while llama.cpp Q4_0 uses F16 scales, and the mismatch reintroduces accuracy loss (often producing a larger and worse file). Use Unsloth’s pre-converted UD-Q4_K_XL GGUFs or pull the -it-qat tags through Ollama.

Is the QAT 26B available on vLLM?

Not in W4A16 compressed-tensors form. Google shipped -w4a16-ct QAT checkpoints for E2B, E4B, 12B, and 31B; the 26B-A4B MoE was excluded because its small expert dimension (704) loses too much quality at 4-bit. Run the 26B MoE via Ollama/llama.cpp GGUF instead.

How much VRAM does each QAT model actually need?

Roughly: E2B ~3 GB, E4B ~5 GB, 12B ~7 GB, 26B-A4B ~15 GB, 31B ~18 GB. That is about a 72% reduction from the BF16 baselines, before context. Add headroom for KV cache at long context lengths.

Last updated June 14, 2026. Model performance and hardware prices change; verify current listings before purchasing.
