Google Gemma 4 for Local AI: Which Size Fits Your GPU? (2026 Guide)
Gemma 4 launched April 2, 2026, with four variants under Apache 2.0—Google’s first Gemma release without the custom license that enterprise legal teams had been flagging as a blocker. The headline number circulating through local AI communities within hours of launch: the 26B MoE generates approximately 149 tokens per second on an RTX 4090.
That’s not a typo. The 26B “A4B” model uses Mixture-of-Experts routing that activates only about 4 billion of its 26 billion parameters per token. You get near-26B reasoning quality at something close to 4B-class compute. What that means for GPU selection is non-obvious, and that’s what this piece works through.
The four variants decoded
Google structured Gemma 4 as two size tiers, each with two architectures. The naming confused people initially, so here’s what each label actually means:
| Model | Architecture | Active params/token | Context window | Multimodal |
|---|---|---|---|---|
| E2B | Dense + PLE | ~2.3B | 128K | Vision + Audio |
| E4B | Dense + PLE | ~4.5B | 128K | Vision + Audio |
| 26B A4B | Mixture-of-Experts | ~4B (of 26B total) | 256K | Vision only |
| 31B Dense | Dense | 31B | 256K | Vision only |
“E” stands for Efficient, not Enterprise. The E2B and E4B use Per-Layer Embeddings (PLE)—a technique that packs more parameter capacity into less active computation than a standard dense architecture. They’re edge-optimized and designed for tight memory budgets. Both also support native audio input (automatic speech recognition and speech-to-translated-text), a capability neither the 26B nor 31B has.
“26B A4B” means 26 billion total parameters with approximately 4 billion active per forward pass via MoE routing. The entire 26B weight file must still load into VRAM—that’s the catch discussed in detail below—but per-token compute tracks with a 4B model, which is why the speed numbers look anomalous.
The two large models (26B and 31B) both have 256K context windows. The E-series models cap at 128K.
VRAM requirements at each quantization level
The Q4_K_M GGUF for the 26B MoE is approximately 14 GB on disk. Runtime VRAM consumption is higher: KV cache, activations, and runtime buffers add 1–3 GB at short context, pushing total usage to 15–17 GB. At 32K+ context the KV cache adds more, reaching 18–22 GB total; at the model’s maximum 256K context it’s impractical on consumer VRAM without extensive offloading.
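The arithmetic in that paragraph can be sketched in a few lines of Python. This is a rough budgeting aid, not a measurement: the function names are hypothetical, the 14 GB file size and 1–3 GB overhead come from the figures above, and the "borderline" label is an assumption about how to read a card that lands inside the estimated range.

```python
from typing import Tuple


def runtime_vram_gb(weights_gb: float, overhead_gb: Tuple[float, float]) -> Tuple[float, float]:
    """Return a (low, high) VRAM estimate: weight file plus runtime overhead."""
    low, high = overhead_gb
    return (weights_gb + low, weights_gb + high)


def fits(card_vram_gb: float, estimate: Tuple[float, float]) -> str:
    """Classify a card against the estimated (low, high) range."""
    low, high = estimate
    if card_vram_gb >= high:
        return "fits"
    if card_vram_gb >= low:
        return "borderline"  # assumption: inside the range means context-limited
    return "does not fit"


# Figures from the text: ~14 GB Q4_K_M file, 1-3 GB overhead at short context.
short_ctx = runtime_vram_gb(14.0, (1.0, 3.0))

for card in (12, 16, 24):
    print(f"{card} GB card: {fits(card, short_ctx)} "
          f"(estimate {short_ctx[0]:.0f}-{short_ctx[1]:.0f} GB)")
```

A 16 GB card lands inside the 15–17 GB range, which is why it loads the model but has little room left for a growing KV cache.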
| Model | Q4_K_M VRAM | Q8_0 VRAM | FP16 VRAM |
|---|---|---|---|
| E2B | ~3 GB | ~5 GB | ~10 GB |
| E4B | ~5 GB | ~8 GB | ~18 GB |
| 26B A4B MoE | ~15–17 GB | ~28 GB | ~55 GB |
| 31B Dense | ~18–20 GB | ~32 GB | ~62 GB |
Q4_K_M is the default Ollama quantization for both large models and the practical floor for usable quality on creative and coding tasks. Dropping the 26B MoE to Q3_K_M reduces the GGUF to ~12 GB (workable on 12–16 GB cards) but introduces measurable degradation on structured reasoning. Q8 for the large models requires 28–32 GB of VRAM or splits uncomfortably between VRAM and system RAM.
Inference speed: the MoE advantage, and its ceiling
The RTX 4090’s memory bandwidth is 1,008 GB/s. For the 26B MoE at Q4_K_M, the per-token active weight window is approximately 4B × 0.5 bytes = ~2 GB—the MoE routing fetches only the active expert fraction of the full model per token. Result: approximately 149 tokens per second in benchmarks on that hardware.
For the 31B Dense, all 31 billion parameters activate per forward pass. At Q4_K_M (~18 GB loaded), the full weight file sweeps through the memory bus per token. Theoretical ceiling ≈ 1,008 GB/s ÷ 18 GB ≈ 56 tok/s; real-world lands at approximately 28–35 tok/s with short context. When 128K+ context forces KV cache overflow to system RAM (DDR5 quad-channel ~50–60 GB/s), that drops to ~7–8 tok/s—a number that appeared in early benchmarks when testers hit the maximum context window.
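The bandwidth ceiling used above is a single division, sketched here in Python. The function name is hypothetical, and the model is deliberately idealized: it assumes generation is limited only by reading weights from VRAM once per token and ignores everything else.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Idealized tokens/sec if each token requires one pass over the given weights."""
    return bandwidth_gb_s / gb_read_per_token


RTX_4090_BANDWIDTH_GB_S = 1008.0  # figure quoted in the text

# 31B Dense at Q4_K_M: the whole ~18 GB weight file is read per token.
dense_ceiling = bandwidth_ceiling_tok_s(RTX_4090_BANDWIDTH_GB_S, 18.0)

# 26B MoE at Q4_K_M: ~4B active params x ~0.5 bytes each = ~2 GB per token.
moe_active_gb = 4e9 * 0.5 / 1e9
moe_ceiling = bandwidth_ceiling_tok_s(RTX_4090_BANDWIDTH_GB_S, moe_active_gb)

print(f"31B Dense ceiling: {dense_ceiling:.0f} tok/s")
print(f"26B MoE ceiling:   {moe_ceiling:.0f} tok/s")
```

The reported ~149 tok/s for the MoE sits well below its idealized ceiling, so treat these numbers as upper bounds for comparing models rather than predictions.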
| GPU | VRAM | 26B MoE Q4 tok/s | 31B Dense Q4 short ctx tok/s |
|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | 40–50 (context limited) | — |
| RTX 5070 Ti 16GB | 16 GB | ~70 (context limited) | — |
| RTX 3090 24GB | 24 GB | 64–119 | ~26–30 |
| RTX 4090 24GB | 24 GB | ~149 | ~28–35 |
The “context limited” entries are not a small-print caveat—they’re the reason the 16 GB decision is genuinely complicated. The 31B entries are blank for 16 GB cards because the model simply doesn’t fit at Q4; partial RAM offloading drops generation to CPU bandwidth speeds (~5–10 tok/s), not practical for interactive use.
The 16 GB trap: what actually happens on an RTX 5060 Ti or 5070 Ti
If you own an RTX 5060 Ti 16GB, RTX 5070 Ti 16GB, or RTX 4060 Ti 16GB, ollama pull gemma4:26b will succeed and inference will start. The problem emerges as conversations lengthen.
gpuforllm.com measures the 26B A4B’s actual VRAM demand at approximately 17 GB—1 GB over the physical limit. Ollama compensates by adjusting batch sizes and offloading KV cache overflow to system RAM once context accumulates. For very short sessions (under ~1,500 tokens of total context), you may never notice. For document analysis, extended research sessions, or code reviews that span multiple files, generation speed will drop toward single digits as the session grows.
Practical options for 16 GB owners:
- Create a custom Modelfile with PARAMETER num_ctx 2048 to cap context and keep everything on-GPU at 40–70 tok/s
- Use Q3_K_M (~12 GB) if a pre-quantized version is available, with some quality trade-off
- Fall back to E4B Q8 (~8 GB) for full 128K context without VRAM pressure
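As a minimal sketch of the first option, the Python below builds the two-line Modelfile text. The helper name is hypothetical, and the gemma4:26b tag is taken from this article rather than verified against the Ollama library.

```python
def build_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that caps the context window of a base model."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


# Tag as written in this article (assumption: it matches the Ollama library).
modelfile_text = build_modelfile("gemma4:26b", 2048)
print(modelfile_text)
```

Save the printed text as a file named Modelfile, then register it with `ollama create <name> -f Modelfile`, where the name is whatever you want to call the capped variant.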
Running the 26B MoE with real context headroom requires 24 GB, and as noted above, the full 256K window is still impractical on consumer VRAM without extensive offloading. For hardware comparisons at the 16 GB tier, including bandwidth and real inference speed differences between the RTX 5060 Ti and 5070 Ti, see our RTX 5070 Ti vs RTX 5080 breakdown.
If you’re considering 24 GB, the used RTX 3090 currently runs $895–$1,200 on eBay (May 2026 completed listings). For the full 3-year cost math on that decision, including electricity and residual value, see our RTX 3090 value analysis.
Quality benchmarks
Gemma 4’s 31B Dense tops the open-weight under-70B category on math, reasoning, and code as of April 2026:
| Model | MMLU | AIME 2026 | LiveCodeBench v6 |
|---|---|---|---|
| Gemma 4 31B Dense | 85.2% | 89.2% | 80.0% |
| Gemma 4 26B MoE | 82.6% | — | — |
| Llama 3.3 70B | 86.0% | — | — |
The gap between 26B MoE and 31B Dense is approximately 2.6 percentage points on MMLU. That gap matters at the extreme edge of competitive math and multi-hop legal reasoning. For typical chat, coding assistance, and summarization it’s invisible—and the 26B MoE’s 4–5× speed advantage on identical hardware is not.
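For readers who want to check the trade-off, here is the same comparison as plain arithmetic in Python, using only the RTX 4090 throughput and MMLU figures quoted in this article. The variable names are illustrative.

```python
# Throughput on an RTX 4090 at Q4, from the speed table in this article.
moe_tok_s = 149.0
dense_tok_s_range = (28.0, 35.0)

# Speedup range: compare against the fastest and slowest dense figures.
speedup_low = moe_tok_s / dense_tok_s_range[1]
speedup_high = moe_tok_s / dense_tok_s_range[0]

# MMLU gap in percentage points (31B Dense minus 26B MoE).
mmlu_gap = 85.2 - 82.6

print(f"MoE speedup: {speedup_low:.1f}x to {speedup_high:.1f}x")
print(f"MMLU gap: {mmlu_gap:.1f} points")
```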
The E4B at 42.5% on AIME 2026 is worth noting for a sub-5B model. Previous-generation Gemma 3 4B couldn’t approach that on hard math. The PLE architecture genuinely extracts more reasoning per active parameter than standard dense models at that size class.
For coding specifically: LiveCodeBench v6 tracks real competitive programming problems rather than synthetic code completion. The 31B at 80.0% is competitive with larger closed-weight models. If you’re evaluating local AI for development workflows, aicoderscope.com covers the full AI coding tool landscape including local model comparisons.
The Apache 2.0 detail that matters
Gemma 1, 2, and 3 all launched under a custom Google license with enough commercial-use ambiguity that enterprise legal teams routinely blocked deployments. Gemma 4 ships under Apache 2.0—the same terms as Qwen3 and most Mistral releases. No enterprise carve-outs, no revenue thresholds, no special redistribution clauses. You can deploy it in a commercial product, fine-tune and redistribute it, and build SaaS features on top of it without legal review flags.
For personal or research use this changes nothing. For anything customer-facing or distributed as a commercial service, it’s a genuine unlock that Gemma 3 didn’t offer.
The privacy story is clean by default: weights run entirely on local hardware with no outbound connections required. See the local AI privacy audit for a full telemetry breakdown of Ollama and related tools.
Which GPU gets which Gemma 4 model
| VRAM | Best Gemma 4 match | Notes |
|---|---|---|
| 4 GB | E2B Q4_K_M | ~3 GB fit; runs on integrated graphics |
| 6–8 GB | E4B Q4_K_M | Comfortable; 8 GB cards can run E4B Q8 |
| 12 GB | E4B Q8 or 26B Q3 | 26B at Q3 ~12 GB; noticeable quality trade-off |
| 16 GB | 26B MoE Q4 | Cap context or expect slowdown past ~1,500 tokens |
| 24 GB | 26B MoE Q4 (recommended) or 31B Q4 | Full context; 26B is faster, 31B benchmarks higher |
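The table can also be read as a lookup, sketched below in Python. Treating each row as a minimum-VRAM threshold is an assumption made for illustration (the table lists specific sizes, not ranges), and the names are hypothetical.

```python
# (minimum VRAM in GB, recommendation) pairs, largest first.
# Thresholds are an assumption: each table row is read as "at least this much".
RECOMMENDATIONS = [
    (24, "26B MoE Q4 (recommended) or 31B Q4"),
    (16, "26B MoE Q4"),
    (12, "E4B Q8 or 26B Q3"),
    (6, "E4B Q4_K_M"),
    (4, "E2B Q4_K_M"),
]


def recommend(vram_gb: float) -> str:
    """Return the first table row whose threshold the card meets."""
    for min_vram_gb, model in RECOMMENDATIONS:
        if vram_gb >= min_vram_gb:
            return model
    return "no match in the table"


for card in (4, 8, 12, 16, 24):
    print(f"{card} GB -> {recommend(card)}")
```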
If you want to run the 31B Dense without committing to 24 GB hardware, an RTX 4090 Community pod on RunPod runs it at approximately $0.34/hr (May 2026 pricing). At 28–35 tok/s for the 31B, it’s a practical way to evaluate whether the quality difference over the 26B MoE justifies a hardware upgrade.
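To put the rental price in context, the sketch below converts it into cost per million generated tokens and into hours of rental that would equal a used RTX 3090's purchase price. The function names are hypothetical, and the comparison is deliberately crude: it assumes continuous generation at the quoted speeds, compares a rented 4090 against a purchased 3090, and ignores electricity and resale value.

```python
def cost_per_million_tokens(hourly_usd: float, tok_per_s: float) -> float:
    """USD per one million generated tokens at a steady generation rate."""
    tokens_per_hour = tok_per_s * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


def breakeven_hours(card_price_usd: float, hourly_usd: float) -> float:
    """Rental hours whose total cost equals the card's purchase price."""
    return card_price_usd / hourly_usd


RENTAL_USD_PER_HOUR = 0.34  # RunPod figure quoted in the text

for tok_s in (28.0, 35.0):  # 31B Dense speed range from the text
    cost = cost_per_million_tokens(RENTAL_USD_PER_HOUR, tok_s)
    print(f"{tok_s:.0f} tok/s -> ${cost:.2f} per million tokens")

for price in (895.0, 1200.0):  # used RTX 3090 price range from the text
    hours = breakeven_hours(price, RENTAL_USD_PER_HOUR)
    print(f"${price:.0f} card -> {hours:.0f} rental hours to break even")
```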
Honest take
The 26B MoE is the model to run in 2026 if you have 24 GB VRAM. At ~149 tok/s on a 4090, latency is low enough that context depth matters more than raw speed—and 256K context with 82.6% MMLU puts it in the same tier as Llama 3.3 70B at a fraction of the compute cost. The 31B Dense earns its keep only if your workload regularly hits hard math or complex multi-hop reasoning where that 2.6-point MMLU gap surfaces. Most users won’t encounter it in daily use.
For 16 GB GPU owners: the 26B MoE technically runs and is worth trying. If your sessions stay under 1,500 tokens, you’ll see full 40–70 tok/s with no observable degradation. If you’re doing document analysis or sustained research sessions, E4B Q8 delivers a better experience than the 26B MoE fighting its VRAM ceiling with a 3,000-token context.
The E4B is genuinely good at 8 GB. AIME 42.5% from a sub-5B model doesn’t lie. It’s not a replacement for the 26B on complex tasks, but it’s the first Gemma small model where the quality feels like it came from a real language model rather than an aggressively compressed one.
Frequently Asked Questions
Does the Gemma 4 26B MoE actually fit on a 16 GB GPU?
Technically yes, but the model demands approximately 17 GB of VRAM at full capacity. On 16 GB cards it loads and responds, but conversations beyond ~1,500 tokens push KV cache overflow into system RAM and generation speed drops from 40–70 tok/s to single digits. Set PARAMETER num_ctx 2048 in an Ollama Modelfile to cap context and maintain full GPU speed.
What’s the real performance difference between the 26B MoE and the 31B Dense?
On an RTX 4090: ~149 tok/s for the 26B MoE vs ~28–35 tok/s for the 31B Dense, a 4–5× speed difference because MoE routing activates only ~4B parameters per token rather than the full 31B. The 31B scores approximately 2.6 percentage points higher on MMLU. For most chat and coding tasks the MoE’s speed advantage matters more than the Dense model’s benchmark edge.
Is Gemma 4 free for commercial use?
Yes. Gemma 4 uses the Apache 2.0 license, which permits commercial deployment, fine-tuning, redistribution, and derivative models with no revenue thresholds or enterprise carve-outs. Previous Gemma releases used a custom Google license with commercial-use restrictions that many legal teams blocked.
Do E2B and E4B process audio?
Yes. Both E-series models natively accept text, images, and audio input within a 128K context window, including speech recognition and speech-to-translated-text across multiple languages. The 26B and 31B accept text and images only (no audio), with a 256K context window.
How do I pull Gemma 4 with Ollama?
- ollama pull gemma4:2b (E2B)
- ollama pull gemma4:4b (E4B)
- ollama pull gemma4:26b (26B MoE, default Q4_K_M)
- ollama pull gemma4:31b (31B Dense, default Q4_K_M)
The 26B download is approximately 14 GB; allow 15–30 minutes on a typical home connection.
Sources
- Gemma 4: Byte for byte, the most capable open models — Google Blog
- Google Launches Gemma 4 Open AI Models Under Apache 2.0 License — MLQ.ai
- Gemma 4 Benchmarks: MMLU 85.2%, AIME 89.2% — Gemma4.online
- Benchmarking Google Gemma 4 26B and 31B Locally — n1n.ai
- Can I Run Gemma 4 26B A4B on RTX 5060 Ti 16GB? VRAM Required: 17.0 GB — gpuforllm.com
- Gemma 4 26B in the Ollama Library — Ollama
- unsloth/gemma-4-26B-A4B-it-GGUF File Sizes — Hugging Face
- Gemma 4, Phi-4, and Qwen3: Accuracy–Efficiency Tradeoffs — arXiv 2604.07035
- E2B? E4B? 26B A4B? The Gemma 4 Model Names Explained — DEV Community
- Gemma 4 on RTX 3090 vs 4090 vs 5090 vs Mac Benchmarks — YouTube
- What Is Gemma 4? Native Audio and Vision Under Apache 2.0 — MindStudio
- Google Releases Gemma 4 in Four Model Sizes Under Apache 2.0 — gHacks Tech News
Last updated May 26, 2026. Model performance and hardware prices change; verify current listings before purchasing.