Google Gemma 4 for Local AI: Which Size Fits Your GPU? (2026 Guide)


Gemma 4 launched April 2, 2026, with four variants under Apache 2.0—Google’s first Gemma release without the custom license that enterprise legal teams had been flagging as a blocker. The headline number circulating through local AI communities within hours of launch: the 26B MoE generates approximately 149 tokens per second on an RTX 4090.

That’s not a typo. The 26B “A4B” model uses Mixture-of-Experts routing that activates only about 4 billion of its 26 billion parameters per token. You get near-26B reasoning quality at close to 4B-class compute. What that means for GPU selection is not obvious, and that is what this piece works through.

The four variants decoded

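Google structured Gemma 4 as two tiers with two models each: an edge tier (E2B and E4B, both Dense + PLE) and a large tier (the 26B A4B Mixture-of-Experts and the 31B Dense). The naming confused people initially, so here’s what each label actually means: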

| Model | Architecture | Active params/token | Context window | Multimodal |
| --- | --- | --- | --- | --- |
| E2B | Dense + PLE | ~2.3B | 128K | Vision + Audio |
| E4B | Dense + PLE | ~4.5B | 128K | Vision + Audio |
| 26B A4B | Mixture-of-Experts | ~4B (of 26B total) | 256K | Vision only |
| 31B Dense | Dense | 31B | 256K | Vision only |

“E” stands for Efficient, not Enterprise. The E2B and E4B use Per-Layer Embeddings (PLE)—a technique that packs more parameter capacity into less active computation than a standard dense architecture. They’re edge-optimized and designed for tight memory budgets. Both also support native audio input (automatic speech recognition and speech-to-translated-text), a capability neither the 26B nor 31B has.

“26B A4B” means 26 billion total parameters with approximately 4 billion active per forward pass via MoE routing. The entire 26B weight file must still load into VRAM—that’s the catch discussed in detail below—but per-token compute tracks with a 4B model, which is why the speed numbers look anomalous.

The two large models (26B and 31B) both have 256K context windows. The E-series cap at 128K.

VRAM requirements at each quantization level

The Q4_K_M GGUF for the 26B MoE is approximately 14 GB on disk. Runtime VRAM consumption is higher: KV cache, activations, and runtime buffers add 1–3 GB at short context, pushing total usage to 15–17 GB. At 32K+ context the KV cache adds more, reaching 18–22 GB total; at the model’s maximum 256K context it’s impractical on consumer VRAM without extensive offloading.
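The budget above is simple addition, and a short sketch makes the 16 GB versus 24 GB boundary easier to see. This uses only the figures quoted in this section; the function names `total_vram_range` and `fits` are illustrative helpers, not part of any tool.

```python
# Rough VRAM budget for the 26B MoE at Q4_K_M, using only this article's figures.
GGUF_SIZE_GB = 14.0                  # Q4_K_M file size on disk
SHORT_CTX_OVERHEAD_GB = (1.0, 3.0)   # KV cache + activations + buffers, short context
LONG_CTX_TOTAL_GB = (18.0, 22.0)     # quoted total at 32K+ context


def total_vram_range(weights_gb, overhead_gb):
    """Return (low, high) total VRAM in GB for a given overhead range."""
    low, high = overhead_gb
    return weights_gb + low, weights_gb + high


def fits(total_gb_range, card_vram_gb):
    """'yes' if even the high estimate fits, 'no' if even the low one does not."""
    low, high = total_gb_range
    if high <= card_vram_gb:
        return "yes"
    if low > card_vram_gb:
        return "no"
    return "borderline"


short_ctx = total_vram_range(GGUF_SIZE_GB, SHORT_CTX_OVERHEAD_GB)
for card in (16, 24):
    print(f"{card} GB card, short context: {fits(short_ctx, card)}")
    print(f"{card} GB card, 32K+ context:  {fits(LONG_CTX_TOTAL_GB, card)}")
```

The "borderline" result for a 16 GB card at short context is the interesting one: the low estimate fits and the high estimate does not, so whether a given session stays on-GPU depends on how much context it accumulates.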

| Model | Q4_K_M VRAM | Q8_0 VRAM | FP16 VRAM |
| --- | --- | --- | --- |
| E2B | ~3 GB | ~5 GB | ~10 GB |
| E4B | ~5 GB | ~8 GB | ~18 GB |
| 26B A4B MoE | ~15–17 GB | ~28 GB | ~55 GB |
| 31B Dense | ~18–20 GB | ~32 GB | ~62 GB |

Q4_K_M is the default Ollama quantization for both large models and the practical floor for usable quality on creative and coding tasks. Dropping the 26B MoE to Q3_K_M reduces the GGUF to ~12 GB (workable on 12–16 GB cards) but introduces measurable degradation on structured reasoning. Q8 for the large models requires 28–32 GB of VRAM or splits uncomfortably between VRAM and system RAM.

Inference speed: the MoE advantage, and its ceiling

The RTX 4090’s memory bandwidth is 1,008 GB/s. For the 26B MoE at Q4_K_M, the per-token active weight window is approximately 4B × 0.5 bytes = ~2 GB—the MoE routing fetches only the active expert fraction of the full model per token. Result: approximately 149 tokens per second in benchmarks on that hardware.

For the 31B Dense, all 31 billion parameters activate per forward pass. At Q4_K_M (~18 GB loaded), the full weight file sweeps through the memory bus per token. Theoretical ceiling ≈ 1,008 GB/s ÷ 18 GB ≈ 56 tok/s; real-world lands at approximately 28–35 tok/s with short context. When 128K+ context forces KV cache overflow to system RAM (DDR5 quad-channel ~50–60 GB/s), that drops to ~7–8 tok/s—a number that appeared in early benchmarks when testers hit the maximum context window.
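The bandwidth arithmetic in the two paragraphs above can be written out as a small sketch. It treats generation as purely memory-bound, so the result is an upper bound: it ignores KV cache reads, attention compute, and routing overhead, which is one reason measured numbers land well below it. The helper name `bandwidth_ceiling_tok_s` is illustrative.

```python
# Memory-bandwidth ceiling on tokens/second, using this article's figures.
RTX_4090_BANDWIDTH_GB_S = 1008.0
BYTES_PER_PARAM_Q4 = 0.5   # the article's approximation for Q4_K_M


def bandwidth_ceiling_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on tok/s if each token requires reading this many GB of weights."""
    return bandwidth_gb_s / gb_read_per_token


# 26B MoE: only ~4B active parameters are fetched per token.
moe_active_gb = 4e9 * BYTES_PER_PARAM_Q4 / 1e9
moe_ceiling = bandwidth_ceiling_tok_s(RTX_4090_BANDWIDTH_GB_S, moe_active_gb)

# 31B Dense: the full ~18 GB Q4_K_M weight file is read per token.
dense_gb = 18.0
dense_ceiling = bandwidth_ceiling_tok_s(RTX_4090_BANDWIDTH_GB_S, dense_gb)

print(f"MoE:   {moe_active_gb:.1f} GB/token, ceiling ~{moe_ceiling:.0f} tok/s, "
      f"measured ~149 ({149 / moe_ceiling:.0%} of ceiling)")
print(f"Dense: {dense_gb:.1f} GB/token, ceiling ~{dense_ceiling:.0f} tok/s, "
      f"measured 28-35 ({28 / dense_ceiling:.0%}-{35 / dense_ceiling:.0%} of ceiling)")
```

Under these assumptions the MoE's ceiling is roughly nine times the dense model's, while the measured gap is closer to 4–5×. The fixed per-token costs that the ceiling ignores weigh more heavily when the weight read itself is small.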

| GPU | VRAM | 26B MoE Q4 tok/s | 31B Dense Q4 short ctx tok/s |
| --- | --- | --- | --- |
| RTX 5060 Ti 16GB | 16 GB | 40–50 (context limited) | |
| RTX 5070 Ti 16GB | 16 GB | ~70 (context limited) | |
| RTX 3090 24GB | 24 GB | 64–119 | ~26–30 |
| RTX 4090 24GB | 24 GB | ~149 | ~28–35 |

The “context limited” entries are not a small-print caveat—they’re the reason the 16 GB decision is genuinely complicated. The 31B entries are blank for 16 GB cards because the model simply doesn’t fit at Q4; partial RAM offloading drops generation to CPU bandwidth speeds (~5–10 tok/s), not practical for interactive use.

The 16 GB trap: what actually happens on an RTX 5060 Ti or 5070 Ti

If you own an RTX 5060 Ti 16GB, RTX 5070 Ti 16GB, or RTX 4060 Ti 16GB, ollama pull gemma4:26b will succeed and inference will start. The problem emerges as conversations lengthen.

gpuforllm.com measures the 26B A4B’s actual VRAM demand at approximately 17 GB—1 GB over the physical limit. Ollama compensates by adjusting batch sizes and offloading KV cache overflow to system RAM once context accumulates. For very short sessions (under ~1,500 tokens of total context), you may never notice. For document analysis, extended research sessions, or code reviews that span multiple files, generation speed will drop toward single digits as the session grows.

Practical options for 16 GB owners:

  • Create a custom Modelfile with PARAMETER num_ctx 2048 to cap context and keep everything on-GPU at 40–70 tok/s
  • Use Q3_K_M (~12 GB) if a pre-quantized version is available, with some quality trade-off
  • Fall back to E4B Q8 (~8 GB) for full 128K context without VRAM pressure
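For the first option, the Modelfile is only two lines. The sketch below generates it from Python so the context cap is easy to vary; `build_modelfile` is an illustrative helper, and the derived model name `gemma4-26b-ctx2k` in the comments is a hypothetical label, not an official tag.

```python
def build_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Ollama Modelfile text that caps the context window of a base model."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


# Base tag and context cap as given in this article.
modelfile_text = build_modelfile("gemma4:26b", 2048)
print(modelfile_text)

# Save the printed text as a file named "Modelfile", then in a shell:
#   ollama create gemma4-26b-ctx2k -f Modelfile   # hypothetical model name
#   ollama run gemma4-26b-ctx2k
```

Raising `num_ctx` trades speed for context: each increase grows the KV cache, so on a 16 GB card it is worth stepping up gradually and watching for the point where generation slows.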

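The 26B MoE with real long-context headroom (32K and beyond) requires 24 GB; as the VRAM section notes, the full 256K window is still impractical on consumer cards without extensive offloading. For hardware comparisons at the 16 GB tier, including bandwidth and real inference speed differences between the RTX 5060 Ti and 5070 Ti, see our RTX 5070 Ti vs RTX 5080 breakdown.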

If you’re considering 24 GB, the used RTX 3090 currently runs $895–$1,200 on eBay (May 2026 completed listings). For the full 3-year cost math on that decision, including electricity and residual value, see our RTX 3090 value analysis.

Quality benchmarks

Gemma 4’s 31B Dense tops the open-weight under-70B category on math, reasoning, and code as of April 2026:

| Model | MMLU | AIME 2026 | LiveCodeBench v6 |
| --- | --- | --- | --- |
| Gemma 4 31B Dense | 85.2% | 89.2% | 80.0% |
| Gemma 4 26B MoE | 82.6% | | |
| Llama 3.3 70B | 86.0% | | |

The gap between 26B MoE and 31B Dense is approximately 2.6 percentage points on MMLU. That gap matters at the extreme edge of competitive math and multi-hop legal reasoning. For typical chat, coding assistance, and summarization it’s invisible—and the 26B MoE’s 4–5× speed advantage on identical hardware is not.
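To put the speed and quality figures side by side, here is a minimal sketch using the RTX 4090 numbers quoted in this article. The 1,000-token reply length is an arbitrary illustration, not a benchmark setting.

```python
# Speed vs. MMLU trade-off on an RTX 4090, using this article's figures.
MOE_TOK_S = 149.0
DENSE_TOK_S = (28.0, 35.0)                      # low and high measured values
MMLU = {"31B Dense": 85.2, "26B MoE": 82.6}

speedup_low = MOE_TOK_S / DENSE_TOK_S[1]        # against the dense model's best case
speedup_high = MOE_TOK_S / DENSE_TOK_S[0]       # against the dense model's worst case
mmlu_gap = round(MMLU["31B Dense"] - MMLU["26B MoE"], 1)

REPLY_TOKENS = 1000                             # illustrative reply length (assumption)
moe_seconds = REPLY_TOKENS / MOE_TOK_S
dense_seconds = tuple(REPLY_TOKENS / s for s in DENSE_TOK_S)

print(f"MoE speedup: {speedup_low:.1f}x to {speedup_high:.1f}x")
print(f"MMLU gap: {mmlu_gap} points")
print(f"{REPLY_TOKENS}-token reply: MoE {moe_seconds:.1f}s, "
      f"Dense {dense_seconds[1]:.1f}s to {dense_seconds[0]:.1f}s")
```

In wall-clock terms, that is a long reply arriving in under 7 seconds versus roughly half a minute.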

The E4B at 42.5% on AIME 2026 is worth noting for a sub-5B model. Previous-generation Gemma 3 4B couldn’t approach that on hard math. The PLE architecture genuinely extracts more reasoning per active parameter than standard dense models at that size class.

For coding specifically: LiveCodeBench v6 tracks real competitive programming problems rather than synthetic code completion. The 31B at 80.0% is competitive with larger closed-weight models. If you’re evaluating local AI for development workflows, aicoderscope.com covers the full AI coding tool landscape including local model comparisons.

The Apache 2.0 detail that matters

Gemma 1, 2, and 3 all launched under a custom Google license with enough commercial-use ambiguity that enterprise legal teams routinely blocked deployments. Gemma 4 ships under Apache 2.0—the same terms as Qwen3 and most Mistral releases. No enterprise carve-outs, no revenue thresholds, no special redistribution clauses. You can deploy it in a commercial product, fine-tune and redistribute it, and build SaaS features on top of it without legal review flags.

For personal or research use this changes nothing. For anything customer-facing or distributed as a commercial service, it’s a genuine unlock that Gemma 3 didn’t offer.

The privacy story is clean by default: weights run entirely on local hardware with no outbound connections required. See the local AI privacy audit for a full telemetry breakdown of Ollama and related tools.

Which GPU gets which Gemma 4 model

| VRAM | Best Gemma 4 match | Notes |
| --- | --- | --- |
| 4 GB | E2B Q4_K_M | ~3 GB fit; runs on integrated graphics |
| 6–8 GB | E4B Q4_K_M | Comfortable; 8 GB cards can run E4B Q8 |
| 12 GB | E4B Q8 or 26B Q3 | 26B at Q3 ~12 GB; noticeable quality trade-off |
| 16 GB | 26B MoE Q4 | Cap context or expect slowdown past ~1,500 tokens |
| 24 GB | 26B MoE Q4 (recommended) or 31B Q4 | Full context; 26B is faster, 31B benchmarks higher |
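The table reads naturally as a lookup, sketched below. One assumption goes beyond the table: VRAM sizes that fall between the listed tiers (10 GB, say) are mapped to the nearest tier below. The `recommend` function and `TIERS` list are illustrative names.

```python
# (minimum VRAM in GB, recommended model, note), transcribed from the table above.
# Checked from largest to smallest; in-between sizes fall to the tier below (assumption).
TIERS = [
    (24, "26B MoE Q4 (recommended) or 31B Q4",
     "Full context; 26B is faster, 31B benchmarks higher"),
    (16, "26B MoE Q4", "Cap context or expect slowdown past ~1,500 tokens"),
    (12, "E4B Q8 or 26B Q3", "26B at Q3 ~12 GB; noticeable quality trade-off"),
    (6, "E4B Q4_K_M", "Comfortable; 8 GB cards can run E4B Q8"),
    (4, "E2B Q4_K_M", "~3 GB fit; runs on integrated graphics"),
]


def recommend(vram_gb):
    """Return (model, note) for a card with vram_gb of VRAM, or None below 4 GB."""
    for min_vram, model, note in TIERS:
        if vram_gb >= min_vram:
            return model, note
    return None


for vram in (4, 8, 12, 16, 24):
    model, note = recommend(vram)
    print(f"{vram:>2} GB -> {model} ({note})")
```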

If you want to run the 31B Dense without committing to 24 GB hardware, an RTX 4090 Community pod on RunPod runs it at approximately $0.34/hr (May 2026 pricing). At 28–35 tok/s for the 31B, it’s a practical way to evaluate whether the quality difference over the 26B MoE justifies a hardware upgrade.
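A rough rent-versus-buy sketch helps size that evaluation period. It uses this article's pod rate and used RTX 3090 price range, and deliberately ignores electricity, residual value, and the fact that the pod is a 4090 rather than a 3090, so treat the output as an order-of-magnitude figure only.

```python
# Rent-vs-buy arithmetic from this article's May 2026 figures (simplified).
POD_USD_PER_HOUR = 0.34
USED_3090_USD = (895.0, 1200.0)     # eBay completed-listing range
DENSE_TOK_S = (28.0, 35.0)          # 31B Dense on an RTX 4090, short context


def breakeven_hours(purchase_usd, hourly_usd):
    """Hours of rental that cost the same as buying the card outright."""
    return purchase_usd / hourly_usd


def tokens_per_dollar(tok_s, hourly_usd):
    """Tokens generated per rented dollar at a sustained generation rate."""
    return tok_s * 3600 / hourly_usd


hours_low = breakeven_hours(USED_3090_USD[0], POD_USD_PER_HOUR)
hours_high = breakeven_hours(USED_3090_USD[1], POD_USD_PER_HOUR)
tpd_low = tokens_per_dollar(DENSE_TOK_S[0], POD_USD_PER_HOUR)
tpd_high = tokens_per_dollar(DENSE_TOK_S[1], POD_USD_PER_HOUR)

print(f"Break-even: {hours_low:,.0f} to {hours_high:,.0f} rental hours")
print(f"31B Dense: {tpd_low:,.0f} to {tpd_high:,.0f} tokens per dollar")
```

At these prices, a few dozen hours of evaluation costs a small fraction of the card, which supports the try-before-you-buy approach.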

Honest take

The 26B MoE is the model to run in 2026 if you have 24 GB VRAM. At ~149 tok/s on a 4090, latency is low enough that context depth matters more than raw speed—and 256K context with 82.6% MMLU puts it in the same tier as Llama 3.3 70B at a fraction of the compute cost. The 31B Dense earns its keep only if your workload regularly hits hard math or complex multi-hop reasoning where that 2.6-point MMLU gap surfaces. Most users won’t encounter it in daily use.

For 16 GB GPU owners: the 26B MoE technically runs and is worth trying. If your sessions stay under 1,500 tokens, you’ll see full 40–70 tok/s with no observable degradation. If you’re doing document analysis or sustained research sessions, E4B Q8 delivers a better experience than the 26B MoE fighting its VRAM ceiling with a 3,000-token context.

The E4B is genuinely good at 8 GB. AIME 42.5% from a sub-5B model doesn’t lie. It’s not a replacement for the 26B on complex tasks, but it’s the first Gemma small model where the quality feels like it came from a real language model rather than an aggressively compressed one.

Frequently Asked Questions

Does the Gemma 4 26B MoE actually fit on a 16 GB GPU?
Technically yes, but the model demands approximately 17 GB of VRAM at full capacity. On 16 GB cards it loads and responds, but conversations beyond ~1,500 tokens push KV cache overflow into system RAM and generation speed drops from 40–70 tok/s to single digits. Set PARAMETER num_ctx 2048 in an Ollama Modelfile to cap context and maintain full GPU speed.

What’s the real performance difference between the 26B MoE and the 31B Dense?
On an RTX 4090: ~149 tok/s for the 26B MoE vs ~28–35 tok/s for the 31B Dense. That 4–5× speed difference exists because MoE routing activates only ~4B parameters per token rather than the full 31B. The 31B scores approximately 2.6 percentage points higher on MMLU. For most chat and coding tasks the MoE’s speed advantage matters more than the Dense model’s benchmark edge.

Is Gemma 4 free for commercial use?
Yes. Gemma 4 uses the Apache 2.0 license, which permits commercial deployment, fine-tuning, redistribution, and derivative models with no revenue thresholds or enterprise carve-outs. Previous Gemma releases used a custom Google license with commercial-use restrictions that many legal teams blocked.

Do E2B and E4B process audio?
Yes. Both E-series models natively accept text, images, and audio input within a 128K context window, including speech recognition and speech-to-translated-text across multiple languages. The 26B and 31B accept text and images only, with no audio, within a 256K context window.

How do I pull Gemma 4 with Ollama?

  • ollama pull gemma4:2b (E2B)
  • ollama pull gemma4:4b (E4B)
  • ollama pull gemma4:26b (26B MoE, default Q4_K_M)
  • ollama pull gemma4:31b (31B Dense, default Q4_K_M)

The 26B download is approximately 14 GB; allow 15–30 minutes on a typical home connection.



Last updated May 26, 2026. Model performance and hardware prices change; verify current listings before purchasing.
