Jun 12, 2026

DiffusionGemma 26B for Local AI in 2026: 18GB VRAM, 4× Faster Generation, and Which Consumer GPUs Actually Saturate the 1,000 tok/s Ceiling

By RunAIHome Team · 11 min read

TL;DR: DiffusionGemma 26B-A4B is Google DeepMind’s experimental text-diffusion model that denoises 256-token blocks in parallel instead of writing one token at a time. That gets you 1,000+ tok/s on an H100 and ~700 tok/s on an RTX 5090 — roughly 4× a same-size autoregressive model. The catch: the headline 18GB footprint needs NVFP4, which is Blackwell-only, and there’s no GGUF/Ollama path yet.

	RTX 5090 32GB	RTX 4090 24GB	DGX Spark 128GB
Best for	Native NVFP4, best home speed	Transformers path, no GGUF yet	Compact deskside, big context
Quantization	NVFP4 (18 GB)	bf16 (~52 GB won’t fit) / int4	NVFP4 (18 GB)
Generation speed	~700 tok/s	~200–400 tok/s (community)	150+ tok/s
Street price (Jun 2026)	$2,000+	~$2,250 used	$3,999

Honest take: DiffusionGemma is a research preview, not your daily driver — it scores below standard Gemma 4 on every published benchmark, and Google says so. But if you have a Blackwell card and a throughput-bound workload, the 4× speedup is real. On anything older than RTX 50-series, wait for llama.cpp support before you bother.

Google DeepMind released DiffusionGemma on June 10, 2026 under the Apache 2.0 license. It’s the first open-weights model in the Gemma line built on discrete text diffusion rather than autoregression, and the local-AI community noticed immediately because the speed numbers look like a different category of hardware. Before you git clone anything, you need to understand what diffusion text generation actually demands from your GPU — because it’s not the same shape as running a normal LLM.

What “text diffusion” changes about your hardware

A normal LLM — Llama, Qwen, standard Gemma — is autoregressive. It produces one token, feeds it back in, produces the next, and repeats. Throughput is gated by how fast you can do one forward pass per token, which on consumer hardware is dominated by memory bandwidth: every token requires streaming the active weights through the GPU.

DiffusionGemma works differently. It starts from a block of 256 masked “noise” tokens and iteratively denoises the entire block in parallel, refining all 256 positions at once and self-correcting earlier guesses as it goes. Evaluations use what Google calls the Entropy-Bounded Denoising with Adaptive Stopping (EB) sampler, capped at 48 denoising steps. So instead of 256 sequential forward passes to fill a 256-token block, you do up to 48 passes that each operate on the whole block. That’s where the parallelism — and the speed — comes from.
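To make the pass-count arithmetic concrete, here is a minimal sketch. It only counts forward passes and is not an implementation of the EB sampler; the function names are illustrative, and it assumes the worst case where every block runs the full 48 steps.

```python
import math

BLOCK_TOKENS = 256  # tokens denoised together, per the article
MAX_STEPS = 48      # cap on denoising steps per block, per the article


def autoregressive_passes(n_tokens: int) -> int:
    """One forward pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens: int, steps_per_block: int = MAX_STEPS) -> int:
    """Worst case: every block runs the full number of denoising steps."""
    blocks = math.ceil(n_tokens / BLOCK_TOKENS)
    return blocks * steps_per_block


n = 1024
ar = autoregressive_passes(n)
diff = diffusion_passes(n)
print(f"{n} tokens: {ar} autoregressive passes vs at most {diff} diffusion passes")
print(f"pass reduction: at least {ar / diff:.2f}x")
```

Fewer passes does not translate one-to-one into speed: each diffusion pass works on all 256 positions at once and is heavier than a single-token pass, which is one plausible reason the quoted speedup is nearer 4× than the 5.33× pass reduction.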

The architecture underneath is the Gemma 4 26B-A4B Mixture-of-Experts: 25.2B total parameters, ~3.8B active per step. It’s multimodal on input (text, image, and video in; text out) and supports a context window up to 256K tokens.

The practical consequence for your hardware: DiffusionGemma is more compute-bound and less purely bandwidth-bound than an equivalent autoregressive model, because each denoising step does dense work across a full block. That matters when you pick a GPU. A card with monster bandwidth but weak compute won’t extract the full 4× — and the format you can run determines whether you even get in the door.

The 18GB number, and the NVFP4 asterisk

Every headline says DiffusionGemma “fits in 18GB of VRAM.” That’s true, but only at NVFP4.

At bf16 the 25.2B weights occupy roughly 52GB — the whole expert set has to be resident even though only 3.8B activate per step, the same MoE memory trap that applies to Qwen 3.6 35B-A3B and every other A-class MoE. 52GB doesn’t fit on any single consumer card. To get to 18GB you need NVFP4, NVIDIA’s 4-bit floating-point format.
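The footprint arithmetic can be checked with a rough sketch. It computes raw weight bytes only, from the 25.2B total parameter count, and uses nominal bits per weight for each format; the function name is illustrative.

```python
TOTAL_PARAMS = 25.2e9  # total parameters, per the article

# Nominal bits per weight; real checkpoints add scale factors and
# may keep some layers at higher precision.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "nvfp4": 4}


def raw_weight_gb(params: float, bits: int) -> float:
    """Weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt:>5}: {raw_weight_gb(TOTAL_PARAMS, bits):5.1f} GB raw")
```

The raw numbers (about 50.4, 25.2 and 12.6 GB) sit below the quoted ~52, ~26 and ~18 GB. The gap is presumably scale factors, layers kept at higher precision, and runtime overhead, though the source does not break it down. Note that the calculation uses total parameters, not the 3.8B active ones, which is the MoE memory trap described above.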

Here’s the part most write-ups skip: NVFP4 is a Blackwell-native format. It has hardware tensor-core support on RTX 50-series and the RTX PRO line, but not on Ada (RTX 40-series) or Ampere (RTX 30-series). So the clean 18GB-in-an-RTX-4090 story you see repeated is misleading — a 4090 has the 24GB capacity, but it can’t run NVFP4 with native acceleration. We cover the format in depth in the ComfyUI NVFP4 guide; the same generation rule applies here.

Format	VRAM (weights)	Native on	Reality in June 2026
bf16	~52 GB	All	Exceeds 2× 24GB (48 GB); needs three 24GB cards or a single card with more than 52 GB
FP8	~26 GB	Ada, Blackwell, Hopper	Datacenter path; tight on 24GB consumer
NVFP4	~18 GB	Blackwell (RTX 50-series, RTX PRO)	The “18GB” headline number
GGUF int4	~16 GB (projected)	Any (via llama.cpp)	Not available yet

That last row is the one that stings for most home labs. As of mid-June 2026, llama.cpp GGUF support for DiffusionGemma’s block-diffusion sampler is still an open PR, not a release. No GGUF means no Ollama and no LM Studio yet — those wrap llama.cpp. Day-zero support shipped for vLLM, HuggingFace Transformers, MLX, Unsloth, and NVIDIA NeMo, so the supported local path today is vLLM or raw Transformers, not the one-line ollama pull most readers want. If your stack is Ollama-first, see our vLLM vs Ollama breakdown for what switching costs you.
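The format table above can be encoded as a small lookup, which is handy if you script hardware checks. This is a sketch transcribed from the table, not an official support matrix; the architecture keys and the function name are illustrative, and it covers native hardware support only, not software fallbacks such as int4 through Transformers.

```python
# Native hardware support per format, transcribed from the table above.
NATIVE_FORMATS = {
    "ampere": ["bf16"],
    "ada": ["bf16", "fp8"],
    "hopper": ["bf16", "fp8"],
    "blackwell": ["bf16", "fp8", "nvfp4"],
}

# Approximate weight footprint in GB, from the same table.
WEIGHT_GB = {"bf16": 52, "fp8": 26, "nvfp4": 18}


def smallest_native_format(arch: str) -> str:
    """Return the natively supported format with the smallest footprint."""
    formats = NATIVE_FORMATS[arch.lower()]
    return min(formats, key=WEIGHT_GB.__getitem__)


for arch in NATIVE_FORMATS:
    fmt = smallest_native_format(arch)
    print(f"{arch:<10} -> {fmt} (~{WEIGHT_GB[fmt]} GB)")
```

Only the Blackwell row resolves to a format that fits a single consumer card, which is the whole NVFP4 asterisk in one line of output.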

Real speed numbers, and the 1,000 tok/s ceiling

The “1,000 tokens per second” headline is a datacenter number. At batch size 1, the FP8 build reaches about 1,008 tok/s on a single H100 and 1,288 tok/s on an H200; NVIDIA quotes up to 2,000 tok/s on a DGX Station. Those are the figures behind “4× faster,” a comparison against a same-size autoregressive model. For a consumer reference point, autoregressive Gemma 4 27B does roughly 40 tok/s on an RTX 4090. That figure comes from different hardware, so dividing 1,008 by 40 does not give the 4× ratio.

What you actually get at home:

Hardware	Memory BW	DiffusionGemma speed	Notes
H100 SXM	~3.35 TB/s	~1,008 tok/s (FP8)	The “1,000 tok/s” headline
H200	~4.8 TB/s	~1,288 tok/s (FP8)	Datacenter
RTX 5090 32GB	1,792 GB/s	~700 tok/s	Best consumer number, native NVFP4
DGX Spark 128GB	273 GB/s (LPDDR5X)	150+ tok/s	Compact deskside, huge context headroom
RTX 4090 24GB	1,008 GB/s	~200–400 tok/s	Community estimate; no native NVFP4

So no consumer card saturates the 1,000 tok/s ceiling; that takes H100-class compute and bandwidth. The RTX 5090 gets closest at ~700 tok/s because it pairs 1,792 GB/s of bandwidth with native NVFP4 tensor cores. It is, today, the only consumer GPU that runs DiffusionGemma the way it was designed to run. The RTX PRO 6000 Blackwell also qualifies and adds 96GB for long-context work, but at workstation prices.
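To see how far each option sits from the headline figure, the throughput numbers from the table above can be turned into a share of the ceiling and a wall-clock time for a fixed job. This is a rough sketch: ranges and “150+” are reduced to their lower bound, and the 100,000-token job size is an arbitrary illustration.

```python
# Batch-size-1 throughput in tok/s, from the table above.
# Ranges and "150+" are reduced to their lower bound.
TOK_PER_S = {
    "H100 SXM (FP8)": 1008,
    "H200 (FP8)": 1288,
    "RTX 5090 (NVFP4)": 700,
    "RTX 4090 (low end of range)": 200,
    "DGX Spark": 150,
}
CEILING = 1000  # the headline tok/s figure


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    return tokens / tok_per_s


for name, rate in TOK_PER_S.items():
    share = rate / CEILING
    print(f"{name:<28} {share:6.0%} of ceiling, "
          f"{seconds_for(100_000, rate):6.1f} s per 100k tokens")
```

These are steady-state generation rates; real jobs also pay for prompt processing and model load time, which this sketch ignores.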

The RTX 4090 is the interesting tweener. It has the VRAM and the bandwidth, but no NVFP4 acceleration, so you’re stuck running a heavier format through Transformers; community reports land around 200–400 tok/s. That’s still 5–10× the roughly 40 tok/s that autoregressive Gemma 4 manages on the same card, but it’s well short of the 5090’s ~700 tok/s, and you’re paying ~$2,250 used for a card that’s now mid-pack for this model.

What about RTX 3090 and the budget tier?

This is where DiffusionGemma diverges hard from the usual local-AI advice. Normally a used RTX 3090 — around $1,050 on eBay in June 2026, down from its peak but no longer the $500 bargain it once was — is the value king for 24GB workloads. Here it’s a poor fit:

No NVFP4. Ampere can’t accelerate the format that makes 18GB possible.
No GGUF yet. The int4 path that would let a 3090 run this hasn’t shipped.
Bandwidth gap. At 936 GB/s the 3090 trails the 5090’s 1,792 GB/s by nearly half, and diffusion’s compute-heavy steps don’t favor Ampere.

If you own a 3090, the right move is to keep running autoregressive Gemma 4 or Qwen on it and revisit DiffusionGemma when llama.cpp lands a GGUF. Buying a 3090 for DiffusionGemma makes no sense today.

For anyone who just wants to try the model without buying Blackwell, renting is the rational call: an H100 or RTX 5090 instance on RunPod lets you run the supported NVFP4/FP8 path for a few dollars an hour and see whether the diffusion approach fits your workload before committing to hardware.
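A quick break-even estimate helps frame the rent-versus-buy decision. The sketch below is deliberately crude: the hourly rate is a hypothetical placeholder rather than a quoted RunPod price, and it ignores electricity, resale value, and idle time.

```python
def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """Hours of rental that cost as much as buying the card outright."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_price / hourly_rate


CARD_PRICE = 2000.0  # RTX 5090 street-price floor from the article
HOURLY_RATE = 2.0    # HYPOTHETICAL placeholder, not a quoted rental price

hours = break_even_hours(CARD_PRICE, HOURLY_RATE)
print(f"break-even after {hours:.0f} rental hours")
```

Swap in the real hourly rate from your provider before drawing conclusions; for a short evaluation of a research-preview model, the rental side is likely to come out far ahead.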

The quality caveat nobody should skip

Speed is the whole pitch, and speed comes at a cost. On Google’s own published benchmarks, DiffusionGemma 26B-A4B scores 77.6% on MMLU Pro, 73.2% on GPQA Diamond, and 70.5% on MATH-Vision — respectable numbers, but below standard Gemma 4 on every task measured. Google explicitly recommends Gemma 4 for production and frames DiffusionGemma as experimental.

That reframes who this model is for. It is not a quality upgrade. It’s a latency-and-throughput tool: useful when you need to generate a lot of acceptable-quality text fast (bulk drafting, synthetic data, low-latency UI fills, agentic loops where you re-roll cheaply), and a downgrade when you need the single best answer. Match the tool to the job and the 4× is a real win. Reach for it as a Gemma 4 replacement and you’ll be disappointed.

How to actually run it today

The supported local path right now is vLLM on a Blackwell card. The short version:

# Requires a Blackwell GPU (RTX 50-series / RTX PRO) for native NVFP4
pip install -U vllm
vllm serve google/diffusiongemma-26B-A4B-it \
  --quantization nvfp4 \
  --max-model-len 8192

Expected behavior on an RTX 5090: the NVFP4 weights load into ~18GB, leaving headroom for KV cache at moderate context, and you’ll see block-parallel generation in the ~700 tok/s range at batch size 1. On an RTX 4090 the NVFP4 path has no native acceleration, and simply dropping the --quantization nvfp4 flag would load bf16 weights at ~52GB, which does not fit in 24GB. You need a different 4-bit format there (int4, per the comparison table at the top), and should expect lower speed than on Blackwell.
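Once the server from the command above is running, you can check throughput yourself. The sketch below assumes vLLM’s default address (localhost, port 8000) and its OpenAI-compatible completions route; whether the DiffusionGemma build reports usage exactly this way is an assumption, and the helper names are illustrative.

```python
import json
import time
import urllib.request

# Assumptions: the server is listening on vLLM's default address and
# exposes the usual OpenAI-compatible /v1/completions route.
BASE_URL = "http://localhost:8000/v1"
MODEL = "google/diffusiongemma-26B-A4B-it"


def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Request body for a plain text completion."""
    return {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    return completion_tokens / elapsed_s


def measure(prompt: str, max_tokens: int = 256) -> float:
    """Send one completion request and return end-to-end tok/s."""
    request = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)
```

Calling `measure("Write a haiku about GPUs")` returns a single end-to-end figure that includes prompt processing and HTTP overhead, so expect it to read lower than a pure generation benchmark.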

A common first error on 16GB cards:

torch.OutOfMemoryError: CUDA out of memory

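There’s no fix on a 16GB card today: 16GB is below the ~18GB NVFP4 weight footprint before any KV cache is allocated, and the RTX 4080 is an Ada card without native NVFP4 in any case. Weights plus KV cache make a 24GB-or-larger card the practical consumer minimum. Wait for GGUF if all you have is 16GB, though at a projected ~16GB for int4 weights even that will be tight.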
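A preflight check along these lines can catch the problem before the weights start loading. It is a minimal sketch: the 2GB reserve for KV cache and activations is a hypothetical allowance, not a measured figure, and the function name is illustrative.

```python
NVFP4_WEIGHTS_GB = 18.0  # weight footprint quoted in the article


def fits(vram_gb: float, weights_gb: float = NVFP4_WEIGHTS_GB,
         reserve_gb: float = 2.0) -> bool:
    """True if weights plus a reserve for KV cache and activations fit.

    reserve_gb is a HYPOTHETICAL allowance, not a measured figure.
    """
    return vram_gb >= weights_gb + reserve_gb


for name, vram in [("16GB card", 16), ("24GB card", 24), ("32GB card", 32)]:
    verdict = "fits" if fits(vram) else "out of memory"
    print(f"{name}: {verdict}")
```

On a machine with PyTorch installed, `torch.cuda.get_device_properties(0).total_memory / 1e9` gives the value to pass as `vram_gb`. Raise `reserve_gb` if you plan to run long contexts.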

FAQ

Can I run DiffusionGemma in Ollama? Not yet. Ollama wraps llama.cpp, and llama.cpp’s GGUF support for DiffusionGemma’s diffusion sampler is an open pull request as of June 2026, not a release. Until that merges, use vLLM or HuggingFace Transformers.

Is the 18GB VRAM figure real? Yes, but only with NVFP4 quantization, which has native hardware support only on Blackwell GPUs (RTX 50-series and RTX PRO). At bf16 the model needs ~52GB.

Will it run on an RTX 4090? It can, but not at NVFP4 with native acceleration — the 4090 is Ada, not Blackwell. Community reports put it around 200–400 tok/s via a heavier format in Transformers, versus ~700 tok/s native on an RTX 5090.

Is it better than regular Gemma 4? No. DiffusionGemma scores below Gemma 4 on every published benchmark. It’s faster, not smarter. Google recommends Gemma 4 for production and treats DiffusionGemma as experimental.

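What’s the actual speedup? Up to ~4× versus a same-size autoregressive model, per the headline claim. Measured figures are about 1,008 tok/s on H100 (FP8) and ~700 tok/s on RTX 5090. For scale, autoregressive Gemma 4 27B does roughly 40 tok/s on an RTX 4090, but that is a different card and not the baseline behind the 4× figure.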

Does it do coding like other Gemma models? It handles general text, with multimodal input (text, image, video). For dedicated local coding, a purpose-built model like Qwen 3.6 35B-A3B is a better fit. For cloud coding tools, see our sister site aicoderscope.com.

Last updated June 12, 2026. Prices and specs change; verify current rates before purchasing. DiffusionGemma is an experimental research preview — confirm framework support before building on it.