Jun 8, 2026

ComfyUI NVFP4 in 2026: 3× Faster Image Generation on RTX 50-Series (and the Right Format for RTX 40-Series)

By RunAIHome Team · 13 min read

comfyui · nvfp4 · flux · image-generation · rtx-5090 · rtx-4090 · quantization · gpu · blackwell

TL;DR: NVFP4 is a Blackwell-exclusive quantization format that pushes FLUX 1 Dev to 7.73 it/s — 118% faster than GGUF Q8 and 84% faster than FP8 Scaled — while cutting VRAM from 26 GB (BF16) to 14 GB. The catch: it requires CUDA 13.0 and an RTX 50-series GPU. On RTX 40-series, NVFP4 delivers no speedup and can actually run 2× slower than FP8 if you don’t have the right PyTorch build. RTX 40-series owners should use FP8 Scaled instead.

	RTX 50-Series + NVFP4	RTX 40/30-Series + FP8 Scaled	RTX 40/30-Series + BF16
Best for	Maximum throughput on Blackwell	Speed + quality on Ada/Ampere	Full fidelity, no quality loss
FLUX 1 Dev speed (RTX 5090 benchmark)	7.73 it/s	4.21 it/s	4.53 it/s
VRAM (FLUX SRPO)	14 GB	~17 GB	26 GB
The catch	RTX 50-series only, needs CUDA 13	No hardware FP8 speedup on 30-series	24+ GB card mandatory

Honest take: If you own an RTX 50-series card, NVFP4 with PyTorch cu130 is the single highest-impact setting change you can make — 7 minutes to set up, nearly 2× faster generation immediately. If you’re on RTX 40-series, skip NVFP4 entirely and use FP8 Scaled checkpoints, which give you 40% VRAM savings with near-identical quality.

What NVFP4 Actually Is

NVFP4 is NVIDIA’s own 4-bit floating-point quantization format, introduced with the Blackwell architecture. It is not the same as GGUF Q4, NF4, or bitsandbytes FP4: those are generic community formats whose 4-bit weights are unpacked to a higher precision in software before the math runs, whatever GPU you use. NVFP4 uses dedicated FP4 instructions wired into the 5th-generation Tensor Cores on Blackwell’s sm120 architecture. The math runs natively in silicon.

The format uses a two-level scaling scheme: a global scale factor per tensor, plus per-block scale factors. This preserves dynamic range better than naive 4-bit truncation, which is why quality degradation is minimal on most FLUX workflows despite the aggressive compression.
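To make the two-level idea concrete, here is a minimal pure-Python sketch of block-scaled 4-bit quantization. It is an illustration only, not NVIDIA's kernel or ComfyUI's loader. The block size of 16 and the E2M1 value grid are assumptions taken from NVIDIA's public description of the format, and the block scales are kept as ordinary floats rather than being rounded to an 8-bit type.

```python
# Illustrative two-level block scaling (assumptions: 16-value blocks, E2M1 grid).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes a 4-bit E2M1 value can hold
BLOCK = 16


def nearest_fp4(x):
    """Round x to the nearest signed value on the E2M1 grid."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def quantize(values):
    """Return (global_scale, block_scales, codes) for a non-empty flat list of floats."""
    amax = max(abs(v) for v in values) or 1.0
    global_scale = amax / 6.0  # level 1: one scale for the whole tensor
    block_scales, codes = [], []
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        bmax = max(abs(v) for v in block)
        # level 2: one scale per block, stored relative to the global scale
        bscale = (bmax / 6.0) / global_scale if bmax else 1.0
        block_scales.append(bscale)
        step = bscale * global_scale
        codes.extend(nearest_fp4(v / step) for v in block)
    return global_scale, block_scales, codes


def dequantize(global_scale, block_scales, codes):
    """Rebuild approximate floats from the 4-bit codes and both scale levels."""
    return [c * block_scales[i // BLOCK] * global_scale for i, c in enumerate(codes)]


if __name__ == "__main__":
    # One block of small values next to one block of large values.
    vals = [0.01 * k for k in range(16)] + [10.0 * k for k in range(16)]
    g, bs, codes = quantize(vals)
    restored = dequantize(g, bs, codes)
    print("block scales:", bs)
    print("small block restored:", [round(v, 4) for v in restored[:16]])
```

With a single global scale, every value in the small block would round to zero because the step size is set by the largest value in the tensor. The per-block scale gives each group of 16 its own step size, which is the dynamic-range benefit the paragraph above describes.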

RTX 40-series (Ada Lovelace, sm89) has FP8 tensor cores but no FP4 datapath. NVFP4 will technically load on an RTX 4090, but without native FP4 acceleration, PyTorch falls back to software emulation — which is why NVIDIA explicitly warns that running NVFP4 without PyTorch cu130 can be up to 2× slower than FP8. That’s not a misconfiguration; it’s the expected behavior when emulating FP4 math on hardware built for FP8.

The Numbers: FLUX 1 Dev on RTX 5090 with CUDA 13

Benchmarks from Furkan Gözükara’s FLUX precision comparison (RTX 5090, CUDA 13, 2048px, Quality 1 preset) on FLUX 1 Dev:

Format	Speed (it/s)	vs GGUF Q8
NVFP4	7.73	+118%
BF16	4.53	+28%
FP8 Scaled	4.21	+19%
GGUF Q8 (baseline)	3.54	—

For FLUX SRPO on the same hardware: 5.7 seconds for 40 steps at NVFP4, using 14 GB VRAM vs 26 GB for the BF16 equivalent — a 46% reduction in VRAM footprint.
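The percentages quoted in this section follow directly from the table. A short sketch that recomputes them from the it/s and VRAM figures above (the helper names are ours, chosen for illustration):

```python
# it/s figures copied from the benchmark table above (RTX 5090, CUDA 13)
speeds = {"NVFP4": 7.73, "BF16": 4.53, "FP8 Scaled": 4.21, "GGUF Q8": 3.54}


def pct_faster(a, b):
    """Percent by which speed a exceeds speed b."""
    return (a / b - 1.0) * 100.0


def pct_reduction(before, after):
    """Percent reduction going from `before` to `after`."""
    return (1.0 - after / before) * 100.0


if __name__ == "__main__":
    baseline = speeds["GGUF Q8"]
    for name, s in speeds.items():
        print(f"{name:11s} {pct_faster(s, baseline):+6.0f}% vs GGUF Q8")
    print(f"NVFP4 vs FP8 Scaled: {pct_faster(speeds['NVFP4'], speeds['FP8 Scaled']):+.0f}%")
    print(f"VRAM 26 GB to 14 GB: {pct_reduction(26, 14):.0f}% smaller")
```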

For reference, raw FLUX Dev FP8 generation times across GPU tiers (from the ComfyUI GitHub benchmark discussion, 20 steps, standard workflow):

GPU	Time (s)	Speed (it/s)
RTX 5090	5.46	3.66
RTX 5080	6.67	3.23
RTX 5060 Ti	25.71	1.20
RTX 4090	11.28	1.85
RTX 3090	~26	~0.77

NVFP4 on a properly configured RTX 5090 takes FLUX dev from 5.46s (FP8) to approximately 2.6–3.0 seconds per generation, comfortably inside NVIDIA’s publicized “5 seconds for FLUX dev” figure on RTX 5090 at FP4. Meanwhile an RTX 4090 on FP8 lands at 11.28 seconds, still 2× slower than a Blackwell mid-range card doing NVFP4, even though the 4090 outspecs the 5080 on paper in other metrics.

GPU Tier Guide: Which Format to Use

RTX 50-Series (RTX 5060 Ti, 5070, 5070 Ti, 5080, 5090): Use NVFP4

Every RTX 50-series card — including the RTX 5060 Ti 16GB at the budget end — carries Blackwell’s sm120 Tensor Cores with native FP4 hardware. NVFP4 is your native format. The speed advantage over FP8 is real (roughly 84% faster on FLUX 1 Dev), VRAM savings are significant, and quality degradation on FLUX models is acceptable for most production workflows.

The only prerequisite is getting PyTorch on CUDA 13 (see setup steps below). Without that, you’re running software-emulated FP4 and will likely see worse performance than FP8.

One nuance for the RTX 5060 Ti: the 16GB card can load NVFP4 FLUX 1 Dev (which needs ~14 GB) with 2 GB of headroom. That’s tight for large batches or multi-ControlNet workflows. FP8 Scaled at ~17 GB is over the limit, so NVFP4 is actually the only path to running full-resolution FLUX 1 Dev on that card. The same caveat applies to the other 16GB cards in the lineup (RTX 5070 Ti, RTX 5080), and the 12GB RTX 5070 falls short of even the NVFP4 footprint, so it would need offloading. Only the 32GB RTX 5090 has comfortable headroom.

RTX 40-Series (RTX 4070, 4080, 4090): Use FP8 Scaled — Skip NVFP4

The RTX 4090, RTX 4080 Super, RTX 4070 Ti Super: none of them have native FP4 Tensor Cores. NVFP4 will load on these cards if you try, but it runs in emulation mode, where it is at best comparable to FP8 and at worst 2× slower. The community consensus: stick to FP8 Scaled (also called NVFP8 in NVIDIA’s naming).

FP8 Scaled on RTX 40-series delivers:

~40% VRAM reduction vs BF16
Speed comparable to BF16 (4.21 it/s for FP8 vs 4.53 it/s for BF16 in the RTX 5090 benchmark above, essentially tied, with the VRAM savings being the actual win)
No meaningful quality difference vs BF16 in practice

If you see people benchmarking NVFP4 on RTX 4090 and getting impressive numbers, check whether they’re on PyTorch cu130 with CUDA 13 and whether they’ve confirmed FP4 hardware acceleration is actually being used. Without a Blackwell card, the claimed speedup won’t materialize.

RTX 30-Series (RTX 3090, 3080, 3060): FP8 Scaled or GGUF

The RTX 3090 runs FLUX Dev at ~26 seconds per image in FP8, roughly 5× slower than an RTX 5090 on FP8 (5.46 s) and further behind an RTX 5090 doing NVFP4. There’s no quantization format that closes that gap on Ampere hardware: Ampere has no FP8 Tensor Cores, so FP8 on 30-series is emulated in software as well and delivers minimal speedup over FP16/BF16. GGUF Q4-Q8 is often the best choice here because it’s memory-efficient without requiring hardware acceleration.

That said, the RTX 3090 with its 24GB VRAM still has an edge over 16GB cards for running larger models without quality-hurting quantization. It’s not a speed demon for image generation in 2026, but for tasks where VRAM ceiling matters more than throughput, it holds up.
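The tier guide above can be condensed into a small lookup keyed on CUDA compute capability. This is a sketch of this article's recommendations, not an official tool. The capability values (12.x for RTX 50-series, 8.9 for RTX 40-series, 8.6 for RTX 30-series) are the published ones for those consumer cards, and the function name is hypothetical.

```python
def recommend_format(major, minor):
    """Map a CUDA compute capability to the format this guide suggests."""
    if major == 12:
        return "NVFP4"               # RTX 50-series (sm120): native FP4 Tensor Cores
    if (major, minor) == (8, 9):
        return "FP8 Scaled"          # RTX 40-series (sm89): native FP8, no FP4 datapath
    if (major, minor) == (8, 6):
        return "FP8 Scaled or GGUF"  # RTX 30-series: no FP8 hardware either
    return "not covered by this guide"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch not installed; call recommend_format(major, minor) by hand")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)  # e.g. (12, 0)
            print(torch.cuda.get_device_name(0), cap, "->", recommend_format(*cap))
        else:
            print("No CUDA device visible to PyTorch")
```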

Available NVFP4 Model Checkpoints

Black Forest Labs has released NVFP4-quantized versions of their main models, available on Hugging Face:

FLUX.1-dev-NVFP4 — the standard text-to-image dev model at ~14 GB
FLUX.2-dev-NVFP4 — the updated successor, same ~14 GB footprint
FLUX.1-Kontext-dev-NVFP4 — the image-editing model (covered in the FLUX Kontext guide)

NVIDIA has also released NVFP4 checkpoints for:

Z-Image (Alibaba) — a high-speed turbo model
Qwen-Image (Alibaba)

LTX-2.3 (Lightricks’ video generation model) was announced as coming to NVFP4 support. If you’re running WAN video generation, check the WAN GPU guide for context on video model VRAM requirements.

All NVFP4 checkpoints use the .safetensors format and load through ComfyUI’s standard diffusion model loader once you’ve upgraded PyTorch to cu130.
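Because these are ordinary .safetensors files, you can inspect one without loading it onto the GPU. A .safetensors file starts with an 8-byte little-endian length followed by a JSON header describing every tensor. The sketch below tallies the dtype strings that header declares. How NVFP4 weights are labeled inside a given checkpoint is not something this article specifies, so treat the output as a quick sanity check rather than a definitive format test. The file path in the usage line is a placeholder.

```python
import json
import struct
import sys
from collections import Counter


def header_from_bytes(blob):
    """Parse the JSON header from the leading bytes of a .safetensors file."""
    (n,) = struct.unpack("<Q", blob[:8])  # header length, little-endian uint64
    return json.loads(blob[8:8 + n])


def dtype_counts(header):
    """Count tensors per declared dtype, skipping the optional metadata entry."""
    return Counter(
        info["dtype"] for name, info in header.items() if name != "__metadata__"
    )


def checkpoint_dtypes(path):
    """Read only the header of the file at `path` and tally its dtypes."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        (n,) = struct.unpack("<Q", prefix)
        return dtype_counts(header_from_bytes(prefix + f.read(n)))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_checkpoint.py path/to/model.safetensors
    for dtype, count in checkpoint_dtypes(sys.argv[1]).most_common():
        print(f"{dtype:10s} {count} tensors")
```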

Setup: Upgrading to PyTorch CUDA 13 for NVFP4

This is the step most users miss. NVFP4 hardware acceleration requires PyTorch built with CUDA 13.0. Without it, you’re emulating FP4 in software and the speedup disappears — or reverses.

Step 1: Update your NVIDIA driver

You need driver version ≥580. Check your current version with:

nvidia-smi

If below 580, download the latest driver from NVIDIA’s site before continuing. Older drivers don’t expose the FP4 instruction path to CUDA.
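If you would rather script the check, the sketch below reads the driver version through nvidia-smi's query interface and compares its major number against the 580 threshold from this step. The helper names are ours, and the script assumes nvidia-smi is on your PATH.

```python
import subprocess

MIN_DRIVER_MAJOR = 580  # threshold quoted in Step 1


def driver_major(version):
    """Extract the major number from a version string such as '580.65.06'."""
    return int(version.strip().split(".")[0])


def installed_driver_versions():
    """Return one driver version string per GPU, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        versions = installed_driver_versions()
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not found or failed; is the NVIDIA driver installed?")
    else:
        for v in versions:
            ok = driver_major(v) >= MIN_DRIVER_MAJOR
            print(f"driver {v}: {'OK' if ok else 'too old, update before continuing'}")
```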

Step 2: Install PyTorch with CUDA 13.0 (cu130)

For the embedded Python that ships with ComfyUI portable (Windows):

python_embeded\python.exe -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

For a standard Python venv (Linux or custom Windows install):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

This replaces your existing PyTorch installation. The cu130 wheel includes CUDA 13.0 runtime support. As of mid-2026, torch-2.9.1+cu130 is the stable build available on this index.

Step 3: Verify the installation

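import torch
print(torch.__version__)                    # should show 2.9.x+cu130
print(torch.version.cuda)                   # CUDA version the wheel was built with; expect 13.0
print(torch.cuda.is_available())            # True
print(torch.cuda.get_device_name(0))        # your GPU name
print(torch.cuda.get_device_capability(0))  # (12, 0) on RTX 50-series (sm120)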

If torch.cuda.is_available() returns False, the usual causes are a driver that is too old or a CPU-only PyTorch build left in place. Re-check Steps 1 and 2.

Step 4: Download and place the NVFP4 checkpoint

Download FLUX.1-dev-NVFP4 (or your chosen model) from Hugging Face and place it in one of:

ComfyUI/models/diffusion_models/ (preferred for FLUX models)
Or ComfyUI/models/checkpoints/ if your workflow uses the checkpoint loader

Step 5: Load in ComfyUI

Use the standard Load Diffusion Model node (not the legacy Checkpoint Loader). Point it at your NVFP4 .safetensors file. ComfyUI’s backend detects the NVFP4 format automatically and routes to the FP4 hardware path if running on a Blackwell GPU with cu130.

You do not need a custom node or ComfyUI Manager plugin. NVFP4 is part of ComfyUI’s native model loading pipeline as of the current release.

What Can Go Wrong

“NVFP4 is slower than FP8 on my RTX 5090”

Almost always a PyTorch version issue. Print torch.__version__ and confirm you see cu130. If you see cu121, cu124, or similar, the CUDA 13 backend isn’t loaded and you’re in software emulation. Re-run the pip install command from Step 2; if pip reports the requirement as already satisfied, add --upgrade --force-reinstall to force the cu130 build.
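A small helper can make that check unambiguous, for example as a startup guard in a launch script. The function names are hypothetical. The sketch only parses the version string's local tag, so it assumes a wheel from the PyTorch index that carries a +cuXXX suffix.

```python
def cuda_tag(torch_version):
    """'2.9.1+cu130' -> 'cu130'; returns None when the string has no local tag."""
    _, sep, local = torch_version.partition("+")
    return local if sep else None


def is_cu130(torch_version):
    """True when the installed wheel was built against CUDA 13.0."""
    return cuda_tag(torch_version) == "cu130"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        tag = cuda_tag(torch.__version__)
        print(f"torch {torch.__version__} (build CUDA: {torch.version.cuda})")
        if not is_cu130(torch.__version__):
            print(f"Expected a cu130 build, found {tag}; reinstall from the cu130 index")
```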

“ComfyUI throws an error on NVFP4 model loading”

Check ComfyUI version. NVFP4 model loading was stabilized in releases after early 2026. Pull the latest main branch or download a recent portable build. The issue tracker (GitHub issue #11864) documents some earlier failures on RTX 5090 with Wan 2.2 and FLUX 2 Dev — these were patched in subsequent ComfyUI updates.

“I get ‘CUDA out of memory’ despite NVFP4’s smaller footprint”

NVFP4 for FLUX 1 Dev uses ~14 GB for the diffusion model itself, but you still need VRAM for the VAE (~500 MB–1 GB), T5-XXL text encoder (4–9 GB depending on precision), and activations during sampling. On a 16GB card, load the FP8-scaled T5 encoder (t5xxl_fp8_e4m3fn_scaled.safetensors) to keep the text encoder under 5 GB. Total at that point: ~14 + 5 + 1 = ~20 GB — still over the 16GB limit, which means 16GB cards may need to rely on CPU offloading for the text encoder.
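The budget above is simple addition, and a few lines make it easy to try different component sizes against your card. The defaults are the rough figures from this section. Sampling activations are not included, so treat a result that only just fits as a likely out-of-memory in practice.

```python
def vram_budget_gb(diffusion=14.0, text_encoder=5.0, vae=1.0):
    """Sum the approximate resident sizes (GB) quoted in this section."""
    return diffusion + text_encoder + vae


def fits(card_gb, **parts):
    """True when the summed components fit in card_gb, before activations."""
    return vram_budget_gb(**parts) <= card_gb


if __name__ == "__main__":
    print("everything on GPU:", vram_budget_gb(), "GB, fits 16 GB card:", fits(16))
    # Text encoder offloaded to CPU: it no longer counts against VRAM.
    print("T5 offloaded:", vram_budget_gb(text_encoder=0.0), "GB, fits 16 GB card:",
          fits(16, text_encoder=0.0))
    print("fits 32 GB card:", fits(32))
```

With the text encoder offloaded, the sum drops to ~15 GB, which fits a 16GB card on paper but leaves little room for activations.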

“Quality looks worse than FP8”

NVFP4 does introduce more compression than FP8. On photorealistic prompts with fine text, intricate fabrics, or complex backgrounds, NVFP4 can show minor softening compared to FP8 Scaled at close inspection. For most portrait, concept art, and product workflows, the difference is not meaningful. If quality is your first priority over speed, run FP8 Scaled — you lose the 2× speed advantage but retain virtually identical quality to BF16.

When NVFP4 Changes the Calculus on Cloud vs Local

If you’re currently renting GPU time on RunPod or similar platforms because your local hardware couldn’t keep up with FLUX generation speeds, NVFP4 changes the math. An RTX 5090 at NVFP4 produces ~20 images per minute on FLUX 1 Dev at 20 steps. At the going cloud rate of roughly $0.055 per 100 images (RTX 5090 rentals), you’d pay $16.50 for 30,000 images, so on rental cost alone a used RTX 5090 purchased for ~$1,700 only breaks even around 3 million images ($1,700 ÷ $0.00055 per image), which is roughly 2,600 hours of generation at 20 images per minute. The case for buying therefore rests on sustained heavy use, or on lower-throughput tasks like FLUX Kontext editing, where break-even comes sooner because API costs for Kontext editing run higher per image.

Once the local box is earning its keep at these speeds, it’s worth running ComfyUI as a real service rather than a terminal session: the ComfyUI on Linux production setup guide covers the systemd unit, HTTPS via Caddy, and Tailscale remote access that turn a workstation into an always-on household render server.
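To rerun the cloud-versus-local comparison with your own prices, the arithmetic fits in a few lines. The constants are the figures quoted in this section ($0.055 per 100 images, a ~$1,700 card, ~20 images per minute). Electricity and resale value are deliberately left out, so this is a rough sketch rather than a full cost model.

```python
CLOUD_USD_PER_100_IMAGES = 0.055  # cloud rate quoted above
CARD_USD = 1700.0                 # used RTX 5090 price quoted above
IMAGES_PER_MINUTE = 20.0          # RTX 5090 NVFP4 throughput quoted above


def cloud_cost_usd(images, usd_per_100=CLOUD_USD_PER_100_IMAGES):
    """Rental cost of generating `images` images in the cloud."""
    return images * usd_per_100 / 100.0


def break_even_images(card_usd=CARD_USD, usd_per_100=CLOUD_USD_PER_100_IMAGES):
    """Image count at which cumulative cloud spend equals the card price."""
    return card_usd / (usd_per_100 / 100.0)


def hours_to_generate(images, per_minute=IMAGES_PER_MINUTE):
    """Wall-clock generation time for `images` images at the given throughput."""
    return images / per_minute / 60.0


if __name__ == "__main__":
    n = break_even_images()
    print(f"30,000 images in the cloud: ${cloud_cost_usd(30_000):.2f}")
    print(f"break-even: {n:,.0f} images, about {hours_to_generate(n):,.0f} hours of generation")
```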

FAQ

Can I use NVFP4 on an RTX 4090 and get the 3× speedup?

No. The RTX 4090 has no native FP4 tensor core hardware. The 3× speedup requires Blackwell’s sm120 architecture. On an RTX 4090, NVFP4 either performs similarly to FP8 (best case with CUDA 13 software path) or is up to 2× slower (without CUDA 13).

Does NVFP4 work for Stable Diffusion 1.5 or SDXL models?

Not yet. NVFP4 checkpoints require the model to be re-quantized and released in NVFP4 format. Community SD 1.5 and SDXL checkpoints (.ckpt, .safetensors) are not NVFP4 — they’re standard FP16/BF16 or community-quantized formats. For SDXL on RTX 50-series, FP8 Scaled is still the practical fast path.

Will NVFP4 help with video generation models like WAN or LTX?

LTX-2.3 NVFP4 support was announced and is expected to be available. For WAN 2.x, NVFP4 support depends on model releases from the WAN team. As of mid-2026, these are still primarily running FP8 Scaled for video generation — see the WAN video guide for current options.

Do I need to reinstall ComfyUI to get NVFP4 support?

No. You only need to upgrade PyTorch to the cu130 build. ComfyUI itself detects NVFP4 format files automatically once PyTorch has CUDA 13 support.

Is there a way to use NVFP4 on AMD or Intel GPUs?

No. NVFP4 is NVIDIA Blackwell-specific. It relies on sm120 instructions that don’t exist on AMD RDNA or Intel Arc architectures. AMD ROCm users should look at FP8 and GGUF quantization instead.

Last updated June 8, 2026. GPU prices and benchmark results change as drivers and software mature; verify current figures before purchasing.
