NVIDIA Nemotron 3 Ultra for Local AI in 2026: 550B/55B-Active MoE, 1M Context, NVFP4 — Which Consumer GPU Can Actually Run It
TL;DR: Nemotron 3 Ultra is NVIDIA’s June 4, 2026 open-weight flagship — a 550B-parameter Mixture-of-Experts model with 55B active per token, a 1M-token context window, and a native NVFP4 4-bit checkpoint. The catch: NVFP4 still weighs ~275GB, so it’s a datacenter model, not a home-lab one. The right local move is to run the smaller Nemotron 3 family members — Nano 30B-A3B fits a single 24GB card — and reach for Ultra through the API.
| | Nemotron 3 Ultra (550B) | Nemotron 3 Nano (30B-A3B) | Ultra via API/cloud |
|---|---|---|---|
| Best for | Datacenter agents, 8×H100/H200 | Single-GPU home labs | Trying Ultra without the rig |
| Smallest footprint | ~275GB NVFP4 / 189GB 1-bit GGUF | ~20.7GB Q4_K_M | None — managed |
| Runs on | 8× H100 80GB (640GB) min | RTX 3090 / 4090 (24GB) | Any device |
| Speed | ~40 tok/s (llama.cpp on 4× B200) to 300+ tok/s (cloud endpoint) | 30B-MoE-class speed (100+ tok/s range) on one card | ~140 tok/s output |
| The catch | No consumer GPU holds it | Not the 550B brain | Prompts leave your machine |
Honest take: Nemotron 3 Ultra is a genuinely strong open model — but for a home lab it’s an API model, full stop. If you want NVIDIA’s reasoning quality on hardware you own, run Nemotron 3 Nano 30B-A3B on a single 24GB GPU and call the Ultra endpoint only for the hard agentic runs.
What Nemotron 3 Ultra actually is
NVIDIA announced Nemotron 3 Ultra at Computex 2026 on June 1 and published the weights on June 4. It’s the top of a three-model family — Nano (30B-A3B), Super (120B-A12B), and Ultra (550B-A55B) — and it’s the first one NVIDIA positions as an open frontier model rather than a distillation target.
The headline numbers: 550 billion total parameters, 55 billion active per token, a 1M-token context window, and a license that’s unusually generous for a model this size — OpenMDW-1.1, the Linux Foundation’s open-weights license, which releases the weights, the training datasets (including 173 billion tokens of code), and the recipes. That’s a real differentiator. Kimi K2.7 ships under a modified MIT license and GLM 5.2 under MIT, but neither publishes its training data the way NVIDIA does here.
Architecturally it’s not a vanilla transformer MoE. Nemotron 3 Ultra uses a hybrid “LatentMoE” design: interleaved Mamba-2 state-space layers and MoE layers, with select attention layers, plus Multi-Token Prediction (MTP) heads with a shared-weight design. The MTP heads enable native speculative decoding — the model drafts its own next tokens — which is a big part of why NVIDIA can claim the throughput numbers it does. (If the phrase “speculative decoding” is new to you, we broke down why it matters in why local LLMs got good in mid-2026.)
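To make the draft-then-verify idea concrete, here is a minimal Python sketch of greedy speculative decoding. It is a generic illustration, not NVIDIA's MTP implementation: `draft_next` and `target_next` are hypothetical toy functions standing in for the cheap drafter (the role the MTP heads play) and the full model, and a real engine would verify all drafted tokens in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1) Draft k tokens with the cheap model.
    ctx = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify: keep drafted tokens while they match the target's greedy choice.
    ctx = list(prefix)
    accepted = []
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx: List[int]) -> int:
    # Toy "full model" (hypothetical): counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    # Toy drafter (hypothetical): agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next, k=3))  # [2, 3, 4, 5]
print(speculative_step([3], draft_next, target_next, k=3))  # [4, 5]
```

The output always matches what the target model alone would have produced. The gain is that several tokens can be confirmed per expensive verification step whenever the drafter guesses well.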
The NVFP4 trick — and why it doesn’t save you
The interesting engineering story is NVFP4. NVIDIA quantized the model to its 4-bit floating-point format for weights, activations, and gradients, keeping a few sensitive layers (latent projections, MTP heads, QKV/attention projections, embeddings) in BF16 or MXFP8 for stability. The clever part: the same NVFP4 checkpoint runs on Ampere, Hopper, and Blackwell GPUs thanks to specialized quantization kernels. One file, three architectures.
NVFP4 makes the model dramatically smaller than its BF16 form — but “smaller” is relative when you start at 550 billion parameters. The NVFP4 checkpoint is roughly 275GB. For comparison, the BF16 cache lands around 1.1–1.7TB depending on configuration.
275GB is the number that ends the home-lab dream. To put it in perspective:
- A used RTX 3090 (24GB) or RTX 4090 (24GB) gives you 24GB each.
- You’d need roughly 12× RTX 3090 just to hold the NVFP4 weights — before any KV cache for that 1M context.
- NVIDIA’s own recommended deployment is 8× H100 80GB (640GB total, comfortably above 275GB) or 8× H200 SXM5 (1,128GB total).
No single consumer card, and no realistic stack of them, runs the full Ultra at a sane speed. This is the same wall we hit with Kimi K2.7 Code and GLM 5.2: the open-weights frontier has moved decisively past 24GB consumer hardware.
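The arithmetic behind those numbers is easy to reproduce. The sketch below treats every weight as exactly 4 bits (or 16 for BF16), which is a simplification: it ignores quantization scale factors, the layers kept in BF16 or MXFP8, and all KV cache, so read the results as rough lower bounds rather than measured sizes.

```python
import math


def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, assuming a flat bit width for every weight."""
    return total_params * bits_per_weight / 8 / 1e9


def cards_needed(footprint_gb: float, vram_per_card_gb: float) -> int:
    """Minimum number of cards whose combined VRAM holds the weights alone."""
    return math.ceil(footprint_gb / vram_per_card_gb)


ULTRA_PARAMS = 550e9

nvfp4_gb = weight_footprint_gb(ULTRA_PARAMS, 4)   # 275.0
bf16_gb = weight_footprint_gb(ULTRA_PARAMS, 16)   # 1100.0

print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB")
print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
print(f"24GB cards for NVFP4: {cards_needed(nvfp4_gb, 24)}")      # 12
print(f"8x H100 80GB total:   {8 * 80} GB")                       # 640
print(f"8x H200 141GB total:  {8 * 141} GB")                      # 1128
```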
The GGUF / CPU-offload path (for the stubborn)
If you absolutely must run Ultra on hardware you own, the community route is Unsloth’s dynamic GGUF quants run through llama.cpp with CPU offload. Here’s the memory reality from Unsloth’s own guide:
| Quant | Approx. memory needed | Notes |
|---|---|---|
| Dynamic 1-bit (UD) | ~189GB disk | Smallest; surprising accuracy retention |
| 3-bit (UD-IQ3_XXS) | ~256GB RAM | Unsloth’s recommended balance |
| 4-bit | ~300GB RAM | |
| 8-bit | ~600GB RAM | |
So the cheapest “it technically runs” build is a workstation with 256GB of DDR5 running the 3-bit quant, with the attention layers and as many expert layers as fit offloaded to a 24GB GPU. Because only 55B of the 550B parameters are active per token, the compute per token is closer to a 55B dense model than a 550B one, and that is what makes CPU offload even thinkable. But every token still has to pull its active experts out of a pool of hundreds of gigabytes sitting in system RAM, so don’t expect speed. NVIDIA’s own llama.cpp reference shows around 40 tokens/second on 4× B200, which is datacenter Blackwell silicon. On a DDR5 CPU build you’re realistically looking at single-digit-to-low-teens tok/s once prefill on long prompts is factored in, the same ballpark we measured for 1T-class MoE models on big-RAM rigs.
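A back-of-envelope way to see where that range comes from: decode on a CPU-offload build is mostly limited by how fast weights can be read from RAM. The sketch below assumes each token reads roughly the active fraction of the quantized model (55B of 550B, so 10% of the ~256GB 3-bit quant) and divides memory bandwidth by that. The bandwidth figures are illustrative theoretical peaks, not measurements, and the estimate ignores GPU offload, prefill, and cache effects.

```python
def bandwidth_bound_tok_per_s(quant_size_gb: float, active_fraction: float,
                              mem_bandwidth_gb_s: float) -> float:
    """Upper-bound decode speed if reading active weights from RAM is the bottleneck."""
    gb_read_per_token = quant_size_gb * active_fraction
    return mem_bandwidth_gb_s / gb_read_per_token


QUANT_SIZE_GB = 256          # 3-bit (UD-IQ3_XXS) figure from the table above
ACTIVE_FRACTION = 55 / 550   # 55B active out of 550B total

# Assumed peak bandwidths (hypothetical round numbers for illustration):
for label, bandwidth in [("dual-channel desktop DDR5", 100),
                         ("8-channel workstation DDR5", 300)]:
    tps = bandwidth_bound_tok_per_s(QUANT_SIZE_GB, ACTIVE_FRACTION, bandwidth)
    print(f"{label}: ~{tps:.1f} tok/s at best")
```

Under these assumptions the ceiling lands at roughly 4 tok/s on a desktop platform and around 12 tok/s on an 8-channel workstation, which is consistent with the single-digit-to-low-teens range above.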
A 256GB DDR5 workstation is roughly a $3,500–$4,500 build in mid-2026 — and that’s before the DDR5 price surge that’s still squeezing home builds (we tracked it in the DDR5/SSD price guide). Compared to the API, the math rarely favors building. Which brings us to the real recommendation.
What you should actually run at home
The good news is that Nemotron 3 is a family, and the two smaller members are built for exactly the hardware most readers have.
Nemotron 3 Nano 30B-A3B — the single-GPU pick
This is the one to run. The Nano is a 30B-total / 3B-active MoE that lands at ~20.7GB at Q4_K_M, which fits on any 24GB card — an RTX 3090, 3090 Ti, or 4090. NVIDIA reports the Nano hitting roughly 3.3× the throughput of Qwen3-30B-A3B on identical hardware (a single H200), and on a 24GB consumer card you can expect the same 30B-MoE-class speeds we’ve measured elsewhere — think the 100+ tok/s range that Nemotron-Cascade 2 hit on an RTX 3090. For everyday reasoning, coding, and agent loops, the Nano gives you NVIDIA’s training quality without the 275GB problem.
# Pull and run the Nano locally via Ollama
ollama pull nemotron-3-nano
ollama run nemotron-3-nano "Refactor this function and explain the change."
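If you want to drive the same model from a script or an agent loop instead of the CLI, Ollama also exposes a local HTTP API. The following is a minimal sketch using only the Python standard library; it assumes Ollama is running on its default port (11434) and that the model tag matches the one pulled above. The helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "nemotron-3-nano") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "nemotron-3-nano", timeout: float = 300.0) -> str:
    """Send the prompt and return the completion text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Refactor this function and explain the change.")` returns the model's reply as a string. Setting `stream` to false keeps the example simple; an interactive tool would more likely stream tokens as they arrive.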
Nemotron 3 Super 120B-A12B — the multi-GPU step-up
The Super is a 120B-total / 12B-active MoE. At 4-bit it needs roughly 60–80GB, which puts it beyond a single 24GB card and into multi-GPU or workstation-card territory (think 2× 24GB GPUs with offload, an A100 80GB, or an RTX PRO 6000 96GB). It’s the right pick if you’ve outgrown the Nano’s quality but can’t justify a datacenter node for Ultra.
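A quick way to check those options is to subtract available VRAM from the 60–80GB range and see what is left to offload. The sketch below does only that; it ignores KV cache and runtime overhead, so real headroom will be tighter than it suggests.

```python
def offload_gb(model_gb: float, vram_gb: float) -> float:
    """Weights that do not fit in VRAM and must live in system RAM instead."""
    return max(0.0, model_gb - vram_gb)


SUPER_4BIT_RANGE_GB = (60, 80)  # range quoted above for the Super at 4-bit

configs = {
    "2x 24GB GPUs": 48,
    "A100 80GB": 80,
    "RTX PRO 6000 96GB": 96,
}

for name, vram in configs.items():
    low = offload_gb(SUPER_4BIT_RANGE_GB[0], vram)
    high = offload_gb(SUPER_4BIT_RANGE_GB[1], vram)
    print(f"{name}: {low:.0f}-{high:.0f} GB left to offload")
```

Only the dual-24GB setup leaves weights outside VRAM (12 to 32GB), which is why it is listed with offload while the 80GB and 96GB cards are not.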
Ultra — through the API or a rented GPU
For the actual 550B brain, use it the way it’s meant to be used at this scale: managed. Nemotron 3 Ultra is available on Ollama’s cloud (nemotron-3-ultra:cloud), through NVIDIA’s NIM endpoints, and on third-party hosts. Artificial Analysis estimates a blended cost around $0.52 per million tokens with output speed near 140 tokens/second, and clocked 300+ output tok/s on a pre-release DeepInfra endpoint. If you need to own the inference for privacy reasons, renting an 8×H100 node by the hour on RunPod is far cheaper than buying one — the same rent-vs-buy logic from our RunPod vs local GPU breakdown.
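The build-vs-API comparison can be put in numbers with the figures already quoted in this article: a roughly $4,000 workstation, a blended API price of about $0.52 per million tokens, and around 10 tok/s on the local build. The sketch below is deliberately crude. It ignores electricity, resale value, and the fact that a blended price mixes input and output tokens, and the 10 tok/s figure is an assumed midpoint of the range given earlier.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def break_even_tokens(build_cost_usd: float, api_usd_per_million: float) -> float:
    """Tokens you would have to generate before the build matches API spend."""
    return build_cost_usd / api_usd_per_million * 1e6


def years_to_break_even(build_cost_usd: float, api_usd_per_million: float,
                        local_tok_per_s: float) -> float:
    """Years of nonstop local generation needed to reach the break-even token count."""
    tokens = break_even_tokens(build_cost_usd, api_usd_per_million)
    return tokens / local_tok_per_s / SECONDS_PER_YEAR


tokens = break_even_tokens(4000, 0.52)
years = years_to_break_even(4000, 0.52, local_tok_per_s=10)
print(f"Break-even volume: ~{tokens / 1e9:.1f} billion tokens")
print(f"At 10 tok/s, running 24/7: ~{years:.0f} years")
```

Under these assumptions the build only pays for itself after several billion tokens, which at CPU-offload speeds means decades of continuous generation. Data residency, not cost, is the argument for owning the hardware.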
Is it actually good? The benchmark reality
Throughput claims are easy to make; quality is the question. On the Artificial Analysis Intelligence Index, Nemotron 3 Ultra scores 47.7 — well ahead of the next-strongest US open-weights models (Gemma 4 31B at 39.2, Nemotron 3 Super at 36.0), but still behind the Chinese-led open frontier (Kimi K2.6 at 53.9). So it’s the strongest US open model, not the strongest open model overall.
Where it clearly wins is throughput-per-dollar on long agentic runs. NVIDIA’s measured figures at an 8k-input / 64k-output setting show Ultra delivering:
- 5.9× the throughput of GLM-5.1-754B-A40B
- 4.8× the throughput of Kimi-K2.6-1T-A32B
- 1.6× the throughput of Qwen-3.5-397B-A17B
Combined with native speculative decoding and the MTP heads, that’s the real pitch: Ultra is built to be fast and cheap across the hundreds of tool calls an agent makes, even if a Chinese model edges it on a one-shot reasoning score. NVIDIA claims up to 30% lower cost for agentic workloads as a result. For a fuller picture of how it stacks against the models you can actually run locally, see our open-source LLM consumer-GPU shootout.
The honest verdict
Nemotron 3 Ultra is an impressive, genuinely open model — and almost entirely irrelevant to what sits under your desk. At 275GB in its smallest native format, it belongs on an 8-GPU datacenter node, and the CPU-offload GGUF path, while real, is a slow $4,000 science project that the API beats on every axis except data residency.
If you came here to find the GPU that runs Nemotron 3 Ultra, the truthful answer is: none of the ones you’d buy for a home lab. Run Nemotron 3 Nano 30B-A3B on a 24GB card for daily work, step up to Super 120B-A12B if you have the multi-GPU headroom, and treat Ultra as an API call for the jobs that need the big brain. That keeps your money in a card you’ll actually use and your hardest agentic runs on silicon that can keep up.
FAQ
Can I run Nemotron 3 Ultra on an RTX 4090 or 5090? No. Even the NVFP4 checkpoint is ~275GB and the smallest 1-bit GGUF is ~189GB on disk. A single 24GB or 32GB card can’t hold a meaningful fraction of it. Run Nemotron 3 Nano (~20.7GB Q4_K_M) instead, or use the Ultra API.
What’s the cheapest hardware that runs Ultra locally at all? A workstation with ~256GB of DDR5 RAM running Unsloth’s 3-bit (UD-IQ3_XXS) GGUF with a 24GB GPU for offload — roughly a $3,500–$4,500 build. Expect single-digit-to-low-teens tokens/second. For most people the API is cheaper and far faster.
What is NVFP4 and does it help home users? NVFP4 is NVIDIA’s 4-bit floating-point format. It shrinks Ultra from a ~1.1–1.7TB BF16 cache to ~275GB and runs the same checkpoint on Ampere, Hopper, and Blackwell. It helps datacenter deployment a lot; it doesn’t bring a 550B model within reach of consumer VRAM.
How does Nemotron 3 Ultra compare to Kimi K2.6 and Gemma 4? On the Artificial Analysis Intelligence Index, Ultra scores 47.7, behind Kimi K2.6 at 53.9 and ahead of Gemma 4 31B at 39.2. Ultra’s real edge is throughput on long agentic runs: NVIDIA measures 5.9× the throughput of GLM-5.1 at an 8k-input / 64k-output setting.
What license is Nemotron 3 Ultra under? OpenMDW-1.1, the Linux Foundation’s open-weights license. It releases the weights, the training datasets (including 173B tokens of code), and the recipes — more transparent than most models at this scale.
Sources
- NVIDIA Nemotron 3 Ultra — Ollama Blog
- NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents — NVIDIA Technical Blog
- Deploy NVIDIA Nemotron 3 Ultra on GPU Cloud: Self-Host the 550B Reasoning Model — Spheron Blog
- NVIDIA Nemotron 3 Ultra — How To Run Locally — Unsloth Documentation
- NVIDIA Nemotron 3 Ultra released: fast, intelligent, and open — Artificial Analysis
- NVIDIA Nemotron 3 Nano — How To Run Guide — Unsloth Documentation
- Nemotron 3 Nano 30B VRAM Requirements — canitrun.dev
- nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 — Hugging Face
- RTX 3090 Price Tracker US — bestvaluegpu.com
Last updated June 21, 2026. Prices and specs change; verify current rates before purchasing.