Jun 8, 2026

Nemotron-Cascade 2 for Local AI in 2026: 187 tok/s on RTX 3090 and What 30B Total / 3B Active Really Means for Your GPU

By RunAIHome Team · 13 min read

nvidianemotronlocal-llmgpuvramcoding-modelmoe

TL;DR: Nemotron-Cascade 2 30B-A3B is NVIDIA’s open MoE coding specialist: 3B active parameters per token means RTX 3090 generation speeds of 187 tok/s, while all 30B weights must fit in VRAM simultaneously. The 24GB floor is firm. 16GB cards technically run it, but at 10–11 tok/s it’s unusable for interactive coding.

	RTX 3090 24GB (used)	RTX 4090 24GB	RTX 5090 32GB
Best for	Best-value entry	Top single-GPU speed	NVFP4 + 32GB headroom
Quantization	IQ4_XS (18.2 GB)	Q4_K_M (Ollama default)	NVFP4 or Q4_K_M
Generation speed	187 tok/s	~196 tok/s	229 tok/s
Street price (Jun 2026)	~$1,050 used	~$2,300 used	$1,999+

Honest take: If you own a 24GB GPU and write code, Nemotron-Cascade 2 is the right model to be running right now. It beats Qwen 3.6 35B-A3B on LiveCodeBench by 12 points and is faster. On a used RTX 3090, IQ4_XS gives you 187 tok/s for around $1,050.

NVIDIA published Nemotron-Cascade 2 in March 2026 under open weights (arXiv 2603.19220). The training approach — Cascade Reinforcement Learning with multi-domain on-policy distillation — produced a model that wins gold medals at IMO 2025 (35 points) and IOI 2025 (439.3 points). The benchmark that got the local AI community paying attention: Nemotron-Cascade 2 beats Nemotron-3-Super-120B-A12B on math and coding despite requiring roughly 4× less VRAM. That’s the headline. But before running ollama pull, you need to understand the memory trap that comes with every MoE model in this class.

The memory math no one explains clearly

“30B total parameters, 3B active” sounds like great news. It is, but not in the way most people assume.

When Nemotron-Cascade 2 processes a token, its routing network selects roughly 3B parameters to activate. The remaining 27B do nothing for that token — idle, not useful. That’s what makes generation fast: compute scales with active parameters, and 3B-class compute on modern NVIDIA hardware is genuinely quick.

The constraint is that idle does not mean absent. Every expert layer in the model must be loaded into memory before the router can evaluate which ones to activate. The entire 30B parameter set is resident in VRAM at inference time. You cannot defer experts to system RAM without incurring a PCIe round-trip per token that kills throughput.

The practical result: Nemotron-Cascade 2 has the inference speed of a dense 8B model and the VRAM footprint of a dense 30B model. Same constraint applies to Qwen 3.6 35B-A3B and every other A3B MoE in this tier. What distinguishes Nemotron-Cascade 2 is where it puts the training budget it saves: into coding and math, not general knowledge. That’s a choice you either care about or you don’t.

VRAM requirements by quantization

Weights only, before KV cache:

Quantization	VRAM (weights)	Fits on	Notes
IQ4_XS	18.2 GB	RTX 3090, RTX 4090, RTX 5090	Verified 187 tok/s on RTX 3090
Q4_K_M	~24 GB	RTX 4090, RTX 5090	Ollama default; tight on RTX 3090
Q2_K	~16.9 GB	RTX 4060 Ti 16GB, RTX 5060 Ti	~10 tok/s; quality degraded
NVFP4	~14–16 GB	RTX 50-series only	229 tok/s on RTX 5090; RTX 40 unsupported
BF16 full	~63 GB	Dual H100 80GB	Not a consumer discussion

KV cache adds on top of model weights. At 8K context: ~1.5 GB. At 16K: ~3 GB. At 32K: ~6 GB. The IQ4_XS quant (18.2 GB) leaves the most headroom on 24GB cards — 5–6 GB free for KV cache, enough for 16K context without tuning. Q4_K_M pushes to ~24 GB, leaving less than 1 GB free on a 24GB card, which means Ollama will aggressively limit context to fit.

See the quantization quality tradeoffs guide for how much perplexity you give up going from Q4 to Q2 on models like this.

GPU compatibility by tier

RTX 3090 — 187 tok/s at IQ4_XS, the best $/tok deal right now

A used RTX 3090 runs around $1,050 on eBay as of June 2026 (typical range $900–$1,200). That’s the cheapest path to Nemotron-Cascade 2 at full quality.

The benchmark is concrete: 187 tok/s with IQ4_XS quantization, tested at 625K context, posted in the official NVIDIA model discussion thread on Hugging Face. IQ4_XS weighs 18.2 GB, leaving 5–6 GB of VRAM clear for KV cache. At 16K context — enough for roughly 12,000 lines of code — you stay well within that headroom.

The RTX 3090’s 936 GB/s bandwidth does not bottleneck this model at IQ4_XS. Generation speed at 187 tok/s already exceeds comfortable reading pace. The only real drawback is power: the 3090 draws ~285W under full LLM load, which works out to $0.050/hour at the 17.65¢/kWh US average. Over a full 8-hour coding day, that’s about $0.40. Full RTX 3090 value analysis here.

One practical note: the Ollama default for this model is Q4_K_M (~24 GB). On a 24GB card, that’s tight. Pull the IQ4_XS variant explicitly (see the setup section below) for more comfortable headroom and the verified benchmark speeds.

RTX 4090 — Q4_K_M out of the box, ~196 tok/s

RTX 4090 (~$2,300 used) runs the Ollama default without any quant selection. The model reports 24 GB in Ollama and loads cleanly into 24GB VRAM because Ollama manages the context window to avoid overflow.

Tested generation speed: approximately 196 tok/s — around 5% faster than RTX 3090 at IQ4_XS, driven by the 4090’s 1,008 GB/s bandwidth vs the 3090’s 936 GB/s. The gap widens at longer context windows where the KV cache actively stresses bandwidth.

For agentic coding workflows with 32K–64K context windows, the RTX 4090 handles the load without needing to adjust flags. The 3090 requires explicitly using IQ4_XS and capping context. If you’re running automated agents that spawn many parallel sessions, that extra VRAM headroom is meaningful.

NVFP4 is not available for RTX 40-series on this model’s current quantizations — the NVFP4 variant targets Blackwell (RTX 50-series) only. For RTX 4090, Q4_K_M or IQ4_XS are the practical formats. Full RTX 5090 vs RTX 4090 comparison here.

RTX 5090 — 229 tok/s with NVFP4, 32GB headroom

RTX 5090 ($1,999+) provides two advantages over 24GB cards: 32GB of GDDR7 VRAM and Blackwell’s native NVFP4 support.

The HuggingFace benchmark for Nemotron-Cascade 2 NVFP4 on RTX 5090 shows 229.52 tok/s in text generation (tg128 mode) — 22% faster than the RTX 4090 at Q4_K_M. More practically, 32GB means you can load Q4_K_M (~24 GB) and still have 8 GB free for KV cache, enabling 32K+ context without any configuration adjustments. Ollama just works.

A practical caution: as of early June 2026, community reports indicate vLLM has unresolved compatibility issues with NVFP4 on sm12x (RTX 5090 Blackwell) for this specific model. Ollama with Q4_K_M is fully stable. If NVFP4 matters for your use case, check the vLLM issue tracker before switching. Details on NVFP4 formats and RTX 50-series support here.

16GB cards — the hard wall

Every current 16GB consumer GPU (RTX 4060 Ti, RTX 5060 Ti 16GB, RTX 5070, RTX 5080, RX 9070 XT) hits the same constraint: the Q4_K_M quant needs ~24 GB, and 16 GB < 24 GB.

Q2_K (~16.9 GB) is the workaround. On RTX 4060 Ti 16GB: approximately 10–11 tok/s decode speed with a time-to-first-token around 17 seconds. That’s not a typo — the speed drops from 187 tok/s on IQ4_XS to under 11 tok/s on Q2_K. The MoE routing computation doesn’t rescue you from VRAM pressure; the model still needs all 30B parameters loaded.

For 16GB cards, Qwen 3.6 27B dense is the honest recommendation. It fits at Q4_K_M (~16 GB), scores 77.2% on SWE-bench Verified, and runs at 80+ tok/s on a 16GB card. You don’t need a 24GB card to get excellent local coding performance; you just can’t run this specific model well on 16GB.

Mac with 24GB+ unified memory

Mac Mini M4 Pro ($1,399, 24GB) runs IQ4_XS via llama.cpp’s Metal backend. Expect 18–25 tok/s — fast enough for interactive use, not competitive with RTX cards for batch inference or long-running agentic tasks. The unified memory architecture means the GPU and CPU share the same bandwidth pool, which is actually efficient for MoE routing: no PCIe bottleneck when loading expert parameters.

The 48GB Mac Mini M4 Pro ($1,599) handles Q4_K_M comfortably with 24 GB leftover for KV cache — a cleaner experience than squeezing Q4_K_M into a 24GB NVIDIA card. For whisper-quiet home use where throughput isn’t the priority, it’s a legitimate choice.

How to run it

Ollama (simplest start)

ollama run nemotron-cascade-2

Pulls the Q4_K_M quant (~24 GB) with 256K context. Works correctly on RTX 4090 and RTX 5090. On RTX 3090, Ollama will load Q4_K_M but context will be limited to prevent overflow.

For IQ4_XS (the faster, recommended quant on RTX 3090):

ollama pull mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS
ollama run mradermacher/Nemotron-Cascade-2-30B-A3B-i1-GGUF:IQ4_XS

To push context to 131K (requires ~15 GB KV cache headroom — RTX 5090 or Mac 96GB+):

OLLAMA_NUM_CTX=131072 ollama run nemotron-cascade-2

llama.cpp (more control, slightly faster on RTX)

./llama-cli \
  -m Nemotron-Cascade-2-30B-A3B-IQ4_XS.gguf \
  -ngl 99 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -c 16384 \
  --temp 0.6 \
  -p "You are a coding assistant. Help with the following:"

-ngl 99 offloads all layers to GPU. --cache-type-k q8_0 halves KV cache memory usage. With flash attention, 16K context on RTX 3090 with IQ4_XS is comfortable.

Expected timing output on RTX 3090:

llama_print_timings:      eval time =  5342.12 ms /  1000 tokens (  5.34 ms per token,   187.23 tokens per second)

For vLLM-based inference (OpenAI-compatible endpoint, good for multi-user setups):

python -m vllm.entrypoints.openai.api_server \
  --model nvidia/Nemotron-Cascade-2-30B-A3B \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90

Note the vLLM NVFP4 compatibility caveat for RTX 5090 mentioned above — use Q4 formats until that’s resolved. For multi-user inference tradeoffs, see vLLM vs Ollama: when each one wins.

Open WebUI with Ollama backend

After ollama run nemotron-cascade-2, the model appears automatically in Open WebUI’s model dropdown. No additional configuration for basic use. For coding-assistant system prompts, you can set a custom system message in Open WebUI’s model settings — the model responds well to role-specific instructions.

Benchmark context: what the scores actually mean

From the official paper (arXiv 2603.19220, NVIDIA Research, March 2026):

Benchmark	Cascade 2 30B-A3B	Qwen3.5-35B-A3B	Delta
LiveCodeBench v6 (2408–2505)	87.2	74.6	+12.6 Cascade 2
LiveCodeBenchPro 25Q2 Medium	27.6	17.8	+9.8 Cascade 2
MMLU-Pro	79.8	85.3	+5.5 Qwen3.5
GPQA-Diamond	76.1	84.2	+8.1 Qwen3.5

The pattern is straightforward: Nemotron-Cascade 2 trades general knowledge coverage to win on coding and math. For daily software development, LiveCodeBench is the more relevant benchmark — it tests completion of real coding problems, not trivia recall. A 12-point edge on LiveCodeBench v6 is a meaningful difference in day-to-day use.

If your primary use is document Q&A, summarization, research assistance, or general conversation, Qwen3.5-35B-A3B scores higher and runs on the same hardware. If you write code, Cascade 2 is the right tool.

Both models are direct competitors to each other in the same VRAM tier. You can’t run both on a single 24GB card simultaneously — pick the one that matches your workload.

Running in cloud vs buying hardware

If you need Nemotron-Cascade 2 for an occasional large batch task — reviewing a full codebase, running an overnight agent pipeline — RunPod rents A40 or similar hardware for roughly $0.44–$0.59/hour, with the model available via template. A 4-hour batch job costs $2–3.

The math tilts toward owning once you’re running the model more than 8–10 hours per month. A used RTX 3090 at $500 amortizes over 24 months at ~$21/month in capital cost, plus $0.034/hour in electricity. An equivalent RunPod session at $0.50/hour runs $15 for 30 hours of use. Past that threshold, the hardware pays for itself.

Honest take

Nemotron-Cascade 2 is the strongest coding model available for local 24GB hardware in June 2026. The 187 tok/s on RTX 3090 at IQ4_XS is a verified community benchmark, not estimated. The LiveCodeBench advantage over Qwen3.5 is 12 points — large enough to notice in practice, not just on paper.

The 24GB requirement is real and non-negotiable. There is no quantization trick that runs this model acceptably on a 16GB card. If you have 16GB, Qwen 3.6 27B dense is your coding model.

If you have 24GB — RTX 3090 used at $500 or RTX 4090 at $1,600 — and you write code, this should replace whatever you’re running now.

FAQ

Can I run this on an RTX 4070 Ti Super (16GB)? Technically yes with Q2_K at ~16.9 GB, but decode speed drops to ~10–11 tok/s with 17+ second time-to-first-token. That’s painful for interactive use. Qwen 3.6 27B at Q4_K_M on the same hardware runs 80+ tok/s. Use the 27B instead.

What’s the difference between Nemotron-Cascade 2 and Nemotron-3-Nano? Different training pipelines on the same 30B-A3B architecture. Nemotron-3-Nano was the prior release. Cascade 2 adds the Cascade RL post-training stage, substantially improving coding and math scores. If you ran Nano before, Cascade 2 is a meaningful upgrade worth the re-download.

Does it support tool use and agentic workflows? Yes. NVIDIA designed it for agentic use. Ollama exposes a standard /api/chat tool-calling interface; OpenClaw agent framework (ollama launch openclaw --model nemotron-cascade-2) is officially supported. The model handles multi-turn tool call sequences reliably, which is a common failure point for smaller coding models.

Why does Ollama say 24GB if Q4_K_M is 24.5GB? Ollama reports the rounded quantized weight size. The actual VRAM allocation after loading depends on what context the model initializes with. On a 24GB card, Ollama dynamically limits context to stay within hardware bounds. IQ4_XS at 18.2 GB gives you 5–6 GB of breathing room, which is why the benchmark speeds are higher — the card isn’t memory-pressure constrained.

Is the NVFP4 quantization stable on RTX 5090? Community reports as of early June 2026 show vLLM compatibility issues with NVFP4 on RTX 5090 (sm12x Blackwell). Ollama with Q4_K_M is fully stable on RTX 5090. Check the vLLM issue tracker for current status before relying on NVFP4 in a production pipeline.

Sources

Last updated June 8, 2026. Hardware prices and model availability change; verify current rates before purchasing.

Recommended Gear

Was this article helpful?