NVIDIA Rubin CPX for Local AI Inference in 2026: What the New Context-Optimized Rubin GPU Means for Home Labs vs Consumer Cards
TL;DR: NVIDIA’s Rubin CPX is a 30-petaFLOP, 128GB GDDR7 inference chip built for enterprise-scale million-token context workloads, arriving in data-center rack configurations in H2 2026. It’s not consumer hardware, and no home-lab version has been announced. The consumer Rubin generation (likely an RTX 6090) isn’t expected until 2027–2028 at the earliest. An RTX 5060 Ti 16GB or used RTX 3090 is still the right call today.
| | Rubin CPX | RTX 5090 | RTX 5060 Ti 16GB |
|---|---|---|---|
| Best for | Enterprise prefill, 1M+ token context | Enthusiast local AI, 30B–70B models | Budget home lab, up to 30B models |
| VRAM | 128GB GDDR7 | 32GB GDDR7 | 16GB GDDR7 |
| Memory bandwidth | ~2 TB/s | 1.79 TB/s | 448 GB/s |
| Price (June 2026) | Enterprise rack only — no public price | ~$2,000 MSRP | ~$429–499 MSRP |
| The catch | Not for home labs; prefill-only in disaggregated deployments | 70B+ models spill into system RAM | 16GB caps you at 13B–27B models in practice |
Honest take: The Rubin CPX is a data-center signal, not a home-lab buying decision. The key thing it confirms: memory bandwidth governs local inference below 100K context, and your RTX 5060 Ti or used RTX 3090 is still the right hardware to buy in June 2026.
What the Rubin CPX Actually Is
NVIDIA announced the Rubin CPX in September 2025 as a specialized inference accelerator designed for the prefill phase of large-language-model inference — the step where the model processes your entire input prompt before generating its first output token.
The specs are unusual compared to anything in a consumer build: 30 petaFLOPS of NVFP4 compute, 128GB of GDDR7 memory, and approximately 2 TB/s of memory bandwidth. The bandwidth figure looks modest against the H100 SXM’s 3.35 TB/s of HBM3, but GDDR7 costs a fraction of HBM4 per gigabyte, which keeps 128GB affordable on a part that doesn’t need decode-class bandwidth. NVIDIA claims 3× faster attention processing compared to the GB300 NVL72, which is the performance gap that disaggregated inference is designed to exploit.
The chip ships inside the Vera Rubin NVL144 CPX rack: 144 Rubin CPX chips for prefill paired with 144 standard Rubin GPUs (HBM4, optimized for decode) and 36 Vera ARM CPUs. The complete rack delivers 8 exaFLOPS of AI compute across 100TB of fast memory. NVIDIA targets H2 2026 availability for enterprise customers. No consumer product has been announced, and no standalone pricing has been disclosed.
The Rubin CPX connects over PCIe Gen 6 — notably without NVLink — because its role in the rack is specialized enough that it doesn’t need chip-to-chip interconnect with the standard Rubin GPUs in the same way a symmetric multi-GPU cluster would.
The Disaggregated Architecture: Prefill vs. Decode
Understanding why the CPX exists requires separating the two distinct phases of LLM inference, because they demand completely different hardware.
Prefill is compute-bound. When you send a 100,000-token prompt, the model runs attention across every token before generating word one. Attention cost scales quadratically with context length, so a 1-million-token prompt needs roughly 10,000× more attention compute than a 10,000-token prompt (100× the tokens, squared). Raw FLOPS matter here; bandwidth matters far less, because each weight read from memory is reused across every token in the prompt.
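The quadratic claim is easy to check with a few lines of arithmetic. The sketch below counts only the token-to-token interactions in full causal attention; it ignores the feed-forward layers (which scale linearly) and any sparse-attention optimizations, and the function names are made up for this illustration.

```python
def attention_pairs(context_tokens: int) -> int:
    """Token-to-token interactions in full causal self-attention.

    Each token attends to itself and every earlier token, so the count
    is n * (n + 1) / 2, which grows quadratically with n.
    """
    return context_tokens * (context_tokens + 1) // 2


def attention_cost_ratio(long_ctx: int, short_ctx: int) -> float:
    """How many times more attention work the longer prompt needs."""
    return attention_pairs(long_ctx) / attention_pairs(short_ctx)


if __name__ == "__main__":
    ratio = attention_cost_ratio(1_000_000, 10_000)
    print(f"1M vs 10K tokens: ~{ratio:,.0f}x more attention work")
```

Growing the prompt 100× multiplies the attention work by close to 100², which is why long-context prefill rewards raw compute.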
Decode is bandwidth-bound. After the prefill completes, the model generates tokens one at a time. Every single output token requires reading the model weights and the full KV-cache (the accumulated context from all previous tokens) from memory. If your KV-cache spans 128K tokens, every new token forces a full read of that cache. At ~2 TB/s, the Rubin CPX is too slow for efficient decode at scale, which is exactly why it’s paired with HBM4-equipped standard Rubin GPUs in the same rack.
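A rough ceiling on decode speed follows from that description: if every token has to stream the weights and the KV-cache once, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below is a back-of-the-envelope model under that assumption, not a benchmark; the ~5GB figure for an 8B model at Q4_K_M is an assumed round number, and the function name is hypothetical.

```python
def decode_ceiling_tok_s(
    bandwidth_gb_s: float, weights_gb: float, kv_cache_gb: float = 0.0
) -> float:
    """Upper bound on decode tokens/sec for a memory-bound model.

    Assumes each generated token reads all weights and the whole
    KV-cache exactly once, with no compute or latency overhead.
    """
    gb_read_per_token = weights_gb + kv_cache_gb
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    ASSUMED_8B_Q4_GB = 5.0  # assumption: approximate size of an 8B Q4_K_M model
    cards = {  # memory bandwidth in GB/s, from the spec tables in this article
        "RTX 5060 Ti 16GB": 448,
        "Used RTX 3090": 936,
        "RTX 5090": 1790,
    }
    for name, bandwidth in cards.items():
        ceiling = decode_ceiling_tok_s(bandwidth, ASSUMED_8B_Q4_GB)
        print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

For the RTX 5060 Ti the ceiling lands near 90 tok/s, in the same range as the ~80 tok/s this article quotes for Llama 3.1 8B, which suggests the simple bandwidth model is a reasonable first approximation for models that fit in VRAM.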
Research on disaggregated inference (notably the Splitwise and DistServe papers) showed that this split pays off: Splitwise reports up to 1.4× higher throughput at 20% lower cost compared to running both phases on identical hardware. NVIDIA is formalizing what the research community proved: specialized hardware for each phase beats general-purpose hardware for both.
Home labs don’t run disaggregated inference. Ollama, llama.cpp, and vLLM on a single card handle both phases on the same chip. That’s not a problem at the scale home labs operate — but it does explain the performance ceiling you hit when a model partially fits in VRAM and you’re doing long-context inference.
Why Your RTX Slows Down at Long Context
The VRAM math is brutal, and worth knowing before you hit the wall.
An RTX 5090 has 32GB of GDDR7. Llama 3.3 70B at Q4_K_M quantization requires approximately 38–42GB of VRAM depending on context length and KV-cache size. The model doesn’t fit. Ollama and llama.cpp handle the overflow by keeping the layers that don’t fit in system RAM, typically executing them on the CPU. Anything that does cross the bus moves over PCIe Gen 5 x16 at ~64 GB/s per direction, which is roughly 28× slower than the RTX 5090’s GDDR7.
The result: on an RTX 5090 with Llama 3.3 70B Q4_K_M and VRAM overflow, you see 14–22 tok/s — reasonable for a single user, but nowhere near what the GPU could achieve on a model that fits cleanly.
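# Ollama's server (not the `ollama run` client) logs the GPU offload split, so run it in the foreground with any background Ollama service stopped:
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep offload
# In a second terminal, load the model to trigger those log lines:
ollama run llama3.3:70b-instruct-q4_K_M "Explain quantization"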
Expected output on a GPU that’s overflowing VRAM:
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/81 layers to GPU
llm_load_tensors: CPU model size = 23.34 GiB
Those lines mean only 16 of the model’s 81 layers live in VRAM; the rest sit in system RAM, and every generated token waits on them. The fix is either a model that fits your VRAM (try 30B Q4_K_M at ~19GB on the RTX 5090, which runs at 45+ tok/s) or a dual-GPU configuration. For 70B without any offloading, you need 48GB minimum: two RTX 3090s with the model split across them over PCIe, or Apple unified memory at 64GB+.
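Before downloading a model, it helps to run the fit check as plain arithmetic. The sketch below uses the model sizes quoted in this section and leaves the KV-cache at zero for a best-case answer; real usage also needs headroom for the cache and runtime buffers, so treat a narrow "fits" as a warning. The function name and dictionary layout are hypothetical.

```python
def vram_fit(model_gb: float, vram_gb: float, kv_cache_gb: float = 0.0) -> dict:
    """Report whether weights plus KV-cache fit in VRAM and how much spills over."""
    needed_gb = model_gb + kv_cache_gb
    overflow_gb = max(0.0, needed_gb - vram_gb)
    return {
        "needed_gb": needed_gb,
        "fits": overflow_gb == 0.0,
        "overflow_gb": overflow_gb,
    }


if __name__ == "__main__":
    # (model size in GB, total VRAM in GB), using sizes quoted in the article
    cases = {
        "70B Q4_K_M on RTX 5090 (32GB)": (42.0, 32.0),
        "30B Q4_K_M on RTX 5090 (32GB)": (19.0, 32.0),
        "70B Q4_K_M on 2x RTX 3090 (48GB)": (42.0, 48.0),
    }
    for label, (model_gb, vram_gb) in cases.items():
        result = vram_fit(model_gb, vram_gb)
        status = "fits" if result["fits"] else f"spills {result['overflow_gb']:.0f}GB"
        print(f"{label}: {status}")
```

Passing a non-zero `kv_cache_gb` shows how quickly long contexts eat the margin on the dual-3090 setup.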
The Rubin CPX’s 128GB GDDR7 solves this for enterprise inference clusters. It doesn’t solve it for your home lab, because the chip isn’t available to buy.
Spec Comparison: Rubin CPX vs. What’s Actually in Home Labs
| Spec | Rubin CPX | RTX 5090 | RTX 5060 Ti 16GB | Used RTX 3090 |
|---|---|---|---|---|
| Architecture | Rubin (post-Blackwell) | Blackwell | Blackwell | Ampere |
| VRAM | 128GB GDDR7 | 32GB GDDR7 | 16GB GDDR7 | 24GB GDDR6X |
| Memory bandwidth | ~2 TB/s | 1.79 TB/s | 448 GB/s | 936 GB/s |
| NVFP4 compute | 30 PFLOPS | ~3.4 PFLOPS (3,352 AI TOPS) | ~0.76 PFLOPS (759 AI TOPS) | N/A (Ampere) |
| PCIe interface | Gen 6 | Gen 5 | Gen 5 | Gen 4 |
| TDP | Not disclosed | ~575W | ~180W | 350W |
| Price (June 2026) | Enterprise only | ~$2,000 | ~$429–499 | ~$450–550 eBay |
| Home lab viable? | No | Yes | Yes | Yes |
The bandwidth numbers between the used RTX 3090 (936 GB/s) and RTX 5060 Ti 16GB (448 GB/s) are worth staring at. For decode-heavy single-user inference on models that fit cleanly in VRAM, bandwidth is the dominant variable — which is why the RTX 3090 still competes with much newer hardware on models in the 7B–24B range. That full picture is in Used RTX 3090 in 2026: Still the AI Value King?
The RTX 6090 Speculation: What’s Real
A die shot analysis of the Rubin CPX silicon revealed something unexpected: the chip contains graphics-specific hardware blocks (256 raster operations pipelines, or ROPs, and four display output pipes) that have no function in a pure AI inference accelerator. This prompted widespread speculation that the CPX die could serve as the foundation for a future RTX 6090.
NVIDIA hasn’t confirmed anything. Industry sources cited by Moore’s Law Is Dead noted the shipping Rubin CPX is “highly specialized for Prefill/Inference” and the consumer graphics pathway would require enabling those graphics blocks and shipping them with video output, which the current rack design doesn’t support.
The consumer timeline from Tom’s Hardware: NVIDIA has no new RTX gaming GPUs planned for 2026, with the RTX 60 series expected to debut in H2 2027 at the earliest, or more likely 2028 based on current roadmap reporting. If Rubin silicon does eventually power consumer cards with graphics fully enabled, that product could arrive with massive GDDR7 capacities — 128GB+ at what would presumably be a high-end workstation price. Architectural improvements on the Rubin generation that trickle down to consumer products would be significant.
But that’s 18–30 months away. Waiting for it means sacrificing two years of local AI productivity.
The One Case Where Rubin CPX Architecture Matters to Your Home Lab Today
If your workloads require 100K+ token context windows — processing full codebases for agentic coding, analyzing book-length documents, or running long-horizon multi-step agents — the disaggregated inference architecture the CPX represents tells you something actionable: this is the workload class that consumer hardware wasn’t designed for, and no consumer GPU released in 2026 closes that gap.
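At 100K context on a 70B model, the FP16 KV-cache alone requires roughly 33GB of VRAM: 2 bytes × 2 (K + V) × number of KV heads × head dimension × number of layers × context length. For Llama 3.3 70B (8 KV heads under grouped-query attention, head dimension 128, 80 layers), that works out to about 0.33MB per token. The cache alone exceeds the RTX 5090’s 32GB before the model weights even load.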
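The KV-cache formula can be evaluated directly. The sketch below plugs in Llama 3.3 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); those values come from the model's config rather than from this article, so treat them as assumptions, and the function name is hypothetical.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes = FP16
) -> int:
    """Size of the KV-cache: one K and one V vector per KV head, per layer, per token."""
    per_token = 2 * bytes_per_value * n_kv_heads * head_dim * n_layers
    return per_token * context_tokens


if __name__ == "__main__":
    # Assumed Llama 3.3 70B shape: 80 layers, 8 KV heads, head dimension 128.
    for context in (8_000, 32_000, 100_000):
        size_gb = kv_cache_bytes(context, n_layers=80, n_kv_heads=8, head_dim=128) / 1e9
        print(f"{context:>7,} tokens: {size_gb:.1f} GB of KV-cache at FP16")
```

With these inputs a 100K-token context needs about 32.8 GB at FP16. Setting `bytes_per_value=1` models an 8-bit KV-cache and halves the figure, which is the main lever a single-card setup has at long context.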
The practical solution for occasional long-context work is cloud inference. RunPod H100 SXM instances handle Llama 3.3 70B at 100K+ context cleanly at around $2.49/hr — for a few hours of heavy context work per week, that’s $20–40/month, far less than the cost of hardware that can do it locally. We covered the rent-vs-buy math in RunPod vs Local GPU: When to Rent vs When to Buy.
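The monthly figure is simple to reproduce and adapt to your own usage. This sketch takes the ~$2.49/hr rate quoted above and compares it against a hardware price you supply; the $2,000 used here is the RTX 5090 MSRP from this article, included only as an illustration, since a single 5090 cannot hold this workload. Function names are hypothetical.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def monthly_cloud_cost(rate_per_hour: float, hours_per_week: float) -> float:
    """Average monthly spend for a rented GPU used a few hours each week."""
    return rate_per_hour * hours_per_week * WEEKS_PER_MONTH


def months_to_break_even(
    hardware_price: float, rate_per_hour: float, hours_per_week: float
) -> float:
    """Months of renting that would add up to the hardware purchase price."""
    return hardware_price / monthly_cloud_cost(rate_per_hour, hours_per_week)


if __name__ == "__main__":
    RATE = 2.49  # USD per hour, the rate quoted in the article
    for hours in (2, 4):
        print(f"{hours} h/week: ${monthly_cloud_cost(RATE, hours):.2f}/month")
    # Illustrative only: $2,000 is the RTX 5090 MSRP, not a card that fits this job.
    print(f"Break-even vs $2,000: {months_to_break_even(2000, RATE, 4):.0f} months")
```

At two to four hours a week the bill stays in the low tens of dollars per month, so renting only loses to buying once usage becomes sustained.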
For everything under 32K context — coding assistants, chat, local agents, image generation — your consumer GPU is fine. Rubin CPX is solving a problem at the million-token scale that home lab workloads don’t have yet.
What to Actually Buy in June 2026
Nothing in the Rubin CPX story changes the consumer buying calculus for home AI. The enterprise chips are unavailable, and the consumer Rubin generation is years out.
For budget inference ($429–499): The RTX 5060 Ti 16GB runs Llama 3.1 8B Q4_K_M at ~80 tok/s and handles 13B models without offloading. It’s Blackwell architecture with full CUDA support and 448 GB/s of GDDR7 bandwidth. Full breakdown in RTX 5060 Ti vs RTX 4060 Ti for Local AI.
For serious 30B–70B work ($2,000): The RTX 5090 32GB is the ceiling of practical single-consumer-card local inference. Know the VRAM limits for 70B models before you buy — offloading at Q4_K_M is the reality unless you pair it with another card. Full analysis in RTX 5090 vs RTX 4090 for Local AI.
For value bandwidth ($450–550 used): A used RTX 3090 24GB from eBay delivers 936 GB/s of GDDR6X bandwidth and 24GB of VRAM. Models from 7B up to 24B run cleanly without offloading; the bandwidth advantage over the RTX 5060 Ti shows up clearly on larger quantized models. Power draw is 285–350W under full load — about $0.034–0.042/hr at $0.12/kWh — which adds up if you’re running it 24/7.
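The running-cost figures for the RTX 3090 follow from watts and your electricity rate. The sketch below reproduces them and extends the calculation to a month of continuous use; the 30-day month and the function names are assumptions of this example.

```python
def hourly_power_cost(watts: float, price_per_kwh: float) -> float:
    """Electricity cost in dollars for one hour at a constant power draw."""
    return watts / 1000 * price_per_kwh


def monthly_power_cost(
    watts: float, price_per_kwh: float, hours_per_day: float = 24.0, days: int = 30
) -> float:
    """Electricity cost over a month, assuming constant draw while running."""
    return hourly_power_cost(watts, price_per_kwh) * hours_per_day * days


if __name__ == "__main__":
    PRICE = 0.12  # USD per kWh, the rate used in the article
    for watts in (285, 350):
        hourly = hourly_power_cost(watts, PRICE)
        monthly = monthly_power_cost(watts, PRICE)
        print(f"{watts}W: ${hourly:.4f}/hr, ${monthly:.2f}/month if run 24/7")
```

Full load around the clock is a worst case; a card that idles most of the day will cost far less, so lower `hours_per_day` to match your actual duty cycle.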
FAQ
Can I buy a Rubin CPX for my home lab? No. The Rubin CPX ships exclusively in the Vera Rubin NVL144 CPX rack, targeting enterprise AI inference deployments. No standalone consumer version or pricing has been announced, and there’s no indication one is coming.
Will the Rubin CPX run Ollama or llama.cpp? Not in any practical sense. It’s a PCIe Gen 6 device designed to operate inside a rack alongside HBM4-equipped Rubin GPUs for disaggregated inference. Consumer inference frameworks don’t support this architecture, and the hardware isn’t available to individual buyers.
What is disaggregated inference and does it affect local AI? Disaggregated inference separates the prefill phase (processing your prompt — compute-bound) from the decode phase (generating tokens — bandwidth-bound) onto hardware specialized for each. Consumer frameworks like Ollama and llama.cpp run both phases on the same GPU. The performance benefits of disaggregation apply to cloud inference providers, not home labs.
Will there be an RTX 6090 based on Rubin? Die shot analysis found graphics hardware (ROPs, display engines) in the CPX die, fueling RTX 6090 speculation. NVIDIA hasn’t confirmed consumer plans. Tom’s Hardware reporting puts the RTX 60 series at H2 2027 or later.
How do I get faster tokens on a 70B model today without enterprise hardware? Keep as much of the model in VRAM as you can. On an RTX 5090, Q3_K_M quantization (~32GB) spills far less into system RAM than Q4_K_M (~42GB), though at that size it still won’t fit entirely once the KV-cache is added; for a clean fit, drop to a 30B model at Q4_K_M (~19GB). For occasional 70B+ inference at long context, RunPod H100 instances cost ~$2.49/hr and beat any single consumer card for that workload class.
Sources
- NVIDIA Unveils Rubin CPX: A New Class of GPU Designed for Massive-Context Inference — NVIDIA Newsroom
- NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads — NVIDIA Technical Blog
- NVIDIA Rubin CPX GPU to Feature 128GB GDDR7 Memory, Launches End of 2026 — VideoCardz
- Nvidia Rubin CPX Forms One Half of New “Disaggregated” AI Inference Architecture — Tom’s Hardware
- Nvidia Disaggregates Long-Context Inference to Drive Bang for the Buck — NextPlatform
- Nvidia Rubin CPX Die Shot Reveals Graphics-Specific Hardware Blocks Not Needed for an AI GPU — Tom’s Hardware
- Report Claims Nvidia Will Not Be Releasing Any New RTX Gaming GPUs in 2026 — Tom’s Hardware
- Local LLM Tokens-per-Second Benchmarks 2026 — Presenc AI
- A Deep Dive into NVIDIA Rubin CPX: History, Architecture, Splitwise/DistServe, Inference Economics, and Limitations — Chiplog
- Nvidia’s Context-Optimized Rubin CPX GPUs Were Inevitable — The Register
Last updated June 7, 2026. Prices and specs change; verify current rates before purchasing.