Jun 25, 2026

NVIDIA Cosmos 3 Nano for Local AI in 2026: 16B Omnimodel, BF16-Only, and Whether Your Consumer RTX Can Actually Run It

By RunAIHome Team · 12 min read

nvidia · cosmos · physical-ai · gpu · vram · local-ai

TL;DR: Cosmos 3 Nano is genuinely open (OpenMDW-1.1, commercial use allowed) and “only” 16B parameters, which makes it sound like a 24GB-card model. It isn’t. NVIDIA ships it BF16-only — no FP8, no FP4, no GGUF — so the weights alone are ~32GB, and the recommended card is a 96GB RTX PRO 6000. Your RTX 3090 can’t load it; an RTX 5090 holds the weights with almost nothing left for the work.

	Used RTX 3090 24GB	RTX 5090 32GB	RTX PRO 6000 96GB
Runs Cosmos 3 Nano?	No — weights don’t fit	Technically, with no headroom	Yes — NVIDIA’s recommended card
VRAM vs ~32GB BF16 weights	24GB (short by 8GB)	32GB (exactly the weights)	96GB (room for context + video frames)
Price (Jun 2026)	~$1,070 used	~$4,000+	~$8,000–$13,250
Best for	LLMs, not Cosmos	Edge case only	The intended use case

Honest take: Don’t buy hardware for Cosmos 3 Nano right now. Rent a workstation GPU by the hour to evaluate it, keep your local rig on Qwen3.6 and Gemma 4 for everyday LLM work, and wait for community quantization before committing $8,000+ to a card.

What Cosmos 3 Nano actually is

Cosmos 3 is NVIDIA’s first fully open “omnimodel” for Physical AI — a single model that reasons about the physical world and generates the pixels, audio, and robot actions that follow from that reasoning. It was released on May 31, 2026 and announced at GTC Taipei on June 1, 2026, with the full technical report following on June 22, 2026.

“Omnimodel” is doing real work in that sentence. Most local models you run today take text in and emit text out. Cosmos 3 natively understands and generates text, images, video, ambient audio, and action sequences in one set of weights. That’s the whole point: instead of stitching a vision-language model to a separate video diffuser to a separate policy network, you get one model that reasons before it acts.

The architecture behind this is a two-tower Mixture-of-Transformers (MoT). Each Cosmos 3 model is split into two equally sized transformers that work together:

A Reasoner that understands the scene, plans, and emits the structured representation that guides what comes next.
A Generator that turns that plan into actual pixels, waveforms, or action tokens.

For Cosmos 3 Nano, that’s 8B reasoner + 8B generator = 16B total. The larger sibling, Cosmos 3 Super, is 64B (32B + 32B) and is built for datacenter Hopper/Blackwell deployment, not home labs. There are also task-specialized checkpoints — Super Text2Image, Super Image2Video, and a Nano-Policy-DROID world-action model fine-tuned on the DROID robotics dataset.

So “Nano” here means “the small one in a family aimed at robots and autonomous machines,” not “the one that fits a gaming GPU.” That distinction is the entire hardware story.

The number that breaks the consumer-GPU dream: BF16-only

Here’s where the “16B, so it fits 24GB” intuition falls apart.

When you run a 16B language model locally, you almost always run it quantized. A 16B LLM at Q4_K_M is roughly 9–10GB and drops comfortably onto a 12GB card. That math is so ingrained that the obvious assumption — and the one the headline begs you to make — is that Cosmos 3 Nano lands on a 24GB RTX 3090 with room to spare.

It doesn’t, because Cosmos 3 Nano is tested and supported in BF16 only. The maintained vLLM deployment path explicitly notes that other precisions — FP4, FP8, even FP16 — are not officially supported for the model. There is no official GGUF, no Q4_K_M, no llama.cpp or Ollama path. As of late June 2026, the only sanctioned way to run it is the full BF16 checkpoint through vllm-omni.

BF16 means 2 bytes per parameter. 16 billion parameters × 2 bytes ≈ 32GB just for the weights — before you add the KV cache, before activation memory, and before the substantial buffers an omnimodel needs to actually generate video frames and audio. That’s why NVIDIA’s own recommendation for single-GPU Nano inference is the RTX PRO 6000 Blackwell with 96GB of VRAM — not because the weights need 96GB, but because generating world-model output on top of a 32GB weight footprint needs serious headroom.
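
The arithmetic above is easy to re-run for other precisions. This is a rough sketch only: the `BYTES_PER_PARAM` table and the `weight_gb` helper are names made up for illustration, not part of any NVIDIA or vLLM tooling. It counts weights only (no KV cache, activations, or generation buffers) and uses decimal gigabytes, as the rest of this article does.

```python
# Back-of-envelope weight footprint: parameters x bytes per parameter.
# Hypothetical helper for illustration; weights only, decimal GB (1e9 bytes).

BYTES_PER_PARAM = {
    "bf16": 2.0,  # the only precision NVIDIA supports for Cosmos 3 Nano
    "fp8": 1.0,   # not officially supported; shown for comparison only
    "int4": 0.5,  # idealized 4-bit; real GGUF quants carry extra overhead
}


def weight_gb(params: float, precision: str) -> float:
    """Return the approximate size of the weights in decimal gigabytes."""
    return params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    nano_params = 16e9  # 8B reasoner + 8B generator
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: ~{weight_gb(nano_params, precision):.0f} GB")
```

Only the BF16 row describes something you can run today. The other two rows show why a quantized release would change the picture.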

The official test hardware listed for Cosmos 3 is GB200 and H100, with Ampere, Hopper, and Blackwell as the supported microarchitecture families. Notice what’s missing: no mention of GeForce cards as a target.

Card-by-card: what can and can’t run it

Used RTX 3090 / 3090 Ti / 4090 (24GB) — Can’t load it. The BF16 weights are ~32GB and there’s no quantized path to shrink them. This is the painful one, because the used RTX 3090 is still the value king for local LLMs at around $1,070 — but Cosmos 3 Nano is the rare 2026 model where 24GB simply isn’t enough and you can’t quantize your way out.

RTX 5090 (32GB) — Technically holds the weights, practically marginal. 32GB of GDDR7 matches the ~32GB BF16 footprint almost exactly, which leaves essentially nothing for the KV cache, activation memory, or the frame buffers an omnimodel needs to generate video. You might coax a short text-or-image reasoning request through it; sustained world-generation or action rollout is where you’ll hit out-of-memory. At ~$4,000+ street price in June 2026, it’s an expensive way to be perpetually one frame from a CUDA OOM. (If you’re weighing a 5090 for general AI work, our RTX 5090 vs RTX 4090 breakdown covers the everyday-LLM case, which is a much better fit for that card.)

RTX PRO 6000 Blackwell (96GB) — The intended card. 96GB of GDDR7 gives the ~32GB of weights plenty of room for context and generation buffers, and NVIDIA explicitly recommends it for single-GPU Nano deployment. The catch is the price: it launched around $8,565, and NVIDIA’s own listing has since climbed to $13,250, an increase of roughly 55% driven by the same GDDR7 supply crunch hitting the whole stack. We went deep on this card in our RTX PRO 6000 Blackwell local-AI guide; for Cosmos 3 Nano specifically, it’s the only single workstation GPU that NVIDIA recommends for running the model in its supported BF16 form.

Cosmos 3 Super (64B) — Out of scope for home labs entirely. It’s a datacenter model: the maintained recipe runs it with --tensor-parallel-size 4 --enable-layerwise-offload across four GPUs. If you’re asking whether your tower can run Super, the answer is no.
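
The single-GPU Nano verdicts above come down to one subtraction, sketched here. The VRAM figures are the nominal ones quoted in this article, while the `headroom_gb` and `verdict` helpers and their wording are hypothetical. Usable VRAM is somewhat lower than nominal once the driver and CUDA context take their share, so treat the results as optimistic.

```python
# Headroom left after loading ~32 GB of BF16 weights on each card.
# VRAM sizes come from the article; the verdict wording is illustrative.

WEIGHTS_GB = 32  # 16B parameters x 2 bytes (BF16)

CARDS_GB = {
    "RTX 3090": 24,
    "RTX 5090": 32,
    "RTX PRO 6000 Blackwell": 96,
}


def headroom_gb(vram_gb: int, weights_gb: int = WEIGHTS_GB) -> int:
    """VRAM remaining for KV cache, activations, and generation buffers."""
    return vram_gb - weights_gb


def verdict(vram_gb: int) -> str:
    """Summarize whether the weights fit and how much room is left over."""
    spare = headroom_gb(vram_gb)
    if spare < 0:
        return "weights do not fit"
    if spare == 0:
        return "weights fit, no headroom"
    return f"{spare} GB of headroom"


if __name__ == "__main__":
    for name, vram in CARDS_GB.items():
        print(f"{name}: {verdict(vram)}")
```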

How you’d actually deploy Nano

The maintained path is a single Docker image. NVIDIA ships vllm/vllm-omni:cosmos3, and a single-GPU Nano launch looks like this:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
    --omni \
    --model-class-name Cosmos3OmniDiffusersPipeline \
    --allowed-local-media-path / \
    --port 8000 \
    --init-timeout 1800

A few things worth flagging before you copy-paste:

--tensor-parallel-size is absent from the command, so it stays at vLLM’s default of 1. That is correct for Nano, which runs on a single GPU with no model parallelism, provided the card has enough VRAM. On Super you’d set it to 4.
--init-timeout 1800 (30 minutes) isn’t paranoia. The Nano download plus CUDA dependencies run into tens of GiB, and first-load initialization is slow. Budget the disk and the wait.
No --enable-layerwise-offload for Nano. That flag exists for Super on memory-constrained datacenter GPUs. If you find yourself reaching for offload flags to squeeze Nano onto a smaller card, you’re fighting the model — there’s no supported low-VRAM config today.
It’s vllm, not Ollama or llama.cpp. Because there’s no GGUF, the friendly local-runner ecosystem doesn’t apply yet. You’re running a vLLM server and talking to it over an OpenAI-compatible API on port 8000.
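
Once the container is up, a quick readiness check from Python might look like the sketch below. It assumes the server exposes the usual vLLM OpenAI-compatible `/v1/models` route on port 8000. The vllm-omni image may behave differently, so check its documentation before relying on this. The helper names (`models_url`, `parse_model_ids`, `list_models`) are made up for illustration.

```python
# Minimal readiness check against the server started by the docker command.
# Assumption: the standard vLLM OpenAI-compatible route /v1/models exists;
# the vllm-omni image may differ, so treat this as a sketch.
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a base such as http://localhost:8000."""
    return base_url.rstrip("/") + "/v1/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style {"data": [{"id": ...}]} body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:8000",
                timeout: float = 10.0) -> list[str]:
    """Query the server and return the IDs of the models it is serving."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_models()` before initialization finishes raises a connection error, which is expected given how slow the first load is. Retry until it returns a list.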

If you don’t already have a self-hosting workflow for containerized model servers, the aifoss.dev self-hosting guides are a good companion for the Docker-and-reverse-proxy side of this.

The license is the genuinely good news

Cosmos 3 ships under OpenMDW-1.1, and this is where it earns the “open” label. The license permits commercial use — you can build a product on top of it — with an attribution requirement (“Built on NVIDIA Cosmos”). That’s a meaningfully more permissive position than a lot of “open weights” releases that quietly forbid commercial deployment.

One nuance worth getting right: OpenMDW-1.1 is not Apache 2.0. If you’re used to grabbing Apache-licensed models like Qwen and Devstral and dropping them into a product without a second thought, read the attribution and use terms before you ship. For a home lab tinkering with robot-arm control or a custom sensor pipeline, none of this matters. For a startup planning to commercialize a Cosmos-based product, the difference between “permissive with attribution” and “Apache 2.0” is worth a lawyer’s afternoon.

Why a home-labber would care at all

If you only run text LLMs, you can skip Cosmos 3 — your RTX 3090 and Qwen3.6 will serve you better and cheaper. But Physical AI is a different itch, and Cosmos 3 Nano is the first time an omnimodel this capable has been open and commercially usable:

Home automation that reasons about video. A model that natively understands a camera feed and plans an action is a different beast from a classifier that just labels frames.
Robot-arm and visuomotor control. The Nano-Policy-DROID checkpoint is fine-tuned for exactly this — world-action prediction on a real robotics dataset.
Synthetic data generation. NVIDIA’s pitch is that Cosmos collapses physical-AI training and evaluation cycles “from months to days” by generating photorealistic, physics-accurate scenarios on demand.

The point of caring now, even if you don’t buy the card, is that this is the on-ramp. The model is open, the deployment path is documented, and the community will likely start producing quantized checkpoints. When a working FP8 checkpoint or community GGUF lands, the 32GB wall drops toward ~16GB and suddenly a 24GB card is back in play.

The cheap way to actually try it

You do not need to spend $8,000 to find out whether Cosmos 3 Nano is useful to you. Rent the card.

A workstation-class GPU on RunPod lets you spin up the exact hardware NVIDIA tested on, run the vllm-omni container, and pay by the hour. An RTX PRO 6000 96GB instance runs about $2.09/hr, and an H100 about $2.89/hr (verified May 2026). At $2.09/hr, you can evaluate Cosmos 3 Nano for a full 8-hour day for under $17 — versus $8,000+ to own the card, or $13,250 at NVIDIA’s current listing.

The break-even math is brutal for buying: even at the optimistic $8,000 purchase price, you’d need roughly 3,800 hours of RTX PRO 6000 time before owning beats renting at $2.09/hr — and that ignores the electricity and the GDDR7 price risk. Unless you’re running Physical AI workloads daily as a business, renting is the rational call right now. Our full rent-vs-buy framework walks through where that line actually sits for different usage patterns.
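
To rerun the break-even with your own prices, a small sketch follows. The function names are hypothetical, the inputs are the figures quoted above, and the model is deliberately crude: it ignores electricity, resale value, and any future change in rental or card prices.

```python
# Rent-vs-buy break-even using the figures quoted in this article.
# Ignores electricity, resale value, and price changes (all simplifications).


def break_even_hours(purchase_usd: float, rental_usd_per_hour: float) -> float:
    """Hours of rental that cost as much as buying the card outright."""
    return purchase_usd / rental_usd_per_hour


def rental_cost(hours: float, rental_usd_per_hour: float) -> float:
    """Total rental cost for a given number of hours."""
    return hours * rental_usd_per_hour


if __name__ == "__main__":
    rate = 2.09  # RTX PRO 6000 96GB hourly rate quoted above
    print(f"8-hour day: ${rental_cost(8, rate):.2f}")
    for price in (8_000, 13_250):
        print(f"${price:,}: {break_even_hours(price, rate):,.0f} hours")
```

With these inputs the break-even is about 3,828 hours at $8,000 and about 6,340 hours at the $13,250 listing.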

FAQ

Can I run Cosmos 3 Nano on an RTX 3090 or 4090? No. Both are 24GB cards and the BF16 weights are ~32GB with no officially supported quantized version. There’s no flag or trick that loads it on 24GB today.

Is there a GGUF or quantized version for Ollama/LM Studio? Not officially, as of late June 2026. NVIDIA tests and supports BF16 only; FP4, FP8, and FP16 are explicitly unsupported for the model. Watch the community for quantized checkpoints — that’s the development that would change the hardware picture most.

What’s the minimum VRAM, really? ~32GB just to hold the weights in BF16. NVIDIA recommends a 96GB RTX PRO 6000 so there’s headroom for context and the video/audio generation buffers an omnimodel needs. A 32GB RTX 5090 holds the weights but leaves almost nothing for the actual work.

How is Cosmos 3 Nano different from a normal local LLM? It’s an omnimodel — one set of weights that reasons about the physical world and generates images, video, audio, and robot actions, using a two-tower reasoner+generator architecture. A standard LLM only handles text in and text out.

What’s the difference between Nano and Super? Nano is 16B (8B + 8B) and targets a single workstation GPU. Super is 64B (32B + 32B), targets the datacenter, and the recipe runs it across four GPUs with layerwise offload. Super is not a home-lab model.

Is the license actually free for commercial use? OpenMDW-1.1 permits commercial use with a “Built on NVIDIA Cosmos” attribution requirement. It’s permissive but it is not Apache 2.0 — read the terms before commercializing a product on top of it.

Last updated June 25, 2026. Prices and specs change; verify current rates before purchasing. Cosmos 3 Nano support is BF16-only as of this writing — a community quantized checkpoint would materially change the consumer-GPU math.

Recommended Gear

RTX PRO 6000 Blackwell 96GB — the only single workstation card NVIDIA recommends for Cosmos 3 Nano
RTX 5090 32GB — holds the weights but with no headroom; better suited to everyday LLM work
Used RTX 3090 24GB — can’t run Cosmos 3 Nano, but still the value pick for local LLMs
