Jun 9, 2026

GPT-OSS 20B for local AI in 2026: 225 tok/s on RTX 4090, the 128k context trap, and which GPU you actually need

By RunAIHome Team · 12 min read

gpulocal-llmollamaopenaigpt-osshardware-guide2026

TL;DR: gpt-oss-20b is OpenAI’s first Apache 2.0 model and it fits on any 16 GB GPU — but only if you keep context under 8k. At 128k context, generation collapses to ~9 tok/s regardless of GPU. On an RTX 4090 with context capped at 8k, you get 225 tok/s. The 20B model is the one home-lab builders should pull; the 120B requires an H100.

	gpt-oss-20b	Gemma 4 12B	Qwen3 30B-A3B
Best for	Reasoning + tool use, OpenAI quality	Fast chat on budget hardware	Coding + multilingual on 24 GB
Min VRAM	16 GB (8k ctx)	8 GB	24 GB
RTX 4090 speed	225 tok/s	~400+ tok/s	~130 tok/s
The catch	128k context = 9 tok/s on consumer cards	Less agentic than gpt-oss	Needs 24 GB; bigger download

Honest take: If you have an RTX 3090 or better and want o3-mini-quality reasoning running locally for zero per-token cost, gpt-oss-20b is the easiest pull right now. Just set --ctx-size 8192 or you will wonder why your brand-new GPU is doing 9 tokens per second.

Why GPT-OSS is different from every model before it

Every major open-weight model family before August 2025 — Llama, Qwen, Mistral, Gemma — came from research labs that never charged for API access to their models. OpenAI did. When they released gpt-oss-120b and gpt-oss-20b on August 5, 2025 under the Apache 2.0 license, it was the first time you could pull an OpenAI-trained model, run it on your own hardware, and never send a request to their servers.

That matters for trust reasons (data stays local), cost reasons (no per-token bill), and latency reasons (no network hop). Whether the quality justifies the hardware cost depends on which GPU you own.

Architecture: 21 billion parameters, 3.6 billion at a time

Both gpt-oss models use a Mixture of Experts (MoE) Transformer. The 20B model has 21 billion total parameters organized into 128 expert sub-networks. For any given token, the router activates exactly 4 of those experts, touching only 3.6 billion parameters per token. The same approach appears in Qwen3-30B-A3B and Nemotron Cascade — but in gpt-oss, it’s paired with the reasoning post-training OpenAI uses for its o-series models.

Other architectural details from the model card:

Context: 128k tokens native (o200k_harmony tokenizer, same as GPT-4o)
Attention: grouped multi-query attention with group size 8
Positional encoding: RoPE
Quantization at training: MXFP4 post-training on MoE weights, which is why the 20B can run in 16 GB
Built-in tools: function calling, web browsing, Python execution — the same tool suite used in OpenAI’s API

The 3.6B active parameters explain the speed numbers: the router skips 94% of the weights per token, so memory bandwidth pressure stays low relative to a dense 20B model.

VRAM: what the model actually uses

The gpt-oss-20b model card reports 12.0 GB for model weights, 2.7 GB for compute buffers, and approximately 0.2 GB per 8,192 tokens of KV cache. That adds up to:

Context	Total VRAM needed
2k tokens	~15.3 GB
8k tokens	~15.5 GB
32k tokens	~16.5 GB
128k tokens	~21.7 GB

A 16 GB card sits right at the edge for 8k context — workable, not comfortable. A 24 GB card handles up to ~65k context before spilling. The RTX 5090’s 32 GB is the first consumer card that can run the full 128k context without offloading, though the speed penalty still exists (more on that below).

The Q4_K_M GGUF for local inference is 13.3 GB on disk and 12.91 GB downloaded. Pull it once with Ollama and you’re done.

Benchmark table: 8 GPUs, real numbers

These numbers are llama.cpp token generation benchmarks (tg128, Q4 quantization) from community testing as of August–September 2025. They represent sustained generation speed after the prompt has been processed.

GPU	VRAM	tok/s (tg128 Q4)	Can it run it?
RTX 5090	32 GB	282	Yes — full 128k headroom
RTX 4090	24 GB	225	Yes — comfortable to ~65k ctx
RTX 5070 Ti	16 GB	189	Yes — 8k context recommended
RTX 4080 SUPER	16 GB	186	Yes — 8k context recommended
RTX 3090	24 GB	161	Yes — comfortable to ~65k ctx
RTX 5060 Ti 16GB	16 GB	111	Yes — 8k context recommended
RX 7900 XT	20 GB	101	Yes — ROCm required
RTX 3060	12 GB	30–31	Partial (CPU offload required)

Source: llama.cpp community benchmark thread, Discussion #15396.

The RTX 3060 result comes with an asterisk: 12 GB is below the 15 GB practical minimum, so llama.cpp offloads the excess layers to system RAM over PCIe. The 30 tok/s you get is CPU-bound, not GPU-bound. If you have an RTX 3060, gpt-oss-20b will technically load and run, but you’re better served by Gemma 4 12B or Qwen3-8B.

Setup with Ollama

Ollama has a first-party gpt-oss model on its library. Two commands:

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

That downloads the MXFP4-optimized GGUF (~12.9 GB) and starts a chat session. Ollama auto-detects your GPU and loads as many layers as fit in VRAM.

The critical flag: Ollama defaults to 2048 context unless you tell it otherwise. For most sessions that’s fine. If you want to use the model’s full 128k context window, set it explicitly — but read the next section first.

For a persistent context setting, create a Modelfile:

cat > Modelfile <<'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 8192
EOF
ollama create gpt-oss-8k -f Modelfile
ollama run gpt-oss-8k

For 24 GB cards, 32768 context is reasonable without hitting the speed cliff:

PARAMETER num_ctx 32768

The 128k context trap

This is the number one complaint from users who pulled gpt-oss-20b on a 16 GB card and got confused.

What happens: Set context to 128k on an RTX 5060 Ti or RTX 4080 SUPER. Start generating. Speed drops to around 9 tok/s. Task Manager shows VRAM nearly empty.

Why it happens: The KV cache for 128k context (~20+ GB) doesn’t fit in 16 GB of VRAM. llama.cpp and Ollama fall back to system RAM for the KV cache, routing every attention lookup through PCIe instead of the GPU’s memory bus. The GPU sits idle waiting for data.

Fix: Cap context at 8k on 16 GB cards.

# In Ollama Modelfile:
PARAMETER num_ctx 8192

# Or in llama.cpp directly:
./llama-cli -m gpt-oss-20b.Q4_K_M.gguf --ctx-size 8192 -n 512

At 8k context on an RTX 3060 12 GB (with CPU offload), one community member reported going from 9 tok/s to 43 tok/s by setting this flag. The same fix applies proportionally on 16 GB cards that were seeing similar slowdowns at larger context values.

The table below gives practical context limits by card:

GPU VRAM	Safe context limit
12 GB	4k (offload mode)
16 GB	8k
24 GB	32k
32 GB	128k (native)

Setup with llama.cpp

If you want direct control, llama.cpp gives you more flags:

# Download the GGUF from Hugging Face
# (search: openai/gpt-oss-20b GGUF on HuggingFace)

./llama-server \
  -m gpt-oss-20b.Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 999 \
  --port 8080

--n-gpu-layers 999 tells llama.cpp to push as many layers as possible onto the GPU. On a 16 GB card this will load the full model at 8k context. Check the startup logs: if you see offloaded X/33 layers to GPU where X is less than 33, some layers are going to CPU.

For 24 GB cards, bump --ctx-size to 16384 or 32768 and you’ll still get full GPU utilization.

gpt-oss-120b: not for home labs

The 120B model has 117 billion total parameters and activates 5.1 billion per token — the same MoE trick, larger pool. OpenAI says it fits on a single 80 GB GPU (H100 or MI300X) in MXFP4 form.

At Q4_K_M quantization, the weight file alone is 72.7 GB. With compute buffers and any KV cache, you need 80+ GB of actual VRAM. That means:

✅ H100 80 GB (data center)
✅ AMD MI300X 192 GB
✅ Two RTX 5090s in NVLink (64 GB total — tight)
❌ Every consumer card below $5,000

If you want near-o4-mini reasoning locally and have access to data center hardware or a RunPod instance, gpt-oss-120b is worth it. For everyone else: gpt-oss-20b running at 225 tok/s on a single RTX 4090 is the practical choice.

For cloud testing at scale before committing to hardware, RunPod has H100 instances at competitive rates.

How gpt-oss-20b compares to other 20B-class models

Model	VRAM floor	RTX 3090 speed	Strengths
gpt-oss-20b	16 GB	161 tok/s	Reasoning, tool use, OpenAI training
Gemma 4 12B	8 GB	~280+ tok/s	Faster, runs on any 8 GB card
Qwen3-30B-A3B	24 GB	~130 tok/s	Coding, multilingual
Mistral Small 4	64 GB+	N/A (too large)	Stronger on long tasks

Gemma 4 12B wins on accessibility and raw tok/s. Qwen3-30B-A3B is the coding pick for 24 GB cards. gpt-oss-20b wins when you specifically want the o-series reasoning style — chain-of-thought that works, with native tool calling — and you have at least 16 GB.

The honest comparison: gpt-oss-20b is slower than Gemma 4 12B at identical hardware, but the reasoning quality gap on complex multi-step tasks is real. For quick Q&A, use Gemma. For code debugging, agent loops, or anything where you’d previously have reached for o3-mini, gpt-oss-20b holds its own.

GPU buying guide for this model

Already have a 16 GB card (RTX 5060 Ti / 4080 SUPER / 5070 Ti): You’re set. Use 8k context. Expect 111–189 tok/s.

Used RTX 3090 (24 GB, ~$600–1,050 on eBay, June 2026): The value play. 161 tok/s and 24 GB lets you stretch to 32k context. Our full RTX 3090 analysis is here.

RTX 4090 (~$2,300 used / $2,755 new, June 2026): 225 tok/s, 24 GB. Best single-card setup for gpt-oss-20b on consumer hardware.

RTX 5060 Ti 16GB (MSRP $429, current $499–569): The budget entry. 111 tok/s is fast enough for interactive use. Keep context at 8k and it’s a solid daily driver.

For comparison: cloud inference of o3-mini at OpenAI’s API rates would cost roughly $0.55–$1.10 per million output tokens. At 111 tok/s locally, you’re generating ~400k tokens per hour on an RTX 5060 Ti. That’s roughly $220–440/hour in cloud cost savings if you’re running it heavily. The hardware pays off fast if you’re a daily user.

FAQ

Does gpt-oss-20b support structured output / JSON mode? Yes. Tool use, function calling, JSON-mode structured outputs, and web search are all native capabilities from training. Support in Ollama for tool use was added in Ollama 0.6+. Use the /api/chat endpoint with a tools array in the request.

Can I run it on a Mac? Yes, via Ollama on Apple Silicon. MLX support exists in the community but isn’t first-party yet (as of June 2026). An M4 Pro Mac Mini (24 GB unified memory) will see similar speeds to an RTX 3090 — roughly 150–170 tok/s. An M4 Max (48 GB) can stretch to 64k context comfortably. See our Mac Mini M4 Pro local AI guide for the full breakdown.

Is the reasoning actually on-par with o3-mini? OpenAI’s model card says gpt-oss-20b matches o3-mini on their standard reasoning benchmarks. Third-party evals vary: community testing on coding tasks generally agrees it’s competitive at o3-mini level, which is meaningfully better than a standard dense 20B model. It won’t beat o4 or o3 on hard math, but it’s well ahead of base-instruct models of the same size.

Is Apache 2.0 really that different from Meta’s Llama license? Yes. Meta’s Llama license restricts commercial use above 700M monthly active users. Apache 2.0 has no such cap. You can embed gpt-oss-20b in a commercial product at any scale without additional licensing. This matters for companies, not so much for home-lab use — but it explains why it’s showing up in more production deployments than Llama alternatives.

Will gpt-oss-20b load on 12 GB VRAM? Technically yes, via CPU offloading. Practically: you’ll get 30–31 tok/s at best, and only at short context lengths. At that speed, you might as well run Gemma 4 12B natively on the 12 GB card at 3–4× the speed.

Sources

Prices and availability as of June 2026. GPU prices fluctuate; verify before purchasing.

Recommended Gear

NVIDIA RTX 4090 24GB — 225 tok/s, best single-card consumer option for gpt-oss-20b
NVIDIA RTX 3090 24GB — 161 tok/s, best value at $600–1,050 used
NVIDIA RTX 5060 Ti 16GB — 111 tok/s, $499–569, entry point for gpt-oss-20b

Was this article helpful?