Qwen3-30B-A3B Local AI Guide: 196 tok/s on One RTX 4090, and What MoE Means for Your GPU
The number that makes people assume they’ve misread something: an RTX 4090 running Qwen3-30B-A3B at up to 196 tokens per second. A 7B dense model on the same GPU benchmarks at around 135 tok/s. A dense 30B model wouldn’t even fit in 24GB VRAM at Q4 without extreme quantization. How does a 30B model go faster than a 7B? That’s the MoE architecture at work.
What’s happening is a Mixture-of-Experts (MoE) architecture that fundamentally changes how local inference scales — and unless you understand it, you’ll either write off this model as “too big for consumer hardware” or reach for it when the dense 32B is actually the smarter choice. This is the practical guide: how the architecture works, what VRAM you need, real tok/s numbers across GPU tiers, and exactly how to run it.
What “30B-A3B” Means, and Why Your Bandwidth Math Is Wrong
For dense LLMs, the bandwidth rule is simple: each token generated requires reading every parameter in the model from VRAM. A 30B dense model at Q4 needs to move roughly 15 GB of data through the GPU’s memory bus per token. That’s why a 30B model runs at roughly one-fifth the speed of a 7B on the same GPU.
Qwen3-30B-A3B breaks this rule with a sparse architecture. Inside the model are 128 separate “expert” networks. For each token, a lightweight routing layer examines the current hidden state and selects only 8 of those 128 experts to activate. Every other expert sits idle — its weights sit in VRAM but don’t touch the compute units.
The practical breakdown:
- Total parameters: 30.5 billion (all experts combined)
- Active parameters per token: 3.3 billion (8 of 128 experts)
- Effective inference cost per token: similar to a 4B–8B dense model
- VRAM footprint: still sized for the full 30.5B (everything must be loaded)
This is the MoE bargain: you pay the VRAM cost of a 30B model, but you get the generation speed of a model three to five times smaller. Quality sits somewhere between the two, shaped by the fact that during training, those 128 experts developed genuine specialization — routing pushes different types of reasoning to different expert subsets.
The full architecture: 48 transformer layers, 32 query attention heads with 4 KV heads (grouped query attention), 128 total experts with 8 activated per forward pass. Native context is 32,768 tokens, extended to 131,072 tokens with YaRN RoPE scaling. License is Apache 2.0, which means commercial use without royalties.
For comparison, Llama 3.3 70B — the other strong 24GB-runnable option — is a standard dense model where all 70B parameters load into VRAM and participate in every token. Fitting it in 24GB requires heavy quantization (Q3 or aggressive Q2), which costs noticeably more quality than Q4_K_M on a smaller model.
VRAM Requirements by Quantization Level
The “30B” label on the tin causes unnecessary hardware anxiety. At Q4_K_M, the weights sit at roughly 19 GB — comfortably inside a 24GB GPU with headroom for a reasonable KV cache.
| Quantization | File Size | Minimum VRAM | Fits On |
|---|---|---|---|
| Q4_K_M (default) | ~19 GB | 24 GB | RTX 4090, RTX 3090, RTX 4080 |
| Q5_K_M | ~22 GB | 24 GB | RTX 4090/3090; tight — reduces KV cache room |
| Q8_0 | ~31 GB | 40+ GB or dual 24GB | Used A6000 (48GB), Mac Studio 64GB+ |
| BF16 (full precision) | ~61 GB | 80 GB+ | H100, multi-GPU with NVLink |
A few cards that won’t work cleanly:
RTX 4060 Ti 16GB: Q4_K_M doesn’t fit (19 GB > 16 GB). You can use --n-gpu-layers in llama.cpp to keep the first N layers on GPU and offload the rest to system RAM — but the PCIe bottleneck between GPU and system RAM guts your tok/s, and you’ve lost the main reason to run this model over Qwen3-14B.
RTX 3060 12GB: Not viable at any useful quantization. The model needs 19 GB just for weights; the card has 12 GB. Full CPU offload would result in sub-5 tok/s performance, slower than Qwen3-8B running entirely on GPU.
Mac Studio M2/M3 with 64GB unified memory: Works cleanly at Q4_K_M (19 GB of 64 GB used) via MLX. Mac Studio 96GB has comfortable headroom for Q8 inference.
If you’re on 16GB, the right model is Qwen3-14B or Qwen3-8B, not this one.
Tokens Per Second Across GPU Tiers
Community benchmarks from April–May 2026 using Q4_K_M quantization in llama.cpp:
| GPU | VRAM | Memory Bandwidth | Qwen3-30B-A3B tok/s |
|---|---|---|---|
| RTX 4090 | 24 GB | 1,008 GB/s | 120–196 tok/s |
| RTX 3090 | 24 GB | 936 GB/s | ~73 tok/s |
| RTX 4060 Ti 16GB | 16 GB | 288 GB/s | Not recommended (CPU offload required; PCIe bottleneck kills throughput) |
| RTX 3060 12GB | 12 GB | 360 GB/s | Not viable (weights exceed VRAM by 7+ GB) |
The RTX 4090 range (120–196 tok/s) reflects variation across test conditions: different quant variants (Q4_K_M vs Unsloth UD-Q4_K_XL), context window sizes, and whether llama.cpp or Ollama is the inference backend. Ollama adds a Go server layer that typically costs 3–10% throughput compared to raw llama.cpp; the upper bound (196 tok/s) comes from optimized llama.cpp setups with modest context windows.
The RTX 3090 figure (73 tok/s) is lower than many expect given it’s only 7% slower than the RTX 4090 on memory bandwidth. MoE inference involves more irregular memory access patterns than dense models — the routing mechanism causes non-contiguous expert weight reads — which appears to amplify the sensitivity to GPU architecture differences beyond raw bandwidth numbers.
To put the speed in perspective: a dense Qwen3-32B model on an RTX 4090 runs substantially slower, because all 32B parameters load from VRAM for every token — at Q4_K_M the weights alone occupy ~19 GB, leaving very little KV cache headroom at 24 GB. Community benchmarks consistently report the MoE 30B-A3B running 3–5× faster than the dense 32B on the same GPU, at the cost of around 2–3 points on standard benchmarks.
Quality Benchmarks: The Honest Numbers
Qwen3-30B-A3B vs Qwen3-32B dense — the direct matchup that matters for 24GB GPU owners:
| Benchmark | Qwen3-30B-A3B | Qwen3-32B (dense) |
|---|---|---|
| MMLU | 81.38 | ~83 |
| Arena Hard | 91.0% | 93.8% |
| AIME 2024 | 80.4% | 81.4% |
| AIME 2025 | 70.9% | ~72% |
The gap is real but narrow: the dense 32B holds a 1–3 point advantage across benchmarks. Both models blow past Llama 3.3 70B on math — Llama 3.3 70B scores MATH 77.0% (MATH benchmark) and has a strong 88.4% on HumanEval, but its reasoning under AIME-style competition math is significantly weaker than either Qwen3 variant. Qwen3’s training methodology produces genuinely better mathematical reasoning at comparable or smaller VRAM footprints.
In everyday chat and coding tasks you won’t notice the 2–3 point quality gap between the two Qwen3 models. In structured math problems or complex multi-step reasoning, the dense 32B will occasionally produce a more complete chain of reasoning. Whether that marginal accuracy gain is worth 3–5× slower generation is a use-case question, answered below.
Thinking Mode: /think and /no_think
Qwen3-30B-A3B includes a built-in thinking mode switch — you don’t need a separate model file. Add /think anywhere in a prompt to activate extended chain-of-thought reasoning. The model generates its reasoning inside <think>...</think> tags before producing the final response. Add /no_think to turn it off within the same session.
When thinking mode helps:
- Multi-step math problems where you’d spot-check the intermediate steps
- Code debugging with non-obvious logic errors
- Planning tasks where explicit tradeoff reasoning is valuable
- Anything you’d hand to a smart person and ask them to “walk me through your reasoning”
When to leave it off:
- General chat and summarization (adds latency with no quality gain)
- Simple code generation from a clear specification
- Creative writing and brainstorming
- RAG retrieval and document Q&A
The speed cost of thinking mode is real. On an RTX 4090, a complex math problem generates several hundred to several thousand thinking tokens inside the <think> block before the answer appears — adding noticeable latency compared to non-thinking mode. For interactive chat, this feels slow. For a workflow that runs overnight batch tasks, it’s fine.
The most practical local configuration: add /no_think to your system prompt so thinking is disabled by default. Then explicitly add /think in user-facing prompts when you want the deep reasoning path. This keeps latency low for routine queries without losing the capability.
How to Run Qwen3-30B-A3B Locally
Ollama — simplest setup:
ollama run qwen3:30b-a3b
This pulls the Q4_K_M quantized version (~19 GB download). Ollama manages context and GPU layers automatically.
For a higher-precision variant:
ollama run qwen3:30b-a3b-q8_0
One important caveat: as of May 2026, there’s a documented Ollama performance regression affecting Qwen3 models introduced between versions 0.15.5 and 0.15.6. Some users on RTX 3090 reported tok/s dropping from ~35 to ~12 after updating. If your generation feels unexpectedly slow, test with llama.cpp directly to isolate whether Ollama is the bottleneck. The issue is tracked in GitHub issue #14740.
llama.cpp — best raw throughput:
Download a GGUF from Hugging Face (Bartowski’s quantized uploads are the most actively maintained for Qwen3). Then:
./llama-server \
-m qwen3-30b-a3b-q4_k_m.gguf \
-ngl 99 \
-c 8192 \
--port 11434
-ngl 99 offloads all 48 layers to GPU. If you’re partially CPU-offloading on a 16GB card, drop this to whatever layer count fits your VRAM (trial-and-error: start at 30, increase until you hit an OOM).
LM Studio and Jan.ai: both support standard GGUF download. Search “Qwen3-30B-A3B” and select any Q4_K_M variant from Bartowski or Unsloth’s uploads. Jan.ai reports 17.5 GB VRAM usage for their optimized UD-Q4_K_M variant (Unsloth Dynamic quant), which leaves more headroom for context than a standard Q4_K_M.
For a deeper comparison of Ollama vs llama.cpp performance at different concurrency levels, see vLLM vs Ollama: When Each One Wins.
MoE 30B-A3B or Dense 32B? The Decision Matrix
| Your situation | Better choice |
|---|---|
| RTX 4090, prioritize fastest responses | Qwen3-30B-A3B MoE |
| RTX 4090, prioritize best benchmark accuracy | Qwen3-32B dense |
| RTX 3090, single-user chat | Qwen3-30B-A3B (~73 tok/s vs ~33 tok/s for 32B dense) |
| RTX 3090, heavy math/coding with thinking mode | Qwen3-32B (slightly higher accuracy, slower is acceptable) |
| Serving multiple users from one GPU | Qwen3-30B-A3B (higher throughput per user) |
| 16GB GPU | Qwen3-14B (skip both 24GB models) |
| Mac Studio 64GB+ | Qwen3-30B-A3B at Q4 (clean fit via MLX) |
| Mac Studio 96GB+ | Either; consider Qwen3-32B Q8 for near-lossless quality |
The short version: if you run interactive local chat or a local API where response latency matters, the MoE model’s 3–5× throughput advantage is the deciding factor. If you’re batch-processing tasks overnight where latency doesn’t matter, the dense 32B’s slightly higher accuracy is worth it.
Honest Take
Qwen3-30B-A3B is the most practically useful model for 24GB local inference in mid-2026. The combination of fast generation, built-in thinking mode, 131K context (with YaRN), Apache 2.0 licensing, and strong multilingual support covers the typical home lab use case better than anything in the same VRAM tier.
The real constraint is the hard floor at 24GB VRAM. If you’re on a 16GB card, this model is not a downgrade path — the CPU offload required eliminates the speed advantage that makes it interesting. Qwen3-14B is the right answer there, running at full GPU speed on a 16GB card at Q4_K_M. For the RTX 4060 Ti 16GB, see the comparison with the RTX 3090 to decide whether a 24GB upgrade makes sense for your budget.
One quantization warning worth taking seriously: community tests show that MoE models like the 30B-A3B are more sensitive to low-bit quantization than dense models. Q4_K_M produces good output; Q3 and below show quality degradation faster than you’d expect from a comparable dense model. Stick to Q4_K_M or higher.
If you’re comparing cloud API costs against running this model locally, the math in Llama 3.3 70B at Home: Real Hardware Cost vs Cloud API translates directly to Qwen3-30B-A3B — substitute the faster tok/s and lower per-token inference cost, and the break-even point shifts favorably toward local hardware even sooner.
1V1 PLAYBOOK · LOCAL LLM
Cut your local AI bill from $400/month cloud GPU to $47/month at home.
4-path hardware decision table, Ollama cold-start fix, Cursor/Claude Code routing configs, full 24-month TCO calculator.
Get it for $19 (early bird) →Sources
- Qwen3 30B-A3B on Ollama Library — model tags and architecture specs
- How to Run Qwen3 on Ollama: All Sizes, Thinking Mode and Hardware Guide — Serverman
- Qwen3-30B-A3B MMLU and benchmark scores — LLM Stats
- Qwen3 30B-A3B vs Qwen3 32B Comparison — Galaxy.ai
- Best Local LLMs for Consumer Hardware 2026: Llama 3.3 70B vs Qwen3 30B-A3B — PromptZone
- Qwen3 Technical Report — ArXiv 2505.09388
- Qwen3 235B and 30B MoE Quant Benchmarking Roundup — ubergarm/GitHub Gist
- Qwen3 30B-A3B vs QWQ-32B Performance Analysis — Novita AI
- Ollama Qwen3 performance regression — GitHub Issue #14740
- Home GPU LLM Leaderboard: Best Open Source Models by VRAM Tier — Awesome Agents
- Llama 3.3 70B Instruct benchmarks: MATH 77.0%, HumanEval 88.4% — LLM Stats
- Local LLM Tokens/Sec: Real Benchmarks for RTX 4090, 3090, 4060 Ti, 3060 — Mustafa.net
- GPU memory bandwidth specifications — NVIDIA Product Pages
Last updated May 24, 2026. Model performance varies by inference backend, quantization, and hardware configuration; verify benchmarks against your specific setup before purchasing hardware.
Recommended Gear
The hardware mentioned in this guide, with current prices on Amazon (affiliate links — at no extra cost to you, purchases help support this site):
Was this article helpful?
Thanks for the feedback — it helps improve future articles.
Need hands-on help?
I offer 1-on-1 technical consulting for local AI setup, GPU selection, and AI coding tool configuration — same topics covered on this site.
Book a session — $49 / hour →