Qwen3-30B-A3B Local AI Guide: 196 tok/s on One RTX 4090, and What MoE Means for Your GPU

qwen3moelocal-airtx-4090benchmarktokens-per-secondollamathinking-mode

The number that makes people assume they’ve misread something: an RTX 4090 running Qwen3-30B-A3B at up to 196 tokens per second. A 7B dense model on the same GPU benchmarks at around 135 tok/s. A dense 30B model wouldn’t even fit in 24GB VRAM at Q4 without extreme quantization. How does a 30B model go faster than a 7B? That’s the MoE architecture at work.

What’s happening is a Mixture-of-Experts (MoE) architecture that fundamentally changes how local inference scales — and unless you understand it, you’ll either write off this model as “too big for consumer hardware” or reach for it when the dense 32B is actually the smarter choice. This is the practical guide: how the architecture works, what VRAM you need, real tok/s numbers across GPU tiers, and exactly how to run it.

What “30B-A3B” Means, and Why Your Bandwidth Math Is Wrong

For dense LLMs, the bandwidth rule is simple: each token generated requires reading every parameter in the model from VRAM. A 30B dense model at Q4 needs to move roughly 15 GB of data through the GPU’s memory bus per token. That’s why a 30B model runs at roughly one-fifth the speed of a 7B on the same GPU.

Qwen3-30B-A3B breaks this rule with a sparse architecture. Inside the model are 128 separate “expert” networks. For each token, a lightweight routing layer examines the current hidden state and selects only 8 of those 128 experts to activate. Every other expert sits idle — its weights sit in VRAM but don’t touch the compute units.

The practical breakdown:

  • Total parameters: 30.5 billion (all experts combined)
  • Active parameters per token: 3.3 billion (8 of 128 experts)
  • Effective inference cost per token: similar to a 4B–8B dense model
  • VRAM footprint: still sized for the full 30.5B (everything must be loaded)

This is the MoE bargain: you pay the VRAM cost of a 30B model, but you get the generation speed of a model three to five times smaller. Quality sits somewhere between the two, shaped by the fact that during training, those 128 experts developed genuine specialization — routing pushes different types of reasoning to different expert subsets.

The full architecture: 48 transformer layers, 32 query attention heads with 4 KV heads (grouped query attention), 128 total experts with 8 activated per forward pass. Native context is 32,768 tokens, extended to 131,072 tokens with YaRN RoPE scaling. License is Apache 2.0, which means commercial use without royalties.

For comparison, Llama 3.3 70B — the other strong 24GB-runnable option — is a standard dense model where all 70B parameters load into VRAM and participate in every token. Fitting it in 24GB requires heavy quantization (Q3 or aggressive Q2), which costs noticeably more quality than Q4_K_M on a smaller model.

VRAM Requirements by Quantization Level

The “30B” label on the tin causes unnecessary hardware anxiety. At Q4_K_M, the weights sit at roughly 19 GB — comfortably inside a 24GB GPU with headroom for a reasonable KV cache.

QuantizationFile SizeMinimum VRAMFits On
Q4_K_M (default)~19 GB24 GBRTX 4090, RTX 3090, RTX 4080
Q5_K_M~22 GB24 GBRTX 4090/3090; tight — reduces KV cache room
Q8_0~31 GB40+ GB or dual 24GBUsed A6000 (48GB), Mac Studio 64GB+
BF16 (full precision)~61 GB80 GB+H100, multi-GPU with NVLink

A few cards that won’t work cleanly:

RTX 4060 Ti 16GB: Q4_K_M doesn’t fit (19 GB > 16 GB). You can use --n-gpu-layers in llama.cpp to keep the first N layers on GPU and offload the rest to system RAM — but the PCIe bottleneck between GPU and system RAM guts your tok/s, and you’ve lost the main reason to run this model over Qwen3-14B.

RTX 3060 12GB: Not viable at any useful quantization. The model needs 19 GB just for weights; the card has 12 GB. Full CPU offload would result in sub-5 tok/s performance, slower than Qwen3-8B running entirely on GPU.

Mac Studio M2/M3 with 64GB unified memory: Works cleanly at Q4_K_M (19 GB of 64 GB used) via MLX. Mac Studio 96GB has comfortable headroom for Q8 inference.

If you’re on 16GB, the right model is Qwen3-14B or Qwen3-8B, not this one.

Tokens Per Second Across GPU Tiers

Community benchmarks from April–May 2026 using Q4_K_M quantization in llama.cpp:

GPUVRAMMemory BandwidthQwen3-30B-A3B tok/s
RTX 409024 GB1,008 GB/s120–196 tok/s
RTX 309024 GB936 GB/s~73 tok/s
RTX 4060 Ti 16GB16 GB288 GB/sNot recommended (CPU offload required; PCIe bottleneck kills throughput)
RTX 3060 12GB12 GB360 GB/sNot viable (weights exceed VRAM by 7+ GB)

The RTX 4090 range (120–196 tok/s) reflects variation across test conditions: different quant variants (Q4_K_M vs Unsloth UD-Q4_K_XL), context window sizes, and whether llama.cpp or Ollama is the inference backend. Ollama adds a Go server layer that typically costs 3–10% throughput compared to raw llama.cpp; the upper bound (196 tok/s) comes from optimized llama.cpp setups with modest context windows.

The RTX 3090 figure (73 tok/s) is lower than many expect given it’s only 7% slower than the RTX 4090 on memory bandwidth. MoE inference involves more irregular memory access patterns than dense models — the routing mechanism causes non-contiguous expert weight reads — which appears to amplify the sensitivity to GPU architecture differences beyond raw bandwidth numbers.

To put the speed in perspective: a dense Qwen3-32B model on an RTX 4090 runs substantially slower, because all 32B parameters load from VRAM for every token — at Q4_K_M the weights alone occupy ~19 GB, leaving very little KV cache headroom at 24 GB. Community benchmarks consistently report the MoE 30B-A3B running 3–5× faster than the dense 32B on the same GPU, at the cost of around 2–3 points on standard benchmarks.

Quality Benchmarks: The Honest Numbers

Qwen3-30B-A3B vs Qwen3-32B dense — the direct matchup that matters for 24GB GPU owners:

BenchmarkQwen3-30B-A3BQwen3-32B (dense)
MMLU81.38~83
Arena Hard91.0%93.8%
AIME 202480.4%81.4%
AIME 202570.9%~72%

The gap is real but narrow: the dense 32B holds a 1–3 point advantage across benchmarks. Both models blow past Llama 3.3 70B on math — Llama 3.3 70B scores MATH 77.0% (MATH benchmark) and has a strong 88.4% on HumanEval, but its reasoning under AIME-style competition math is significantly weaker than either Qwen3 variant. Qwen3’s training methodology produces genuinely better mathematical reasoning at comparable or smaller VRAM footprints.

In everyday chat and coding tasks you won’t notice the 2–3 point quality gap between the two Qwen3 models. In structured math problems or complex multi-step reasoning, the dense 32B will occasionally produce a more complete chain of reasoning. Whether that marginal accuracy gain is worth 3–5× slower generation is a use-case question, answered below.

Thinking Mode: /think and /no_think

Qwen3-30B-A3B includes a built-in thinking mode switch — you don’t need a separate model file. Add /think anywhere in a prompt to activate extended chain-of-thought reasoning. The model generates its reasoning inside <think>...</think> tags before producing the final response. Add /no_think to turn it off within the same session.

When thinking mode helps:

  • Multi-step math problems where you’d spot-check the intermediate steps
  • Code debugging with non-obvious logic errors
  • Planning tasks where explicit tradeoff reasoning is valuable
  • Anything you’d hand to a smart person and ask them to “walk me through your reasoning”

When to leave it off:

  • General chat and summarization (adds latency with no quality gain)
  • Simple code generation from a clear specification
  • Creative writing and brainstorming
  • RAG retrieval and document Q&A

The speed cost of thinking mode is real. On an RTX 4090, a complex math problem generates several hundred to several thousand thinking tokens inside the <think> block before the answer appears — adding noticeable latency compared to non-thinking mode. For interactive chat, this feels slow. For a workflow that runs overnight batch tasks, it’s fine.

The most practical local configuration: add /no_think to your system prompt so thinking is disabled by default. Then explicitly add /think in user-facing prompts when you want the deep reasoning path. This keeps latency low for routine queries without losing the capability.

How to Run Qwen3-30B-A3B Locally

Ollama — simplest setup:

ollama run qwen3:30b-a3b

This pulls the Q4_K_M quantized version (~19 GB download). Ollama manages context and GPU layers automatically.

For a higher-precision variant:

ollama run qwen3:30b-a3b-q8_0

One important caveat: as of May 2026, there’s a documented Ollama performance regression affecting Qwen3 models introduced between versions 0.15.5 and 0.15.6. Some users on RTX 3090 reported tok/s dropping from ~35 to ~12 after updating. If your generation feels unexpectedly slow, test with llama.cpp directly to isolate whether Ollama is the bottleneck. The issue is tracked in GitHub issue #14740.

llama.cpp — best raw throughput:

Download a GGUF from Hugging Face (Bartowski’s quantized uploads are the most actively maintained for Qwen3). Then:

./llama-server \
  -m qwen3-30b-a3b-q4_k_m.gguf \
  -ngl 99 \
  -c 8192 \
  --port 11434

-ngl 99 offloads all 48 layers to GPU. If you’re partially CPU-offloading on a 16GB card, drop this to whatever layer count fits your VRAM (trial-and-error: start at 30, increase until you hit an OOM).

LM Studio and Jan.ai: both support standard GGUF download. Search “Qwen3-30B-A3B” and select any Q4_K_M variant from Bartowski or Unsloth’s uploads. Jan.ai reports 17.5 GB VRAM usage for their optimized UD-Q4_K_M variant (Unsloth Dynamic quant), which leaves more headroom for context than a standard Q4_K_M.

For a deeper comparison of Ollama vs llama.cpp performance at different concurrency levels, see vLLM vs Ollama: When Each One Wins.

MoE 30B-A3B or Dense 32B? The Decision Matrix

Your situationBetter choice
RTX 4090, prioritize fastest responsesQwen3-30B-A3B MoE
RTX 4090, prioritize best benchmark accuracyQwen3-32B dense
RTX 3090, single-user chatQwen3-30B-A3B (~73 tok/s vs ~33 tok/s for 32B dense)
RTX 3090, heavy math/coding with thinking modeQwen3-32B (slightly higher accuracy, slower is acceptable)
Serving multiple users from one GPUQwen3-30B-A3B (higher throughput per user)
16GB GPUQwen3-14B (skip both 24GB models)
Mac Studio 64GB+Qwen3-30B-A3B at Q4 (clean fit via MLX)
Mac Studio 96GB+Either; consider Qwen3-32B Q8 for near-lossless quality

The short version: if you run interactive local chat or a local API where response latency matters, the MoE model’s 3–5× throughput advantage is the deciding factor. If you’re batch-processing tasks overnight where latency doesn’t matter, the dense 32B’s slightly higher accuracy is worth it.

Honest Take

Qwen3-30B-A3B is the most practically useful model for 24GB local inference in mid-2026. The combination of fast generation, built-in thinking mode, 131K context (with YaRN), Apache 2.0 licensing, and strong multilingual support covers the typical home lab use case better than anything in the same VRAM tier.

The real constraint is the hard floor at 24GB VRAM. If you’re on a 16GB card, this model is not a downgrade path — the CPU offload required eliminates the speed advantage that makes it interesting. Qwen3-14B is the right answer there, running at full GPU speed on a 16GB card at Q4_K_M. For the RTX 4060 Ti 16GB, see the comparison with the RTX 3090 to decide whether a 24GB upgrade makes sense for your budget.

One quantization warning worth taking seriously: community tests show that MoE models like the 30B-A3B are more sensitive to low-bit quantization than dense models. Q4_K_M produces good output; Q3 and below show quality degradation faster than you’d expect from a comparable dense model. Stick to Q4_K_M or higher.

If you’re comparing cloud API costs against running this model locally, the math in Llama 3.3 70B at Home: Real Hardware Cost vs Cloud API translates directly to Qwen3-30B-A3B — substitute the faster tok/s and lower per-token inference cost, and the break-even point shifts favorably toward local hardware even sooner.

1V1 PLAYBOOK · LOCAL LLM

Cut your local AI bill from $400/month cloud GPU to $47/month at home.

4-path hardware decision table, Ollama cold-start fix, Cursor/Claude Code routing configs, full 24-month TCO calculator.

Get it for $19 (early bird) →

Sources

Last updated May 24, 2026. Model performance varies by inference backend, quantization, and hardware configuration; verify benchmarks against your specific setup before purchasing hardware.


The hardware mentioned in this guide, with current prices on Amazon (affiliate links — at no extra cost to you, purchases help support this site):

Was this article helpful?