Jun 8, 2026

Qwen 3.6 35B-A3B for Local AI in 2026: The 24GB VRAM Line That Gets You 120 tok/s

By RunAIHome Team · 13 min read

qwenlocal-llmgpuvramcoding-modelmoe

TL;DR: Qwen 3.6 35B-A3B is a Mixture-of-Experts model that costs only 3B parameters of compute per token — but all 35B must live in VRAM simultaneously, setting a hard 24GB floor. On a 24GB RTX 4090 it reaches 120 tok/s at Q4_K_M; a used RTX 3090 gets 107 tok/s with Ollama, or 135.7 tok/s with a properly tuned llama.cpp config. Anything with 16GB VRAM needs CPU offloading that cuts speed by half.

	RTX 4090 24GB	RTX 3090 24GB	Mac Mini M4 Pro 24GB
Best for	Fastest local option	Budget 24GB route	Quiet, no discrete GPU
Q4_K_M (21 GB)	✅ 120 tok/s	✅ 107 tok/s	✅ 15–22 tok/s
Q5_K_M (25 GB)	❌ Overflow	❌ Overflow	❌ Overflow
Street price (Jun 2026)	~$2,300 used	~$1,050 used	$1,399

Honest take: If you own a 24GB GPU, Qwen 3.6 35B-A3B is worth trying for coding. If you have 16GB, the Qwen 3.6 27B dense sibling fits natively, scores higher on SWE-bench (77.2% vs 73.4%), and runs faster on the same hardware — pick that one instead.

Alibaba released Qwen3.6-35B-A3B on April 16, 2026 under Apache 2.0. The headline number is 73.4% on SWE-bench Verified (Alibaba’s internal agent scaffold), which puts it in the same neighborhood as frontier closed models. But the 24GB VRAM requirement is non-negotiable for usable performance, and understanding why that’s true requires understanding how MoE memory actually works.

The MoE memory trap

Most people hear “only 3B active parameters” and assume the model needs 3B worth of VRAM. It doesn’t.

Mixture-of-Experts routing is cheap at compute time: on each forward pass, the router selects a subset of expert layers — the equivalent of ~3B parameters — to process each token. Everything else sits idle. But idle does not mean absent. All 35B parameters must be loaded into memory before the router can decide which ones to activate.

The practical result: Qwen 3.6 35B-A3B has the inference speed of a 3B model but the VRAM footprint of a 35B model. That’s the MoE tradeoff. You get cheap compute in exchange for a fixed memory tax you cannot escape.

This is also why comparing to dense models requires care. The Qwen 3.6 27B dense activates all 27B parameters on every token, which sounds heavier — but at Q4_K_M it needs only ~16 GB, leaving 8 GB of headroom on a 24GB card. And it scores higher on SWE-bench. The 35B-A3B model wins on throughput efficiency and very long-context workloads; it doesn’t automatically win on coding quality. If you’re weighing the MoE options specifically, the older Qwen 3 30B-A3B is the lighter sibling — fewer total parameters (30.5B vs 35B) and a smaller ~19 GB footprint — while the dense 27B guide is the call if you want a dense model instead; treat the three as a graded cluster, not rivals. If you’re weighing Qwen3 against DeepSeek’s V4 family rather than within the Qwen lineup, DeepSeek V4 vs Qwen3 for local AI lays out the VRAM decision tree across both.

VRAM requirements by quantization

Weight-only VRAM, before KV cache:

Quantization	VRAM (weights only)	Headroom on 24GB card
Q3_K_M	~15.8 GB	8.2 GB (fits, quality loss)
Q4_0	~19.6 GB	4.4 GB
Q4_K_M	~21 GB	3 GB
UD-Q4_K_XL (Unsloth)	~22.1 GB	1.9 GB
Q5_K_M	~25.2 GB	❌ Overflow
Q6_K	~28.7 GB	❌ Overflow
Q8_0	~37 GB	Needs 40GB+
FP16	~71.8 GB	Server GPUs only

The Unsloth UD-Q4_K_XL is the community-preferred quant for 24GB GPUs: it uses Unsloth’s Dynamic quantization to preserve perplexity closer to Q5 while staying at 22.1 GB. The 1.9 GB of free VRAM after loading is tight for KV cache, which is why KV cache quantization in llama.cpp matters — it compresses the cache to reclaim headroom.

KV cache adds on top of weights:

8K context, Q8_0 KV cache: ~1.5 GB extra
16K context, Q8_0 KV cache: ~3 GB extra
32K context, Q8_0 KV cache: ~6 GB extra
262K context (full native): ~48 GB extra — no single consumer GPU

For interactive coding sessions, 16K context covers roughly 12,000 lines of code. Q4_K_M weights plus Q8_0 KV cache keeps you within 24GB. That handles the overwhelming majority of coding tasks. The 262K and 1M context claims are real but require a Mac with 96GB+ unified memory or multi-GPU setups.

GPU compatibility

24GB cards — where this model lives

RTX 4090 (~$2,300 used)

Fastest consumer option for this model. At Q4_K_M via Ollama: ~78 tok/s with time-to-first-token of ~3.2 seconds at 8K context. Switch to llama.cpp with UD-Q4_K_XL and flash attention enabled: 120+ tok/s. The 4090’s 1,008 GB/s bandwidth gives it a slight edge over the 3090 on sustained generation, though the gap is smaller than the bandwidth difference suggests — MoE routing is more memory-latency than bandwidth sensitive.

RTX 3090 (~$1,050 used, June 2026)

Same 24GB as the 4090, lower bandwidth (936 GB/s). Ollama gives ~107 tok/s. With llama.cpp’s UD-Q4_K_XL and --cache-type-k q8_0 --cache-type-v q8_0: 135.7 tok/s — 27% faster than Ollama on the same hardware. The RTX 3090 remains the best-value 24GB card for local AI. Full value analysis here.

Mac Mini M4 Pro 24GB ($1,399)

Unified memory means the GPU and CPU share the same 400+ GB/s bandwidth pool — architecturally clean for MoE routing since the router can page expert parameters without crossing a PCIe bus. The downside is absolute throughput: 15–22 tok/s at Q4_0, 8K–16K context. Enough for interactive coding; not for batch work. At 48GB ($1,599), you gain Q5_K_M support and can push context to 32K comfortably.

16GB cards — the wall

Every 16GB consumer GPU (RTX 5060 Ti, RTX 4060 Ti, RTX 5070, RTX 5080, RX 9070 XT) runs into the same issue: Q4_K_M needs 21 GB, and 16 < 21. Three paths forward:

Option 1: Q3_K_M (~15.8 GB) — fits natively on 16GB VRAM, but coding quality drops noticeably. On instruction-following and code completion tasks, Q3 typically loses 5–8 points versus Q4 on this architecture. Whether that’s acceptable depends on your workload.

Option 2: CPU layer offloading — with 64 GB+ system DDR5 and -ngl 20 (partial GPU offload in llama.cpp), expect 30–50 tok/s. Usable for background inference tasks, poor for interactive coding where latency matters.

Option 3: Use Qwen 3.6 27B dense — fits in ~16 GB at Q4_K_M, runs at ~140 tok/s on an RTX 4090 (faster than the 35B-A3B), and scores 77.2% SWE-bench. For pure coding on 16GB hardware, this is the right model. See the full 27B guide.

More VRAM (comfortable range)

RTX 5090 32GB — fits Q5_K_M with 7 GB of KV cache headroom. Community reports put it at 160–180 tok/s with llama.cpp. The step from Q4 to Q5 is essentially free since you’re using VRAM that would otherwise be idle.

Mac Studio M4 Max 36GB+ — comfortable at Q5_K_M, reaching ~30–40 tok/s. The bandwidth jump (546 GB/s vs 400 GB/s on Mac Mini M4 Pro) translates to roughly 2× generation speed on this model. Full comparison in the Mac Studio vs Mac Mini guide.

Benchmark: Ollama vs llama.cpp on RTX 3090

The RTX 3090 data makes the backend gap concrete:

Backend	Config	Speed (RTX 3090)
Ollama 0.20.7	Default Q4	107 tok/s
llama.cpp	UD-Q4_K_XL + `--cache-type-k q8_0 --cache-type-v q8_0`	135.7 tok/s (+27%)

The difference is KV cache quantization. Ollama doesn’t expose that knob; llama.cpp does. At 135.7 tok/s the model delivers tokens faster than most people read them — it feels instantaneous for single-user coding sessions.

For context on how this compares to other backends across different workloads, see the vLLM vs Ollama comparison. vLLM pulls ahead at multi-user concurrency but doesn’t add value for a single-user home setup.

Setup

Ollama (quickest start)

ollama pull qwen3.6:35b-a3b
ollama run qwen3.6:35b-a3b

Downloads the default Q4_K_M GGUF (~21 GB). Ollama sets a 256K context window in its model card, but the effective limit is your available VRAM minus the 21 GB of weights — realistically 8–16K context on a 24GB card before you need KV cache quantization.

Error you will hit: If you see repetitive or garbled output, your Ollama install is outdated. There was a CUDA output bug with Qwen 3.6 on builds before April 20, 2026. Check with ollama --version and update if you’re below 0.20.0.

llama.cpp (performance config)

# Download Unsloth's UD-Q4_K_XL GGUF (~22.1 GB)
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --include "UD-Q4_K_XL*" \
  --local-dir ./models/

# Serve with KV cache quantization — fits 16K context in 24GB VRAM
./llama-server \
  --model ./models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  --flash-attn \
  -c 16384 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8080

-ngl 99 offloads all layers to GPU. --flash-attn is required to activate KV cache quantization — the --cache-type flags do nothing without it. -c 16384 (16K context) fits within 24GB with Q8_0 KV cache; push to 32768 if your VRAM monitor shows ~2 GB of free space after startup.

CPU offload fallback for 16GB GPUs

./llama-server \
  --model ./models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 20 \
  -c 4096 \
  --port 8080

-ngl 20 offloads the first 20 layers to GPU and the remainder to system RAM. Start there, check nvidia-smi dmon for GPU utilization and VRAM usage, and increase -ngl until you saturate VRAM without overflowing. On a 16GB GPU + 64 GB DDR5, expect 30–50 tok/s and 4K context as your practical ceiling.

MTP (Multi-Token Prediction) — Blackwell only

Unsloth ships an MTP variant for speculative multi-token generation:

./llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
  -ngl 99 \
  -c 8192 \
  -fa on \
  -np 1 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2

This is a Blackwell-architecture optimization. Community testing across 19 configurations on an RTX 3090 (Ampere) showed zero net speedup — the MoE routing overhead cancels the draft gains on Ampere. Don’t bother with this flag on RTX 3090/4090; it’s for RTX 5090 and future hardware.

When cloud makes more sense

If you want to evaluate the model before investing in a 24GB GPU upgrade, RunPod rents an A40 48GB at $0.74/hour — that covers Q5_K_M at full quality with generous context headroom. For occasional testing (a few hours per week), cloud costs less than hardware. The break-even point versus a used RTX 3090 at $520 and typical usage is around 5–6 months of daily coding sessions.

See the RunPod vs local GPU break-even analysis for the full electricity + hardware amortization math.

If you’re using this model as a backend for AI coding tools like Cursor or Continue.dev, aicoderscope.com covers the editor-side configuration for connecting local API endpoints to those tools.

Qwen 3.6 35B-A3B vs 27B dense

	35B-A3B MoE	27B dense
VRAM at Q4_K_M	~21 GB	~16 GB
SWE-bench Verified	73.4%	77.2%
Speed (RTX 4090, Ollama)	~78 tok/s	~140 tok/s
16GB GPU support	Q3 or CPU offload	Native ✅
Native context	262K tokens	131K tokens
Multimodal	✅	✅

On a 24GB GPU, the 27B dense is faster, uses less VRAM, and beats the 35B-A3B on SWE-bench. The case for the 35B-A3B is narrow: it wins on very long context tasks (large codebase navigation over 100K tokens) and on mixed workloads where the low per-token compute cost adds up over thousands of requests. For straightforward coding assistance on a single-user machine, start with the 27B.

FAQ

Can I run this on a laptop?
Only on laptops with 24GB VRAM, which currently means workstation-class mobile GPUs. Most consumer laptops top out at 16GB, so Qwen 3.6 27B dense is the practical choice there.

Why is the 3090 sometimes faster than the 4090 in community benchmarks?
The confusion comes from comparing Ollama on the 4090 (~78 tok/s) against llama.cpp on the 3090 (~135 tok/s). The backend matters more than the GPU here. With the same llama.cpp config, the 4090 is faster by ~10–15%. With Ollama on both, the 4090 edges out the 3090 due to higher bandwidth.

Does it support ROCm (AMD GPUs)?
Yes, via llama.cpp with ROCm backend. ROCm 7.2 on Ubuntu 24.04 is the current stable setup. The RX 7900 XTX at 24GB VRAM is the AMD card to target; performance typically lands at 80–90 tok/s for Q4_K_M, behind the RTX 3090 due to ROCm overhead relative to CUDA.

What’s the actual 73.4% SWE-bench number worth?
It’s measured with Alibaba’s internal agent scaffold, not the public harness. Third-party reproductions on the standard scaffold land 5–10 points lower. The independent anchor: Claude Sonnet 4.5 at 77.2% SWE-bench Verified (public harness). Qwen 3.6 35B-A3B is a strong open-weight coding model; it’s not ahead of frontier closed models.

Does speculative decoding help on RTX 3090?
No. Community benchmarking of 19 configurations (ngram-cache, ngram-mod, and classic draft with a vocab-matched Qwen3.5-0.8B draft model) found no net speedup on Ampere hardware. MoE routing overhead dominates the draft gain. Skip the MTP flags on Ampere; the benefit shows up on Blackwell.

Is the 1M context claim real?
The native context window is 262,144 tokens; it’s extensible to 1,010,000 via YaRN-style position interpolation, which Alibaba confirmed in the model card. At those lengths you need 48+ GB of KV cache on top of the 21 GB of weights — that’s Mac Studio M4 Max territory with 96+ GB unified memory, or multi-GPU. For anything under 32K context, you’re in normal consumer hardware range.

Recommended Gear

RTX 4090 24GB — fastest consumer GPU for this model
RTX 3090 24GB — best value 24GB card on the used market
RTX 5090 32GB — Q5_K_M fits with 7 GB to spare, 160+ tok/s

Sources

Last updated June 8, 2026. Hardware prices shift weekly; verify current listings before purchasing.

Was this article helpful?