Qwen3.6-27B for Local AI in 2026: Which GPU Runs It and What Speed to Expect

qwenlocal-llmgpuvramllama.cppollamainferencecoding2026

TL;DR: Qwen3.6-27B is a dense 27B model released April 22, 2026 that scores 77.2 on SWE-bench Verified — beating Alibaba’s own 397B MoE on coding benchmarks. At Q4_K_M it needs 16.8 GB of VRAM. An RTX 3090 runs it at ~40 tok/s; an RTX 4090 reaches ~70 tok/s. The 16 GB cards (RTX 5060 Ti, RTX 4080) can run it, but at constrained context windows.

RTX 5060 Ti 16GBRTX 3090 24GBRTX 4090 24GB
Best forEntry / budget-firstAI value workstationSpeed-first workflow
Price (May 2026)~$429 new~$1,050 used~$1,650
Q4_K_M headroomTight (at VRAM limit)ComfortableComfortable
Approx. tok/s Q4_K_M~31~40~70
Usable context~65K tokens~125K tokens~125K tokens
The catchContext limited, no slack5-yr-old archHigh cost

Honest take: The RTX 3090 is the value pick. It runs Q5_K_M with full context, costs $600 less than the 4090, and produces the same model class output. If you already own a 16 GB card, the 27B dense model is also viable — just keep your context windows short.


What makes Qwen3.6-27B different

Alibaba’s Qwen team released Qwen3.6-27B on April 22, 2026 under an Apache 2.0 license. The notable claim: a 27-billion-parameter dense model that beats the prior-generation Qwen3.5-397B-A17B — a 397B MoE with 17B active parameters — on every major agentic coding benchmark.

The benchmark scores that landed it on Hacker News:

BenchmarkQwen3.6-27BQwen3.5-397B-A17B (prior flagship)
SWE-bench Verified77.276.2
SWE-bench Pro53.5
Terminal-Bench 2.059.3
GPQA Diamond87.8
QwenWebBench14871068 (Qwen3.5-27B)

SWE-bench Verified is the closest thing the ML community has to a real-world coding exam — it asks a model to write a patch that fixes an actual GitHub issue in a real repository. A score of 77.2 puts this 27B model in the same range as hosted frontier models, not just “good for its size.”

The architecture is hybrid: 64 layers alternating Gated DeltaNet blocks and standard attention, with a multi-token prediction head that enables speculative decoding. Native context is 262,144 tokens, extensible to ~1,010,000 via YaRN. It’s multimodal (vision-language), though the vision path has limitations in Ollama — more on that below.


VRAM requirements at every quantization level

QuantizationVRAM neededFits onQuality vs Q8
Q4_K_S15.9 GBRTX 4080 16GB (tight), RTX 5060 Ti 16GB (tight)Good
Q4_K_M16.8 GBRTX 4080 16GB (very tight), RTX 3090, RTX 4090Good–Very Good
Q5_K_M~19.5 GBRTX 3090 24GB, RTX 4090 24GBVery Good
Q6_K~22.5 GBRTX 3090, RTX 4090, Mac M4 Pro 24GB (tight)Near-lossless
Q8_0~28.6 GBDual 16GB (PCIe split), A100 40GB, Mac M4 Max 36GB+Reference
BF16 (full)~62 GBMulti-GPU or cloud (RunPod A100)Original

Q4_K_M is the practical sweet spot for most home builders: 16.8 GB of VRAM, near-lossless quality for code generation, and roughly half the memory footprint of Q8. If you want more fidelity without moving to a bigger card, Q5_K_M adds ~2.7 GB for a modest quality bump that most users won’t notice in conversation but may matter in multi-step agentic tasks.

For the full 62 GB BF16 run or extended-context experiments beyond 32K at Q8, a cloud GPU is the pragmatic answer. RunPod rents A100 80GB instances starting under $2/hr — worthwhile for occasional fine-tuning or benchmarking sessions where you don’t want to build permanent hardware around an edge case.


What speed to expect by GPU

RTX 5060 Ti 16GB (~$429) — tight but functional

The RTX 5060 Ti 16GB runs Qwen3.6-27B Q4_K_M at approximately 31 tokens/second with a usable context window of about 65K tokens (extendable to 131K with -np 1 in llama.cpp, at the cost of speed). The 16.8 GB Q4_K_M GGUF exceeds the card’s 16 GB VRAM, so the runtime spills the overflow to system RAM — the 31 tok/s figure reflects that real-world overhead.

For short coding tasks and chat, 31 tok/s is fine. For long agentic workflows that need 100K+ token context windows, the card runs out of room. That’s the honest constraint.

If you already have an RTX 5060 Ti, run the model — it works. If you’re buying hardware specifically for this model, 16 GB is too tight a budget for what Qwen3.6-27B does best.

One comparison worth knowing: the companion Qwen3.6-35B-A3B MoE model runs at approximately 98 tok/s on the same 16 GB card with the full 262K context available, because its 3B active parameters fit comfortably. The MoE trades some absolute coding quality for dramatically better speed and context on 16 GB hardware. Both cards run both models — pick based on whether you prioritize coding quality (dense 27B) or speed and context (MoE).

See our RTX 5060 Ti 8GB vs 16GB analysis for more on why the 16GB version matters specifically for local LLM work.

RTX 3090 24GB (~$1,050 used) — the value home

The RTX 3090 runs Qwen3.6-27B Q4_K_M at approximately 40 tokens/second as a baseline, with 24 GB of VRAM giving you comfortable headroom: no spill to system RAM, and enough room to run Q5_K_M at ~19.5 GB if you want the extra fidelity.

That 40 tok/s baseline is for straightforward llama.cpp inference. With speculative decoding using a Qwen3-0.6B draft model, community setups have reported 78–85 tok/s on the 3090 — a roughly 2× lift for repetitive code patterns where the draft model can predict tokens accurately. The gain is real but config-sensitive; plan on 40 tok/s as your floor and treat the higher numbers as an optimization goal.

The 3090’s 936 GB/s memory bandwidth remains competitive — it’s faster than the RTX 5060 Ti 16GB (448 GB/s) and only 7% behind the RTX 4090 (1,008 GB/s). For LLM inference where bandwidth dictates speed, the 3090 is still relevant hardware five years after launch.

The tradeoffs: it’s a 350W TDP card, it’s used hardware (mining-card concerns apply to the cheapest listings), and at ~$1,050 it no longer looks like the obvious steal it was in 2024. Our used RTX 3090 deep-dive has the full risk assessment and eBay inspection checklist.

RTX 4090 24GB (~$1,650) — speed-first

The RTX 4090 runs Qwen3.6-27B Q4_K_M at approximately 70 tokens/second in typical llama.cpp use. With speculative decoding enabled, community members have reported 154 tok/s on optimized configs — that’s fast enough to feel like typing at the model rather than waiting for it.

The 1,008 GB/s memory bandwidth and 16,384 CUDA cores make it the fastest single-GPU consumer option for this model. The cost is $1,650 used — you’re paying a 57% premium over the RTX 3090 for roughly 1.75× the speed at the same Q4_K_M quantization.

If your primary use case is interactive coding assistance where latency matters — you’re at the keyboard, waiting for responses — the 4090 is the better workstation card. If you’re running batch inference, fine-tuning overnight, or mostly care about model quality over response latency, the 3090 gives you the same model class for $600 less.

Mac Silicon — Apple’s different math

The Mac Mini M4 Pro with 24 GB unified memory runs Qwen3.6-27B Q4_K_M, but it’s tight: macOS reserves approximately 3.5 GB at idle, leaving ~20.5 GB for the model. This means you’ll need to close Chrome, Docker, and other memory-hungry apps before loading the model, and context windows will be constrained.

If you want Mac Silicon to shine with this model, the M4 Max at 36 GB or 48 GB is the right platform. MLX uses roughly 10% less memory than GGUF on Apple hardware and runs 15–30% faster at the same quantization — but even with those savings, 24 GB is tight for full-context Qwen3.6-27B use.

Our Mac Mini M4 Pro local AI review covers the full MLX setup and what the 24 GB constraint means across different model sizes.


How to run it: Ollama vs llama.cpp

Ollama is the fastest path for most users:

ollama pull qwen3:27b
ollama run qwen3:27b

Critical: Ollama’s default context window is 2,048 tokens. For any serious coding use, override it immediately:

ollama run qwen3:27b --num-ctx 32768

Or set it persistently in a Modelfile and ollama create. Without this, the model will truncate mid-task on anything longer than a short prompt.

Important Ollama caveat: the vision (multimodal) path is broken for Qwen3.6-27B in Ollama. The model ships its vision projector as a separate file that Ollama’s GGUF pipeline doesn’t wire up. Text generation works fine; if you need image input, use llama.cpp or MLX-VLM instead.

llama.cpp gives you more control, especially for speculative decoding:

llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --ctx-size 65536 \
  --n-gpu-layers 99 \
  --chat-template-kwargs '{"enable_thinking":false}'

The --chat-template-kwargs flag controls thinking mode (below). For speculative decoding, pair the 27B target with a small Qwen3 draft model for 1.5–2× throughput on repetitive code patterns.

For multi-user serving at higher concurrency, vLLM’s NVFP4 quantization and PagedAttention show 3–4× better throughput than llama.cpp at parallel load. See our vLLM vs Ollama breakdown for when each framework wins.


Thinking mode: when to use it, when to turn it off

Qwen3.6-27B is a hybrid thinking model — it can reason step-by-step through a problem (thinking mode) or respond directly (non-thinking mode). Both modes are in the same checkpoint; you switch at inference time.

Enable thinking for:

  • Multi-step algorithmic problems
  • Debugging sessions where you want the model to work through the logic
  • Math-heavy tasks (GPQA Diamond: 87.8 is with thinking enabled)

Disable thinking for:

  • Quick code completions and one-shot answers
  • Conversational use where the reasoning tokens add latency without value
  • Agent loops where you need fast, deterministic responses

Disable it in llama.cpp with:

--chat-template-kwargs '{"enable_thinking":false}'

Or in a prompt by appending /no_think to your message. Thinking is enabled by default. For most coding-assist workflows, disabling it and only enabling it for hard problems is the right balance — you’ll cut average response latency by 30–50% on straightforward tasks.


Dense vs MoE: which Qwen3.6 variant for your hardware

The Qwen3.6 series ships two distinct models:

Qwen3.6-27B (this article)Qwen3.6-35B-A3B
ArchitectureDense (27B active)MoE (3B active / 35B total)
Q4_K_M VRAM16.8 GB~3.4 GB active (fits 16 GB easily)
Speed (RTX 5060 Ti)~31 tok/s~98 tok/s
Context at full speed~65K on 16 GB262K on 16 GB
Coding qualityHigher (77.2 SWE-bench)Slightly lower
Best forQuality-first codingSpeed + long context

For home builders on 16 GB: the MoE wins on daily-driver use because 98 tok/s with full 262K context is a qualitatively different experience than 31 tok/s with a 65K limit. The 27B dense model earns its keep on 24 GB hardware where you can run Q5_K_M at full context — that’s where the higher benchmark scores translate to real-world quality you’ll notice.

For deeper context on the original Qwen3 family’s performance across model sizes, see our best local AI models by VRAM guide. For tracking how quantization choices translate to actual output quality, our Q4 vs Q8 quality loss analysis has the numbers.

If you’re using Qwen3.6-27B specifically for AI-assisted coding workflows — integrating it into an IDE plugin or local agentic pipeline — the AI coding tools comparison at aicoderscope.com covers how local models like this one compare to hosted API tools in real coding environments.

For tracking Qwen3.6 and other open-weight models across the FOSS ecosystem, aifoss.dev maintains a running directory of self-hosted AI tools and inference stacks.


Frequently Asked Questions

Can the RTX 5060 Ti 16GB run Qwen3.6-27B? Yes, with constraints. Q4_K_M at 16.8 GB exceeds the card’s 16 GB VRAM, so the runtime offloads a small amount to system RAM, reducing speed to approximately 31 tokens/second. Context windows are limited to roughly 65K tokens. It runs well for short-to-medium coding tasks; for large codebase analysis requiring 100K+ context, you need a 24 GB card.

What’s the difference between Qwen3.6-27B and Qwen3.6-35B-A3B? The 27B is dense — all 27 billion parameters are active on every token. The 35B-A3B is a Mixture-of-Experts model where only ~3 billion parameters activate per token, making it dramatically faster (98 tok/s vs 31 tok/s on an RTX 5060 Ti) and easier to fit on 16 GB VRAM. The 27B has a higher SWE-bench score (77.2) and is the better choice when coding quality is the priority and you have 24+ GB of VRAM.

Does Qwen3.6-27B work with Ollama? Yes for text generation — ollama pull qwen3:27b works. Two caveats: set --num-ctx to at least 32768 (the default of 2048 is too small for real tasks), and vision/image input is broken in Ollama because the vision projector ships as a separate file. For vision capability, use llama.cpp or MLX-VLM directly.

How fast is Qwen3.6-27B on a Mac Mini M4 Pro? The 24 GB M4 Pro is marginal: macOS reserves ~3.5 GB at idle, leaving ~20.5 GB usable. Q4_K_M at 16.8 GB fits, but you need to close memory-hungry apps and context windows will be limited. Expect similar or slightly lower throughput than an RTX 3090 (~35–40 tok/s) due to MLX’s memory efficiency, but with the practical constraint of less usable headroom. The M4 Max at 36 GB+ is the comfortable Mac platform for this model.

Should I use thinking mode for coding tasks? Disable it for most coding work. Thinking mode adds reasoning tokens before the response, which is valuable for hard algorithmic problems but adds unnecessary latency for routine completions and one-shot questions. Use /no_think in your prompt or --chat-template-kwargs '{"enable_thinking":false}' in llama.cpp as your default, and enable it selectively for complex debugging or math-heavy tasks.


Sources

Last updated May 29, 2026. GPU prices and used-market rates change weekly; verify current listings before purchasing.


Was this article helpful?