Qwen3.6-27B for Local AI in 2026: Which GPU Runs It and What Speed to Expect
TL;DR: Qwen3.6-27B is a dense 27B model released April 22, 2026 that scores 77.2 on SWE-bench Verified — beating Alibaba’s own 397B MoE on coding benchmarks. At Q4_K_M it needs 16.8 GB of VRAM. An RTX 3090 runs it at ~40 tok/s; an RTX 4090 reaches ~70 tok/s. The 16 GB cards (RTX 5060 Ti, RTX 4080) can run it, but at constrained context windows.
| RTX 5060 Ti 16GB | RTX 3090 24GB | RTX 4090 24GB | |
|---|---|---|---|
| Best for | Entry / budget-first | AI value workstation | Speed-first workflow |
| Price (May 2026) | ~$429 new | ~$1,050 used | ~$1,650 |
| Q4_K_M headroom | Tight (at VRAM limit) | Comfortable | Comfortable |
| Approx. tok/s Q4_K_M | ~31 | ~40 | ~70 |
| Usable context | ~65K tokens | ~125K tokens | ~125K tokens |
| The catch | Context limited, no slack | 5-yr-old arch | High cost |
Honest take: The RTX 3090 is the value pick. It runs Q5_K_M with full context, costs $600 less than the 4090, and produces the same model class output. If you already own a 16 GB card, the 27B dense model is also viable — just keep your context windows short.
What makes Qwen3.6-27B different
Alibaba’s Qwen team released Qwen3.6-27B on April 22, 2026 under an Apache 2.0 license. The notable claim: a 27-billion-parameter dense model that beats the prior-generation Qwen3.5-397B-A17B — a 397B MoE with 17B active parameters — on every major agentic coding benchmark.
The benchmark scores that landed it on Hacker News:
| Benchmark | Qwen3.6-27B | Qwen3.5-397B-A17B (prior flagship) |
|---|---|---|
| SWE-bench Verified | 77.2 | 76.2 |
| SWE-bench Pro | 53.5 | — |
| Terminal-Bench 2.0 | 59.3 | — |
| GPQA Diamond | 87.8 | — |
| QwenWebBench | 1487 | 1068 (Qwen3.5-27B) |
SWE-bench Verified is the closest thing the ML community has to a real-world coding exam — it asks a model to write a patch that fixes an actual GitHub issue in a real repository. A score of 77.2 puts this 27B model in the same range as hosted frontier models, not just “good for its size.”
The architecture is hybrid: 64 layers alternating Gated DeltaNet blocks and standard attention, with a multi-token prediction head that enables speculative decoding. Native context is 262,144 tokens, extensible to ~1,010,000 via YaRN. It’s multimodal (vision-language), though the vision path has limitations in Ollama — more on that below.
VRAM requirements at every quantization level
| Quantization | VRAM needed | Fits on | Quality vs Q8 |
|---|---|---|---|
| Q4_K_S | 15.9 GB | RTX 4080 16GB (tight), RTX 5060 Ti 16GB (tight) | Good |
| Q4_K_M | 16.8 GB | RTX 4080 16GB (very tight), RTX 3090, RTX 4090 | Good–Very Good |
| Q5_K_M | ~19.5 GB | RTX 3090 24GB, RTX 4090 24GB | Very Good |
| Q6_K | ~22.5 GB | RTX 3090, RTX 4090, Mac M4 Pro 24GB (tight) | Near-lossless |
| Q8_0 | ~28.6 GB | Dual 16GB (PCIe split), A100 40GB, Mac M4 Max 36GB+ | Reference |
| BF16 (full) | ~62 GB | Multi-GPU or cloud (RunPod A100) | Original |
Q4_K_M is the practical sweet spot for most home builders: 16.8 GB of VRAM, near-lossless quality for code generation, and roughly half the memory footprint of Q8. If you want more fidelity without moving to a bigger card, Q5_K_M adds ~2.7 GB for a modest quality bump that most users won’t notice in conversation but may matter in multi-step agentic tasks.
For the full 62 GB BF16 run or extended-context experiments beyond 32K at Q8, a cloud GPU is the pragmatic answer. RunPod rents A100 80GB instances starting under $2/hr — worthwhile for occasional fine-tuning or benchmarking sessions where you don’t want to build permanent hardware around an edge case.
What speed to expect by GPU
RTX 5060 Ti 16GB (~$429) — tight but functional
The RTX 5060 Ti 16GB runs Qwen3.6-27B Q4_K_M at approximately 31 tokens/second with a usable context window of about 65K tokens (extendable to 131K with -np 1 in llama.cpp, at the cost of speed). The 16.8 GB Q4_K_M GGUF exceeds the card’s 16 GB VRAM, so the runtime spills the overflow to system RAM — the 31 tok/s figure reflects that real-world overhead.
For short coding tasks and chat, 31 tok/s is fine. For long agentic workflows that need 100K+ token context windows, the card runs out of room. That’s the honest constraint.
If you already have an RTX 5060 Ti, run the model — it works. If you’re buying hardware specifically for this model, 16 GB is too tight a budget for what Qwen3.6-27B does best.
One comparison worth knowing: the companion Qwen3.6-35B-A3B MoE model runs at approximately 98 tok/s on the same 16 GB card with the full 262K context available, because its 3B active parameters fit comfortably. The MoE trades some absolute coding quality for dramatically better speed and context on 16 GB hardware. Both cards run both models — pick based on whether you prioritize coding quality (dense 27B) or speed and context (MoE).
See our RTX 5060 Ti 8GB vs 16GB analysis for more on why the 16GB version matters specifically for local LLM work.
RTX 3090 24GB (~$1,050 used) — the value home
The RTX 3090 runs Qwen3.6-27B Q4_K_M at approximately 40 tokens/second as a baseline, with 24 GB of VRAM giving you comfortable headroom: no spill to system RAM, and enough room to run Q5_K_M at ~19.5 GB if you want the extra fidelity.
That 40 tok/s baseline is for straightforward llama.cpp inference. With speculative decoding using a Qwen3-0.6B draft model, community setups have reported 78–85 tok/s on the 3090 — a roughly 2× lift for repetitive code patterns where the draft model can predict tokens accurately. The gain is real but config-sensitive; plan on 40 tok/s as your floor and treat the higher numbers as an optimization goal.
The 3090’s 936 GB/s memory bandwidth remains competitive — it’s faster than the RTX 5060 Ti 16GB (448 GB/s) and only 7% behind the RTX 4090 (1,008 GB/s). For LLM inference where bandwidth dictates speed, the 3090 is still relevant hardware five years after launch.
The tradeoffs: it’s a 350W TDP card, it’s used hardware (mining-card concerns apply to the cheapest listings), and at ~$1,050 it no longer looks like the obvious steal it was in 2024. Our used RTX 3090 deep-dive has the full risk assessment and eBay inspection checklist.
RTX 4090 24GB (~$1,650) — speed-first
The RTX 4090 runs Qwen3.6-27B Q4_K_M at approximately 70 tokens/second in typical llama.cpp use. With speculative decoding enabled, community members have reported 154 tok/s on optimized configs — that’s fast enough to feel like typing at the model rather than waiting for it.
The 1,008 GB/s memory bandwidth and 16,384 CUDA cores make it the fastest single-GPU consumer option for this model. The cost is $1,650 used — you’re paying a 57% premium over the RTX 3090 for roughly 1.75× the speed at the same Q4_K_M quantization.
If your primary use case is interactive coding assistance where latency matters — you’re at the keyboard, waiting for responses — the 4090 is the better workstation card. If you’re running batch inference, fine-tuning overnight, or mostly care about model quality over response latency, the 3090 gives you the same model class for $600 less.
Mac Silicon — Apple’s different math
The Mac Mini M4 Pro with 24 GB unified memory runs Qwen3.6-27B Q4_K_M, but it’s tight: macOS reserves approximately 3.5 GB at idle, leaving ~20.5 GB for the model. This means you’ll need to close Chrome, Docker, and other memory-hungry apps before loading the model, and context windows will be constrained.
If you want Mac Silicon to shine with this model, the M4 Max at 36 GB or 48 GB is the right platform. MLX uses roughly 10% less memory than GGUF on Apple hardware and runs 15–30% faster at the same quantization — but even with those savings, 24 GB is tight for full-context Qwen3.6-27B use.
Our Mac Mini M4 Pro local AI review covers the full MLX setup and what the 24 GB constraint means across different model sizes.
How to run it: Ollama vs llama.cpp
Ollama is the fastest path for most users:
ollama pull qwen3:27b
ollama run qwen3:27b
Critical: Ollama’s default context window is 2,048 tokens. For any serious coding use, override it immediately:
ollama run qwen3:27b --num-ctx 32768
Or set it persistently in a Modelfile and ollama create. Without this, the model will truncate mid-task on anything longer than a short prompt.
Important Ollama caveat: the vision (multimodal) path is broken for Qwen3.6-27B in Ollama. The model ships its vision projector as a separate file that Ollama’s GGUF pipeline doesn’t wire up. Text generation works fine; if you need image input, use llama.cpp or MLX-VLM instead.
llama.cpp gives you more control, especially for speculative decoding:
llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
--ctx-size 65536 \
--n-gpu-layers 99 \
--chat-template-kwargs '{"enable_thinking":false}'
The --chat-template-kwargs flag controls thinking mode (below). For speculative decoding, pair the 27B target with a small Qwen3 draft model for 1.5–2× throughput on repetitive code patterns.
For multi-user serving at higher concurrency, vLLM’s NVFP4 quantization and PagedAttention show 3–4× better throughput than llama.cpp at parallel load. See our vLLM vs Ollama breakdown for when each framework wins.
Thinking mode: when to use it, when to turn it off
Qwen3.6-27B is a hybrid thinking model — it can reason step-by-step through a problem (thinking mode) or respond directly (non-thinking mode). Both modes are in the same checkpoint; you switch at inference time.
Enable thinking for:
- Multi-step algorithmic problems
- Debugging sessions where you want the model to work through the logic
- Math-heavy tasks (GPQA Diamond: 87.8 is with thinking enabled)
Disable thinking for:
- Quick code completions and one-shot answers
- Conversational use where the reasoning tokens add latency without value
- Agent loops where you need fast, deterministic responses
Disable it in llama.cpp with:
--chat-template-kwargs '{"enable_thinking":false}'
Or in a prompt by appending /no_think to your message. Thinking is enabled by default. For most coding-assist workflows, disabling it and only enabling it for hard problems is the right balance — you’ll cut average response latency by 30–50% on straightforward tasks.
Dense vs MoE: which Qwen3.6 variant for your hardware
The Qwen3.6 series ships two distinct models:
| Qwen3.6-27B (this article) | Qwen3.6-35B-A3B | |
|---|---|---|
| Architecture | Dense (27B active) | MoE (3B active / 35B total) |
| Q4_K_M VRAM | 16.8 GB | ~3.4 GB active (fits 16 GB easily) |
| Speed (RTX 5060 Ti) | ~31 tok/s | ~98 tok/s |
| Context at full speed | ~65K on 16 GB | 262K on 16 GB |
| Coding quality | Higher (77.2 SWE-bench) | Slightly lower |
| Best for | Quality-first coding | Speed + long context |
For home builders on 16 GB: the MoE wins on daily-driver use because 98 tok/s with full 262K context is a qualitatively different experience than 31 tok/s with a 65K limit. The 27B dense model earns its keep on 24 GB hardware where you can run Q5_K_M at full context — that’s where the higher benchmark scores translate to real-world quality you’ll notice.
For deeper context on the original Qwen3 family’s performance across model sizes, see our best local AI models by VRAM guide. For tracking how quantization choices translate to actual output quality, our Q4 vs Q8 quality loss analysis has the numbers.
If you’re using Qwen3.6-27B specifically for AI-assisted coding workflows — integrating it into an IDE plugin or local agentic pipeline — the AI coding tools comparison at aicoderscope.com covers how local models like this one compare to hosted API tools in real coding environments.
For tracking Qwen3.6 and other open-weight models across the FOSS ecosystem, aifoss.dev maintains a running directory of self-hosted AI tools and inference stacks.
Frequently Asked Questions
Can the RTX 5060 Ti 16GB run Qwen3.6-27B? Yes, with constraints. Q4_K_M at 16.8 GB exceeds the card’s 16 GB VRAM, so the runtime offloads a small amount to system RAM, reducing speed to approximately 31 tokens/second. Context windows are limited to roughly 65K tokens. It runs well for short-to-medium coding tasks; for large codebase analysis requiring 100K+ context, you need a 24 GB card.
What’s the difference between Qwen3.6-27B and Qwen3.6-35B-A3B? The 27B is dense — all 27 billion parameters are active on every token. The 35B-A3B is a Mixture-of-Experts model where only ~3 billion parameters activate per token, making it dramatically faster (98 tok/s vs 31 tok/s on an RTX 5060 Ti) and easier to fit on 16 GB VRAM. The 27B has a higher SWE-bench score (77.2) and is the better choice when coding quality is the priority and you have 24+ GB of VRAM.
Does Qwen3.6-27B work with Ollama?
Yes for text generation — ollama pull qwen3:27b works. Two caveats: set --num-ctx to at least 32768 (the default of 2048 is too small for real tasks), and vision/image input is broken in Ollama because the vision projector ships as a separate file. For vision capability, use llama.cpp or MLX-VLM directly.
How fast is Qwen3.6-27B on a Mac Mini M4 Pro? The 24 GB M4 Pro is marginal: macOS reserves ~3.5 GB at idle, leaving ~20.5 GB usable. Q4_K_M at 16.8 GB fits, but you need to close memory-hungry apps and context windows will be limited. Expect similar or slightly lower throughput than an RTX 3090 (~35–40 tok/s) due to MLX’s memory efficiency, but with the practical constraint of less usable headroom. The M4 Max at 36 GB+ is the comfortable Mac platform for this model.
Should I use thinking mode for coding tasks?
Disable it for most coding work. Thinking mode adds reasoning tokens before the response, which is valuable for hard algorithmic problems but adds unnecessary latency for routine completions and one-shot questions. Use /no_think in your prompt or --chat-template-kwargs '{"enable_thinking":false}' in llama.cpp as your default, and enable it selectively for complex debugging or math-heavy tasks.
Sources
- Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model — Qwen Blog
- Alibaba Qwen Team Releases Qwen3.6-27B — MarkTechPost
- unsloth/Qwen3.6-27B-GGUF file sizes — Hugging Face
- Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model — Hacker News
- Qwen 3.6 27B Dense on a 5060 Ti: Speculative Decoding and Why the MoE Still Wins — njannasch.dev
- An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context on One RTX 3090 — Medium
- I got 154 tok/s from a Single RTX 4090 running a 27B Model — Medium / Coding Nexus
- Qwen3.6 — How to Run Locally — Unsloth Documentation
- Qwen 3.6 27B VRAM & Hardware Requirements — Will It Run AI Blog
- Qwen 3.6 27B Model Hits 40 Tokens/s on RTX 3090 — Phemex News
- RTX 3090 Price Tracker May 2026 — Best Value GPU
Last updated May 29, 2026. GPU prices and used-market rates change weekly; verify current listings before purchasing.
Recommended Gear
- NVIDIA GeForce RTX 5060 Ti 16GB
- NVIDIA GeForce RTX 3090 24GB
- NVIDIA GeForce RTX 4090
- Apple Mac Mini M4 Pro
Was this article helpful?
Thanks for the feedback — it helps improve future articles.
Need hands-on help?
I offer 1-on-1 technical consulting for local AI setup, GPU selection, and AI coding tool configuration — same topics covered on this site.
Book a session — $49 / hour →