Qwen 3.7-Max for Local AI in 2026: What VRAM You'll Need When the Open Weights Drop

Tags: qwen, local-llm, vram, moe, hardware, gpu

TL;DR: Qwen 3.7-Max launched May 19 as a closed-weight API model scoring 80.4% on SWE-Verified — effectively tied with Claude Opus 4.6. Open weights (expected as a 27B dense and a 35B-A3B MoE variant) are anticipated mid-to-late June 2026 based on Alibaba’s 3–4 week release cadence. A 24 GB GPU handles both formats at Q4_K_M today using the Qwen 3.6 generation, and the 3.7 open weights will land in the same VRAM tier.

| | RTX 3090 24GB (used) | RTX 4090 24GB | Mac Studio M4 Max 96GB |
|---|---|---|---|
| Best for | 35B-A3B MoE or 27B dense at Q4 | Same + ~2x faster tok/s | Q6_K quality, silent operation |
| Price | ~$600–$800 used (Jun 2026) | ~$1,600 new | $3,999 |
| The catch | Tight VRAM at 128K+ context | Overkill for this tier | ARM ecosystem friction |

Honest take: Buy a used RTX 3090 now, run Qwen 3.6-27B today, and swap to 3.7 open weights the day they land — identical hardware requirement, one ollama pull away.

What Qwen 3.7-Max Actually Is

Alibaba announced Qwen 3.7-Max on May 19, 2026. It’s a proprietary Mixture-of-Experts model — Alibaba hasn’t disclosed the parameter count, but third-party analysis from Artificial Analysis estimates it in the same range as the Qwen 3.6-Max-Preview, approximately 1 trillion or more total parameters. Context window: 1 million tokens. API pricing on OpenRouter: $1.25/M input tokens, $3.75/M output tokens.

“Max” in Alibaba’s naming means closed-weights flagship. Every previous Qwen generation has followed the same pattern: API-only Max drops first, open-weight smaller variants come 3–4 weeks later. For home-lab purposes, that’s the entire story right now — you cannot run Max locally. You’re waiting on the open models.

What you can run today: Qwen 3.6-27B (dense) and Qwen 3.6-35B-A3B (MoE), both available on Hugging Face. These are the direct predecessors and share the hardware profile of whatever Qwen 3.7 open weights Alibaba releases.

Benchmark Context: Where 3.7-Max Sits

Qwen 3.7-Max scores 80.4% on SWE-Verified, essentially tied with Claude Opus 4.6 (80.8%) and DeepSeek V4 Pro Max (80.6%). On the harder SWE-Pro benchmark it reaches 60.6%, ahead of Kimi K2.6 Thinking (59.5%) and DeepSeek V4 Pro Max (59.0%). On Terminal Bench 2.0-Terminus it scores 69.7% vs DeepSeek V4 Pro Max at 67.9%.

That’s frontier-class coding performance. For home-lab planning, the relevant implication is that open-weight variants will trade some of that capability for runnability. Based on the Qwen 3.6 precedent, open-weight models typically land 8–15 percentage points below the Max flagship on agentic benchmarks. That is still competitive for offline coding, document work, and personal agents.

Worth noting: Qwen 3.7-Plus (the mid-tier API model) matches 3.7-Max on AIME mathematics benchmarks while running 3x faster at lower cost. The open weights will be more Plus-tier than Max-tier in real-world quality — useful context before building hardware expectations around the 80.4% SWE number.

The Open-Weight Timeline

Alibaba has been consistent enough that you can set a rough calendar:

| Release | API to Open Weights | Gap |
|---|---|---|
| Qwen 3.5 | API Jan 2026 → Open Feb 2026 | ~3 weeks |
| Qwen 3.6 | API early Apr 2026 → Open Apr 16, 2026 | ~3.5 weeks |
| Qwen 3.7 | API May 19 → Estimated mid-Jun 2026 | ~3–4 weeks |

The QwenLM GitHub is the leading indicator — the community watches it for new repository pushes before any blog announcement. No open-weight Qwen 3.7 repository existed as of June 7, 2026, but the window is open.
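The calendar estimate is plain date arithmetic. As a minimal sketch, assuming the 3–4 week gap holds and using only the May 19 launch date given here (the function name is illustrative):

```python
from datetime import date, timedelta

API_LAUNCH = date(2026, 5, 19)  # Qwen 3.7-Max API launch date from the article

def open_weight_window(launch, min_weeks=3, max_weeks=4):
    """Return the (earliest, latest) dates for a gap of min_weeks..max_weeks."""
    return launch + timedelta(weeks=min_weeks), launch + timedelta(weeks=max_weeks)

earliest, latest = open_weight_window(API_LAUNCH)
print(f"Expected window: {earliest.isoformat()} to {latest.isoformat()}")
```

A strict 3–4 week gap points at the second and third weeks of June. Any slip past the upper bound pushes the release into late June, which is why the estimate is stated as a range.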

Expected open-weight variants, extrapolated from the Qwen 3.6 release structure:

  • Qwen3.7-27B — dense model, same architecture class as Qwen3.6-27B
  • Qwen3.7-35B-A3B — MoE, 35B total parameters / 3B active parameters per token

A 72B dense variant is not expected in the first wave. Alibaba has consistently shipped the 27B dense and 35B MoE as the home-lab-viable tier before releasing larger sizes.

Hardware Reality: VRAM Requirements

Since the 3.7 open weights aren’t available yet, Qwen 3.6 is the most accurate proxy. Both generations are expected to share the same architectural lineage (hybrid attention, MoE routing structure), and Alibaba has not announced any architectural change that would significantly shift the VRAM footprint between 3.6 and 3.7 at equivalent sizes.

These are measured VRAM numbers from community llama.cpp GGUF testing:

Qwen3.6-27B Dense — Proxy for Qwen3.7-27B

| Quantization | VRAM Required | Min GPU |
|---|---|---|
| Q4_K_M | ~16.8 GB | RTX 3090/4090 24GB |
| Q5_K_M | ~19.5 GB | 24 GB required |
| Q6_K | ~22.5 GB | 24 GB (tight on 3090) |
| Q8_0 | ~28.6 GB | Dual 16 GB or Mac unified |

Qwen3.6-35B-A3B MoE — Proxy for Qwen3.7-35B-A3B

| Quantization | VRAM Required | Min GPU |
|---|---|---|
| Q4_K_M | ~21 GB | RTX 3090/4090 24GB |
| Q5_K_M | ~24.5 GB | Dual-GPU or unified memory (exceeds a single 24 GB card) |
| Q8_0 | ~43 GB | Dual RTX 3090 or workstation card |
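To read the two tables against a specific card, it can help to compute the leftover VRAM directly. The sketch below copies the figures above into a dictionary. The 1 GB minimum headroom is an assumption chosen for illustration, not a measured requirement.

```python
# VRAM figures (GB) copied from the two tables above; Q6_K is listed only for the 27B.
VRAM_GB = {
    "27B dense": {"Q4_K_M": 16.8, "Q5_K_M": 19.5, "Q6_K": 22.5, "Q8_0": 28.6},
    "35B-A3B MoE": {"Q4_K_M": 21.0, "Q5_K_M": 24.5, "Q8_0": 43.0},
}

def headroom_gb(model, quant, gpu_gb):
    """GB left for KV cache and buffers; negative means the weights do not fit."""
    return round(gpu_gb - VRAM_GB[model][quant], 1)

def fitting_quants(model, gpu_gb, min_headroom_gb=1.0):
    """Quantizations that leave at least min_headroom_gb (assumed value) free."""
    return [q for q in VRAM_GB[model]
            if headroom_gb(model, q, gpu_gb) >= min_headroom_gb]

for model in VRAM_GB:
    print(model, "on 24 GB:", fitting_quants(model, 24))
```

On a 24 GB card this leaves three usable quantizations for the dense model and only Q4_K_M for the MoE, which matches the "Min GPU" columns.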

For a breakdown of what quality degradation actually looks like at each quantization level, see the Q4 vs Q5 vs Q6 vs Q8 quality comparison.

The 8 GB Wall

At 8 GB VRAM — RTX 4060, RTX 4060 Ti 8GB, or the base RTX 5060 8GB — neither the 27B dense nor the 35B-A3B MoE fits at any practical quantization. Q3_K_M on the 27B requires ~14 GB, still over budget; the smallest model that works at 8 GB would need to be a sub-9B size class. If Alibaba follows the Qwen 3.5 pattern and ships a full family (from 0.6B to 72B), there may be 7B or 9B variants. But those won’t carry the reasoning improvements that make 3.7-Max interesting.
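A rough way to see why only the sub-9B class fits is to estimate weight size as parameters × bits per weight ÷ 8. The bits-per-weight values below are approximate averages assumed for illustration (real GGUF files vary by model), and the 1.5 GB reserve for KV cache and buffers is likewise an assumption.

```python
# Approximate bits per weight for common llama.cpp quantizations.
# Assumed averages for illustration; actual GGUF files vary by model.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def weight_size_gb(params_billion, quant):
    """Estimate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8

def fits(params_billion, quant, gpu_gb, reserve_gb=1.5):
    """True if estimated weights plus an assumed reserve fit on the card."""
    return weight_size_gb(params_billion, quant) + reserve_gb <= gpu_gb

for size in (7, 9, 27):
    est = weight_size_gb(size, "Q4_K_M")
    print(f"{size}B at Q4_K_M: ~{est:.1f} GB, fits in 8 GB: {fits(size, 'Q4_K_M', 8)}")
```

The same estimate applies to the MoE using its total parameter count, since all 35B weights must be resident even though only 3B are active per token.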

On why the 16GB vs 8GB split matters so much for this generation, see the RTX 5060 Ti 8GB vs 16GB comparison.

Real Token Speed Numbers (Qwen 3.6 Proxy)

These are community benchmark results using llama.cpp GGUFs on consumer hardware — the best predictor of what Qwen 3.7 open weights will deliver at the same VRAM tier:

| GPU | Model | Quant | tok/s |
|---|---|---|---|
| RTX 3090 24GB | Qwen3.6-35B-A3B | Q4_K_M | 55–65 |
| RTX 3090 24GB | Qwen3.6-27B | Q4_K_M | ~35 baseline / ~74 with DFlash |
| RTX 4090 24GB | Qwen3.6-35B-A3B | Q4_K_M | ~122 |
| Mac Studio M4 Max 96GB | Qwen3.6-27B | Q4_K_M | ~16.6 |

The MoE architecture advantage matters here. Qwen3.6-35B-A3B activates only 3B parameters per forward pass despite having 35B total parameters, so per-token compute is closer to that of a 3B dense model. On an RTX 3090 you get ~60 tok/s on the MoE vs ~35 tok/s on the 27B dense, and both fit the same 24 GB card (~21 GB vs ~16.8 GB at Q4_K_M).
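To make "only some parameters are active per token" concrete, here is a toy top-k routing sketch. It is not Qwen’s router: the expert count (8), the top-k value (2), and the scalar "experts" are all hypothetical, chosen only to show that the unselected experts never run.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_scores, experts, top_k=2):
    """Route one token: run only the top_k experts and mix their outputs.

    x             -- token activation (a single float, for simplicity)
    router_scores -- one score per expert, as a router layer would produce
    experts       -- list of callables, one per expert
    Returns (output, indices of the experts that actually ran).
    """
    ranked = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])  # renormalise over chosen experts
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen

# Hypothetical setup: 8 tiny "experts", each just a scaling function.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
scores = [random.uniform(-1, 1) for _ in experts]
out, used = moe_forward(1.0, scores, experts)
print(f"experts run: {len(used)} of {len(experts)}")
```

All eight experts still have to sit in memory, which is why an MoE saves compute per token but not VRAM.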

The RTX 4090 gap is real: ~122 tok/s vs ~60 tok/s on the 3090 for the 35B-A3B MoE. Whether that 2x throughput is worth $800+ extra depends entirely on your use case. For interactive coding agents, 60 tok/s is already fast enough to stay out of your way. For batch jobs processing dozens of long documents, the 4090 starts earning its premium.
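One way to judge whether the throughput gap matters for a given workload is to convert tok/s into wall-clock time. This sketch uses the benchmark figures above, taking ~60 tok/s as the midpoint of the 3090 MoE range. It covers generation time only and ignores prompt processing.

```python
# Throughput figures (tokens/s) from the benchmark table; midpoint where a range is given.
TOK_PER_S = {
    ("RTX 3090", "35B-A3B"): 60,
    ("RTX 3090", "27B"): 35,
    ("RTX 4090", "35B-A3B"): 122,
}

def seconds_for(tokens, tok_per_s):
    """Generation time only; prompt processing is ignored."""
    return tokens / tok_per_s

def speedup(fast, slow):
    """Throughput ratio between two (gpu, model) entries."""
    return TOK_PER_S[fast] / TOK_PER_S[slow]

for (gpu, model), rate in TOK_PER_S.items():
    print(f"{gpu} {model}: 1,000 tokens in ~{seconds_for(1000, rate):.0f} s")

ratio = speedup(("RTX 4090", "35B-A3B"), ("RTX 3090", "35B-A3B"))
print(f"4090 vs 3090 on the MoE: {ratio:.2f}x")
```

For a single interactive reply the difference is a few seconds. It compounds only when many long generations run back to back.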

Context length effects: At 32K context the VRAM figures hold. Push to 128K and KV cache adds 4–8 GB. Push to the full 1M-token context window (which 3.7-Max supports in the API) and you’re offloading KV cache to system RAM — expect single-digit tok/s regardless of GPU. For home-lab use (coding assistant, document Q&A, local agent), 32K context covers the overwhelming majority of workloads.
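KV cache growth is linear in context length, which a short calculation makes visible. The layer count, KV head count, and head dimension below are hypothetical placeholders, not published Qwen values. They were picked only so the 128K result lands at the upper end of the 4–8 GB range quoted above.

```python
def kv_cache_gib(tokens, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GiB: keys and values (x2) for every caching layer and token."""
    total_bytes = 2 * attn_layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30

# Hypothetical configuration, NOT published Qwen values:
# 16 layers that keep a full KV cache, 4 KV heads, head dimension 256, fp16 values.
cfg = dict(attn_layers=16, kv_heads=4, head_dim=256)

for ctx in (32 * 1024, 128 * 1024, 1024 * 1024):
    print(f"{ctx // 1024}K tokens: ~{kv_cache_gib(ctx, **cfg):.1f} GiB")
```

Under these assumptions the cache grows 4x from 32K to 128K and another 8x to 1M tokens, far beyond what any 24 GB card can hold next to the weights.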

Running Qwen 3.6 Now: Your Bridge to 3.7

While the open weights are pending, Qwen 3.6-27B is available on Ollama and runs on the same hardware you’ll use for 3.7. A basic setup:

ollama pull qwen3.6:27b-q4_K_M
ollama run qwen3.6:27b-q4_K_M

Expected on first load on a 24 GB GPU (16.8 GB VRAM allocated, ~7 GB headroom):

pulling manifest ✓
pulling qwen3.6:27b-q4_K_M... ████████████████ 100%
>>> Send a message

If you hit an out-of-memory error:

error: CUDA out of memory. Tried to allocate 2.50 GiB

Two fixes: drop to qwen3.6:27b-q4_0 (~15.5 GB, slightly lower quality), or cap context in your Modelfile:

PARAMETER num_ctx 4096

The VRAM squeeze happens at 64K+ context lengths where KV cache starts competing with model weights for the ~7 GB of headroom on a 24 GB card. For standard chat sessions, the default context is fine.
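The context cap can also be set per request instead of in a Modelfile. The sketch below builds a request for Ollama’s local `/api/generate` endpoint with `num_ctx` in `options`. The model tag is the one used in this article and is assumed to exist in your local library; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="qwen3.6:27b-q4_K_M", num_ctx=4096):
    """Build a non-streaming generate request with a capped context window."""
    return {
        "model": model,          # tag taken from the article; adjust to what you pulled
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }

def generate(prompt, **kwargs):
    """Send the request to a running Ollama server and return the response text."""
    data = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example call (requires a running Ollama server with the model pulled):
# print(generate("Write a one-line docstring for a binary search function."))
```

Setting the cap per request lets a chat session stay at a small context while an occasional long-document job asks for more, without maintaining two Modelfiles.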

For a full local coding setup with this model, see the Continue.dev + Ollama stack guide.

The Cost Math: API Now vs Open Weights Later

Qwen 3.7-Max API pricing ($1.25/M input, $3.75/M output via OpenRouter):

| Daily usage | Monthly API cost (30 days) |
|---|---|
| Light: 100K input / 20K output | ~$6.00/month |
| Moderate: 500K input / 100K output | ~$30.00/month |
| Heavy agentic: 2M input / 400K output | ~$120.00/month |

Light-to-moderate API usage is inexpensive enough that hardware isn’t justified for the API model alone. But the decision isn’t just cost — it’s also quality. The API Max version is materially smarter than the open-weight variants will be. You’re not choosing where to run the same thing; you’re choosing between different capability tiers at different price points.
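To plug in your own volume, the monthly figure is daily tokens times the per-million price, times days per month. A minimal sketch using the OpenRouter prices quoted above and an assumed 30-day month:

```python
INPUT_PER_M = 1.25   # USD per million input tokens (OpenRouter price quoted above)
OUTPUT_PER_M = 3.75  # USD per million output tokens

def monthly_cost(daily_input_tokens, daily_output_tokens, days=30):
    """Monthly API cost in USD for a fixed daily token volume (30-day month assumed)."""
    daily = (daily_input_tokens / 1e6) * INPUT_PER_M \
          + (daily_output_tokens / 1e6) * OUTPUT_PER_M
    return round(daily * days, 2)

profiles = {
    "light": (100_000, 20_000),
    "moderate": (500_000, 100_000),
    "heavy agentic": (2_000_000, 400_000),
}
for name, (inp, out) in profiles.items():
    print(f"{name}: ${monthly_cost(inp, out):.2f}/month")
```

Output tokens cost three times as much as input tokens here, so agentic workloads that generate a lot of code move the total faster than long prompts do.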

Three-way decision framework:

Use the API if your volume is moderate (roughly $30/month or less) and you need Max-class performance for agentic coding or complex reasoning tasks.

Use RunPod if you want heavy experimentation before committing to hardware. An RTX 3090 on RunPod runs ~$0.44/hour. Two hours/day costs $26.40/month with zero upfront capital and you can spin up a 4090 when you need the extra throughput.

Buy a used RTX 3090 if you’re already running other models locally — the marginal cost to add Qwen 3.7 open weights is $0, and you get Qwen 3.6-27B at quality the API light-tier users are paying for anyway. Full breakeven math in the RunPod vs local GPU guide.
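A rough rent-versus-buy comparison follows from the RunPod rate above. This sketch assumes a $700 card (the midpoint of the $600–$800 used range) and ignores electricity, resale value, and the rest of the PC, so treat the result as a lower bound on the real breakeven time.

```python
RUNPOD_USD_PER_HOUR = 0.44   # RTX 3090 rate quoted above
USED_3090_USD = 700          # assumed midpoint of the $600-$800 used price range

def runpod_monthly(hours_per_day, days=30):
    """Monthly rental cost in USD for a fixed number of hours per day."""
    return round(RUNPOD_USD_PER_HOUR * hours_per_day * days, 2)

def breakeven_months(hours_per_day, card_price=USED_3090_USD):
    """Months of rental that cost as much as the card (electricity and resale ignored)."""
    return card_price / runpod_monthly(hours_per_day)

for hours in (2, 4, 8):
    print(f"{hours} h/day: ${runpod_monthly(hours):.2f}/month, "
          f"breakeven after ~{breakeven_months(hours):.1f} months")
```

At two hours a day, renting stays cheaper for roughly two years, while heavy daily use shortens the payback to well under a year.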

How Qwen 3.7 Open Weights Fit the 24 GB Landscape

For context on where the expected 3.7 open weights land against current competition at the 24 GB GPU tier:

| Model | VRAM at Q4_K_M | ~tok/s (RTX 3090) | Status |
|---|---|---|---|
| Qwen3.7-27B (expected) | ~17 GB | ~35–40 | Not yet released |
| Qwen3.7-35B-A3B (expected) | ~21 GB | ~55–65 | Not yet released |
| Qwen3.6-27B | 16.8 GB | ~35 | Available now |
| Qwen3.6-35B-A3B | ~21 GB | ~60 | Available now |
| DeepSeek V4 Pro distill 28B | ~18 GB | ~45 | Available now |

The Qwen3.7 open weights will likely be incremental improvements over 3.6 at similar hardware cost — worth waiting for if you’re buying hardware specifically for this generation, not worth waiting for if you already have a 24 GB GPU and want to start now. For a deeper comparison on how Qwen and DeepSeek compete at this tier, see the DeepSeek V4 vs Qwen3 guide.

For coding-focused local AI specifically, the aicoderscope.com guide on local AI coding stacks covers how these models perform in real IDE workflows.

FAQ

Can I run Qwen 3.7-Max itself locally? No. The Max variant is closed-weight, API-only. You’re waiting on the open-weight variants (expected Qwen3.7-27B and Qwen3.7-35B-A3B), which had no announced release date as of June 7, 2026, but are expected mid-to-late June based on Alibaba’s 3–4 week cadence from the May 19 API launch.

What GPU do I need for the 35B-A3B MoE variant? 24 GB VRAM at Q4_K_M: an RTX 3090 or RTX 4090. The model weights take ~21 GB, leaving ~3 GB for KV cache at default context. For 64K+ context sessions you need more headroom, and the RTX 4090 does not provide it: it has the same 24 GB as the 3090, so the extra money buys throughput, not context margin.

Is the 27B dense or 35B-A3B MoE better for local use? Depends on the workload. The 27B dense has higher quality per token because every token passes through all 27B parameters, while the MoE routes each token through only about 3B active parameters. The 35B-A3B runs ~1.7x faster on the same GPU and still fits a 24 GB card (~21 GB vs ~16.8 GB at Q4_K_M). For interactive chat and coding assistance, the MoE speed win is noticeable. For batch processing where quality matters more than latency, the dense model is the better pick.

Will 16 GB VRAM work? Not for the 27B dense at Q4_K_M: the weights alone (~16.8 GB) exceed a 16 GB card before any KV cache is allocated. Q3_K_M on the 27B needs ~13.5 GB and technically fits, but quality degradation at Q3 is significant on a reasoning-focused model. The safe minimum is 24 GB.

Is now the right time to buy hardware for this? If you’re already running other models (Qwen 3.6, DeepSeek distills, Llama 3.3 70B splits), a used RTX 3090 is the right tool and the price point hasn’t changed. If you’re buying specifically for Qwen 3.7, it’s rational to wait 2–3 weeks for the open weights to confirm they deliver expected quality before pulling the trigger on hardware. See the GPU buying guide for current used market pricing.

What if no 27B/35B open weights land — only a 72B? A 72B dense needs ~42 GB at Q4_K_M, which spills over the 24 GB GPU tier into dual-card or unified memory territory. Possible but not the expected scenario based on how Alibaba has structured every recent Qwen release. The 27B/35B range has been the consistent home-lab tier across 3.5 and 3.6 both.

Sources

Last updated June 7, 2026. Prices and specs change; verify current rates before purchasing.

  • RTX 3090 24GB — the value sweet spot for 27B dense and 35B MoE at Q4_K_M
  • RTX 4090 24GB — same VRAM tier, ~2x throughput for high-volume use
