DeepSeek V4 vs Qwen3 for Local AI in 2026: Which Model Family Fits Your GPU?

local-llmdeepseekqwen3gpuvrammodel-comparisonhardware-guide

TL;DR: DeepSeek V4 Flash and Qwen3 both landed in late April 2026 and rewrote the open-weights leaderboards. But for home-lab inference, they serve completely different audiences: Qwen3’s MoE variants run on a single consumer GPU and hit 120 tok/s on an RTX 3090, while V4 Flash’s lightest usable quantization requires 103 GB of VRAM — more than four RTX 4090s combined. The model family that “fits your GPU” is almost certainly Qwen3.

Qwen3 small/mid (8B–32B)Qwen3 MoE (30B-A3B / 35B-A3B)DeepSeek V4 Flash (284B)
Best forBudget single-GPU buildsConsumer GPU sweet spotMulti-GPU server or API
Min VRAM5 GB (8B Q4)~17 GB (Q4)~103 GB (Q2_K)
Single RTX 3090 speed23–60 tok/s (14B–32B)~120 tok/s (Q3, Unsloth)Not viable
The catchSmaller models, less reasoning depthNeeds 24 GB for Q4 headroomNeeds $10k+ hardware to run well

Honest take: For 99% of home-lab builders, Qwen3 30B-A3B or 35B-A3B is the answer regardless of budget. DeepSeek V4 Flash is an excellent API model — use it that way, at $0.10/M input tokens, rather than trying to run it locally unless you have a purpose-built multi-GPU server.

What shipped and when

DeepSeek-V4 launched April 24, 2026, in two variants:

  • V4 Flash: 284B total parameters, 13B activated per token (MoE)
  • V4 Pro: 1.6T total parameters, 49B activated per token (MoE)

Alibaba’s Qwen team shipped the Qwen3 family days later, spanning 0.6B through 235B, including two MoE variants designed explicitly for consumer hardware: the 30B-A3B (3B active/token) and the newer 35B-A3B from the Qwen3.6 update.

Both families use Mixture-of-Experts architectures that activate only a fraction of parameters per token, which makes them faster to serve than dense models of equivalent total size. The key difference is how aggressively each family scaled the MoE — and who they optimized for.

The VRAM numbers, tier by tier

Consumer single-GPU: 8–24 GB VRAM

Qwen3 owns this tier entirely.

ModelQuantizationVRAM requiredMinimum GPU
Qwen3 8BQ4_K_M~5 GBRTX 4060 8GB
Qwen3 14BQ4_K_M~8 GBRTX 3060 12GB
Qwen3 30B-A3BQ4_K_M~16.8 GBRTX 3090 24GB
Qwen3.6-35B-A3BQ3 (Unsloth)~23 GBRTX 3090 24GB
Qwen3 32BQ4_K_M~20 GBRTX 3090 (tight)

DeepSeek V4 Flash doesn’t appear in this table because it cannot run in this tier. Its IQ1_S-XL quantization — the most aggressive lossy compression available — compresses the 284B model from its FP8 footprint down to 57.3 GB. That’s still more than two RTX 4090 GPUs combined. The first viable quantization for practical use is Q2_K at 103 GB.

If you have a single GPU with 24 GB or less, V4 Flash is a non-starter. Any attempt at local inference involves offloading almost everything to system RAM, with generation speeds under 2 tok/s — impractical for interactive use.

Prosumer multi-GPU: 2–4× consumer cards

Two RTX 4090s in PCIe tensor-parallel gives 48 GB of pooled VRAM. That’s still 55 GB short of V4 Flash’s Q2_K threshold.

The workaround is mixed VRAM/CPU offloading: load the attention layers in VRAM, offload the MoE expert weights to system RAM using llama.cpp’s -cmoe -ub 128 flags. On a 4× RTX 4090 rig (96 GB total VRAM, plus 128 GB system RAM), community tests put V4 Flash throughput at 8–12 tok/s. Usable for batch processing; painful for interactive chat.

Qwen3 235B-A22B at Q4 needs approximately 132 GB, putting it into similar territory — viable with partial offloading but not fast.

At the 2–4× consumer GPU tier, the better answer remains Qwen3’s MoE variants running on a single card, with spare GPUs handling separate workloads or parallel inference instances.

Professional multi-GPU: dedicated server hardware

This is where V4 Flash becomes genuinely useful. Running on dual NVIDIA RTX PRO 6000 Max-Q GPUs with W4A16+FP8 quantization and MTP self-speculation, V4 Flash hits approximately 111 tok/s for single-stream requests at 128K context. That figure comes from NVIDIA’s own testing and represents the current high-water mark for affordable (relative to H100 clusters) V4 Flash inference.

On a single H100 80GB with GPU offloading enabled, throughput is around 20 tok/s — better, but H100s cost $25,000+ used. The official V4 Flash API, measured by Artificial Analysis, runs at approximately 84 tok/s.

Qwen3 235B-A22B needs a minimum of 4× H100 80GB GPUs (320 GB total) to run Q4 with practical context and KV cache headroom — an eight-figure hardware investment for most people. The 30B-A3B and 35B-A3B MoE variants, by contrast, run on hardware anyone reading this article can actually buy.

Speed on consumer hardware: what you’ll actually see

These figures are from community benchmarks tested in April–May 2026 using llama.cpp and Ollama.

GPUModelQuantSpeed
RTX 3060 12GBQwen3 8BQ4_K_M~42 tok/s
RTX 3060 12GBQwen3 14BQ4_K_M~23–29 tok/s
RTX 4070Qwen3 14BQ4_K_M~60 tok/s
RTX 3090 24GBQwen3.6-35B-A3BQ3 (Unsloth)~120 tok/s
RTX 3090 24GBDeepSeek V4 FlashAnyNot viable (103 GB min)
96 GB pooled + CPU offloadDeepSeek V4 FlashQ2_K8–12 tok/s
Dual RTX PRO 6000DeepSeek V4 FlashW4A16+FP8~111 tok/s
H100 80GB (single)DeepSeek V4 FlashFP8 + offload~20 tok/s

The Qwen3.6-35B-A3B number deserves attention: 120 tok/s on a single RTX 3090 with Unsloth’s Q3 GGUF, at 23 GB VRAM. That’s faster than real-time reading speed, from a 35B-parameter model, on a GPU that costs under $600 used. The model achieves this via its extreme MoE sparsity: 35B total parameters but only 3B active per token, so the compute per token is closer to a 3B dense model while retaining the quality of a much larger network.

For the original Qwen3 30B-A3B (slightly smaller than the 35B update), the Q4_K_M quantization fits in approximately 16.8 GB — giving comfortable headroom on a 24 GB card for KV cache. The full setup guide for Qwen3 30B-A3B walks through the Ollama and llama.cpp install paths.

Benchmark quality: what you’re trading away

Speed means nothing if the model can’t reason. Here’s where the two families stand on standard benchmarks as of May 2026.

DeepSeek V4 Flash (at or near full precision):

  • AIME 2025: 99.4%
  • MMLU-Pro: 92.8%
  • BenchLM vs Qwen3 235B: 71 vs 33 (provisional)
  • Context window: 1,000,000 tokens

Qwen3 235B-A22B (frontier variant):

  • Context window: 128K tokens
  • Competitive with frontier proprietary models on most tasks
  • Requires 4+ H100s to run

Qwen3.6-35B-A3B (consumer GPU tier):

  • GPQA: 86.0%
  • AIME 2026: 92.7%
  • Coding: competitive with much larger dense models
  • Fits on a single RTX 3090 at Q3

The takeaway: at full or near-full precision, V4 Flash is clearly ahead of anything that runs on consumer hardware. But V4 Flash at Q2_K (the minimum viable local quant) suffers measurable quality loss — community testing suggests 1–3% degradation moving from FP8 to aggressive quantization, and more on harder reasoning chains. The practical question isn’t “which model family is better at ideal conditions?” but “which model family am I actually able to run well?”

For more on how quantization affects reasoning quality at each level, see the Q4 vs Q8 quality loss analysis.

The 1M-token context window: useful or marketing?

V4 Flash supports a 1M-token context window vs Qwen3’s 128K cap. That sounds like a decisive advantage for long-document work.

In practice, on consumer or even prosumer local hardware, 1M context is theoretical. Processing 1M tokens at 8–12 tok/s (the realistic local speed for V4 Flash with partial offloading) takes hours — and that’s just prefill, before any generation. A 128K context at 120 tok/s on a single RTX 3090 with Qwen3 is more practical for the use cases most home-lab users actually have.

If you genuinely need 1M context at usable speed, the right tool is the DeepSeek V4 Flash API, not a local server.

Framework choices matter

For Qwen3 on a single consumer GPU, either Ollama or llama.cpp works well. Ollama is easier to set up and adds an OpenAI-compatible REST API with almost no configuration. llama.cpp direct (llama-server) is 3–10% faster in single-user inference. The vLLM vs Ollama guide covers the tradeoffs in detail.

For V4 Flash on multi-GPU setups, Ollama loses MoE routing efficiency — use vLLM or SGLang instead. vLLM’s official recipe for DeepSeek-V4-Flash is on their site and handles tensor parallel correctly across multiple GPUs. Ollama is fine for quick testing but not production V4 Flash deployment.

For AI coding tool integration — routing local models into editors like Cursor or Windsurf — see the coverage on aicoderscope.com which tracks the coding assistant ecosystem specifically.

The API alternative

If you’re drawn to V4 Flash’s benchmark numbers but don’t have the hardware to run it, the API is a legitimate option.

DeepSeek V4 Flash API pricing: $0.10/M input, $0.20/M output tokens.
Qwen3.6-35B-A3B API pricing: $0.15/M input, $1.00/M output tokens.

For most workloads, V4 Flash’s API is cheaper, faster, and has better benchmarks than self-hosting it on anything under $10,000 of hardware. The RunPod vs local GPU guide covers the break-even math in detail — including when local inference actually saves money.

If you want to experiment with V4 Flash on rented GPU hardware before committing to a multi-GPU build, RunPod has H100 and H200 instances available by the hour. An H100 80GB at FP8 gets you ~20 tok/s with V4 Flash — enough to evaluate whether the model quality justifies the local hardware investment.

Decision guide: which model to actually run

8–12 GB VRAM (RTX 3060, RTX 4060 8GB): Qwen3 8B at Q4_K_M. Strong enough for coding assistance and everyday tasks. DeepSeek V4 of any variant is not possible. See the GPU buying guide if you’re still choosing hardware.

12 GB VRAM (RTX 3060 12GB, RTX 4070 12GB): Qwen3 14B at Q4_K_M. Noticeably stronger reasoning than 8B at minimal hardware cost increase. RTX 3060 delivers ~23–29 tok/s; RTX 4070 around 60 tok/s.

16–20 GB VRAM: Qwen3 30B-A3B at Q4. The quality-per-dollar peak — 30B total parameters, only 3B active per token, and it fits in a mid-range VRAM budget.

24 GB VRAM (RTX 3090, RTX 4090): Qwen3.6-35B-A3B at Q3 using Unsloth’s GGUF (23 GB, ~120 tok/s) or Qwen3 32B at Q4_K_M (~20 GB, slower but denser). Either outperforms any DeepSeek option at this VRAM level. The Qwen3.6-27B inference guide covers related setup if you want the 27B variant instead.

2–4× RTX 4090 (48–96 GB total): Still Qwen3 for interactive work. If you specifically need V4 Flash, it runs with partial CPU offloading at 8–12 tok/s — acceptable for batch jobs, not conversational AI. Running separate Qwen3 instances per GPU often serves more users at better latency.

Dedicated server with 192+ GB VRAM (dual RTX PRO 6000 or similar): DeepSeek V4 Flash becomes competitive. At this tier it reaches ~111 tok/s single-stream, which is faster than the public API and justifies the hardware investment if you’re running continuous workloads.

No local hardware, just want frontier quality: V4 Flash API at $0.10/M input. Cheapest path to the top of the benchmark chart.

Common setup errors

Running V4 Flash with standard llama.cpp builds will fail — the DeepSeek V4 architecture (called deepseek4 internally) requires llama.cpp built from PR #22378 or later. Attempting to load a V4 Flash GGUF with an older build produces a model type error at load time, not at inference time, so the error message is usually clear.

For Qwen3 MoE models, the common issue is context length vs available VRAM. Loading the 35B-A3B at Q3 (23 GB) on a 24 GB card leaves only ~1 GB for KV cache. At default context settings (4096–8192 tokens), this is fine. If you bump context to 32K or higher, the KV cache can overflow. Set --ctx-size 4096 when starting llama-server until you’ve verified your specific setup handles larger contexts cleanly.

FAQ

Can I run DeepSeek V4 Flash on a single RTX 4090?
No. At Q2_K (the lightest quantization with reasonable quality), V4 Flash needs 103 GB of VRAM. An RTX 4090 has 24 GB. Even with maximum CPU offloading you’d see under 2 tok/s — technically running but not useful.

How does Qwen3’s 30B-A3B compare to Qwen3 32B?
The 30B-A3B is faster at interactive speeds (120+ tok/s vs ~35 tok/s on RTX 3090) because it activates only 3B parameters per token. The 32B dense model edges ahead on some hard reasoning benchmarks. For coding and everyday tasks, most users won’t notice the quality difference but will notice the 3–4× speed difference.

Is DeepSeek V4 Flash multimodal?
Yes — V4 supports image input alongside text. Multimodal inference needs additional VRAM overhead on top of the base model requirement, making the hardware bar even higher.

What about DeepSeek’s smaller distilled models?
DeepSeek R1 distills (7B, 14B, 32B based on Qwen2.5/Llama architectures) run fine on consumer hardware and are covered in the DeepSeek R1 distilled guide. V4-specific distills at consumer sizes were not widely available at time of writing.

Which inference backend should I use for Qwen3 MoE?
llama.cpp gives 3–10% better single-user speed. Ollama is simpler and adds a REST API automatically. For multi-user or production serving, vLLM handles MoE routing correctly and scales better under concurrent load.

Sources

Last updated June 6, 2026. Prices and hardware availability change; verify before purchasing.

Was this article helpful?