May 26, 2026

DeepSeek R1 Distilled Models for Local AI: Which Version Fits Your GPU (2026)

By RunAIHome Team · 13 min read

Tags: deepseek-r1, local-ai, llm, vram, inference, ollama, gpu, buying-guide

DeepSeek R1 is a reasoning model: it “thinks out loud” before answering, producing measurably better results on math, coding, and logic than instruction-tuned chat models of similar size. The full 671B version is server territory. The six distilled variants are not. They range from 1.1GB to 43GB on disk and cover every consumer GPU tier from an 8GB card up to dual RTX 3090s. Even the 7B distill surpasses QwQ-32B-Preview on AIME 2024 math benchmarks: a model more than four times its size, outrun by something that fits in 8GB of VRAM.

The decision you’re making isn’t “should I run DeepSeek R1 locally.” It’s “which distilled size won’t bottleneck my hardware.”

Why distilled models exist (and why 671B isn’t on this list)

The full DeepSeek-R1 is a Mixture-of-Experts (MoE) model with 671 billion total parameters. At Q4_K_M quantization, the GGUF weights come in at roughly 404GB — you need either a multi-GPU server with 512GB of VRAM or one of the extreme quantizations (Unsloth’s IQ1_S at 131GB, for comparison). That’s not a home lab setup.
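As a sanity check on those sizes, the file size and parameter count together imply an average number of bits stored per weight. The sketch below assumes 1 GB means 10^9 bytes and ignores file metadata; the function names are illustrative, not part of any library.

```python
def implied_bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, treating 1 GB as 10**9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (params_billion * 1e9)


def estimated_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GB for a given average bit width."""
    return params_billion * bits_per_weight / 8


full_r1_bpw = implied_bits_per_weight(404, 671)
iq1_bpw = implied_bits_per_weight(131, 671)
print(f"Q4_K_M at 404 GB: {full_r1_bpw:.2f} bits/weight")  # 4.82
print(f"IQ1_S at 131 GB:  {iq1_bpw:.2f} bits/weight")      # 1.56
```

The roughly 4.8 bits per weight is why a "4-bit" quantization lands a little above half a byte per parameter, and the same ratio gives a quick estimate for any dense model size.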

Instead, DeepSeek trained six smaller dense models by fine-tuning existing open-source checkpoints on 800,000 samples of R1’s reasoning chains. The base models are Qwen2.5 (for the 1.5B, 7B, 14B, and 32B variants) and Llama 3 (for the 8B and 70B). The distillation process transfers R1’s chain-of-thought reasoning style into a much smaller architecture — the 7B distill still hits 55.5% on AIME 2024 math problems, a score that would have been considered strong frontier-model performance a year ago.

VRAM and disk requirements at a glance

All sizes shown at Q4_K_M quantization, which is the Ollama default and the recommended starting point for home inference. VRAM figures include headroom for KV cache at normal context lengths.

Model	Ollama tag	Download size	Min VRAM	Comfortable VRAM
R1 Distill Qwen 1.5B	`deepseek-r1:1.5b`	1.1 GB	4 GB	6 GB
R1 Distill Qwen 7B	`deepseek-r1:7b`	4.7 GB	6 GB	8 GB
R1 Distill Llama 8B	`deepseek-r1:8b`	5.2 GB	6 GB	8 GB
R1 Distill Qwen 14B	`deepseek-r1:14b`	9.0 GB	12 GB	16 GB
R1 Distill Qwen 32B	`deepseek-r1:32b`	20 GB	24 GB	24 GB
R1 Distill Llama 70B	`deepseek-r1:70b`	43 GB	48 GB	48 GB

The 32B fits in a 24GB card (RTX 3090 or RTX 4090) with about 4GB left over for the KV cache. The 70B does not fit in any single consumer GPU: it needs dual 24GB cards or Apple Silicon with ≥64GB unified memory.
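The table lookup can be sketched in a few lines. This uses only the download size and comfortable-VRAM columns from the table above; the helper names are hypothetical, and the headroom figure is simply VRAM minus weight size, not a measured value.

```python
# (Ollama tag, download size in GB, comfortable VRAM in GB), taken from the table above.
MODELS = [
    ("deepseek-r1:1.5b", 1.1, 6),
    ("deepseek-r1:7b", 4.7, 8),
    ("deepseek-r1:8b", 5.2, 8),
    ("deepseek-r1:14b", 9.0, 16),
    ("deepseek-r1:32b", 20.0, 24),
    ("deepseek-r1:70b", 43.0, 48),
]


def comfortable_fits(vram_gb: float) -> list[str]:
    """Tags whose comfortable-VRAM figure is within the given budget."""
    return [tag for tag, _, comfortable in MODELS if comfortable <= vram_gb]


def headroom_gb(tag: str, vram_gb: float) -> float:
    """VRAM left after the weights load, available for KV cache and buffers."""
    sizes = {name: size for name, size, _ in MODELS}
    return vram_gb - sizes[tag]


for vram in (8, 16, 24, 48):
    best = comfortable_fits(vram)[-1]
    print(f"{vram} GB -> {best} ({headroom_gb(best, vram):.1f} GB headroom)")
```

For 48 GB the budget is the combined VRAM of two 24GB cards, which assumes the runtime splits the model across both.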

Tier 1: 1.5B — testing and edge devices

The 1.5B distill is based on Qwen2.5-Math-1.5B, a math-specialized base, which gives it disproportionate strength on numerical problems: 28.9% on AIME 2024 and 83.9% on MATH-500 from a model that weighs 1.1GB.

That math score (83.9% on MATH-500) is above the 74.6% that DeepSeek’s own comparison table lists for GPT-4o. For general conversation and creative writing, the 1.5B will disappoint: the context window fills quickly with its reasoning chains and it often loses track of multi-step instructions. But for a math homework assistant or a low-power edge inference setup, it’s remarkable that it runs at all in 4GB of VRAM.

If you’re on a laptop with an RTX 4060 (8GB) and want to experiment with R1’s reasoning behavior without committing, the 1.5B is your zero-risk entry point. It also runs fully on CPU with 16GB of system RAM at 5–10 tokens/sec — tolerable for occasional use.

Tier 2: 7B and 8B — the 8GB sweet spot

Two distills occupy this tier and they use different base models, which matters.

R1-Distill-Qwen-7B (based on Qwen2.5-Math-7B): 55.5% on AIME 2024, 92.8% on MATH-500. The Qwen2.5-Math base gives it exceptional numerical reasoning for its size. At 4.7GB download and 6-8GB VRAM, this fits on an RTX 3060 with room to spare.

R1-Distill-Llama-8B (based on Llama-3.1-8B): 50.4% on AIME 2024, 89.1% on MATH-500. Marginally weaker on math benchmarks than the 7B despite being larger, because the Llama base is a generalist rather than math-specialized. The advantage: the Llama-3.1-8B base is better on general instruction following and multi-turn conversation. For coding tasks and mixed-domain use, the 8B often feels more capable in practice even though the math benchmark says otherwise.

For most 8GB GPU owners, the 7B is the better technical choice and the 8B is the better practical choice. Run both and see which answers feel more coherent for your use cases — the performance difference on real tasks is smaller than the benchmark gap suggests.

On an RTX 4090, either model runs at roughly 80–100 tok/s. On an RTX 3060 (12GB) or RTX 4060 (8GB), expect 40–55 tok/s. Both are fast enough that the chain-of-thought streams about as fast as you can read it, so the reasoning step rarely feels like a wait.

Tier 3: 14B — the 16GB upgrade

The 14B distill (based on Qwen2.5-14B) is where reasoning quality makes a noticeable jump: 69.7% on AIME 2024 and 93.9% on MATH-500. It scores within 3 percentage points of the 32B on math, at half the VRAM.

At Q4_K_M, the 14B weights are 9.0GB on disk. An RTX 4060 Ti 16GB runs it fully in VRAM with 7GB of headroom for context. An RTX 4070 Super (12GB) can run it with tight headroom — the model weights fit at ~9GB, but you’ll need to cap your context length to avoid OOM errors. A 16GB card is the comfortable minimum.
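To see why context length decides whether a 12GB card works, it helps to estimate the KV cache directly. The sketch below is an approximation under stated assumptions: an unquantized fp16 cache, and architecture values of 48 layers, 8 KV heads, and head dimension 128 for the Qwen2.5-14B base. Those three numbers are assumptions here and should be checked against the model's config.json; the runtime also needs compute buffers that this ignores.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Keys and values are both cached for every layer, hence the factor of 2.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


# Assumed Qwen2.5-14B shape: 48 layers, 8 KV heads, head dimension 128.
for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx, 48, 8, 128):.2f} GiB")
```

Under these assumptions a 4K context costs 0.75 GiB and a 32K context costs 6 GiB, which is consistent with a 12GB card needing a context cap while a 16GB card has room to spare.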

Benchmark performance on RTX 4090: 58.6 tok/s for the 14B distill running under Ollama. On RTX 3090 (24GB, 936 GB/s bandwidth): roughly 45–50 tok/s — the 3090 is close enough to the 4090 in memory bandwidth that the gap is small for this model size.

This is the tier most home lab builders should target if they have 16GB VRAM. The quality jump from 7B to 14B is larger than any hardware upgrade you’ll make for the same money. You’re getting frontier-2024 reasoning capability in a $300–400 GPU config.

Tier 4: 32B — the 24GB card’s reason to exist

The 32B distill (based on Qwen2.5-32B) is the practical ceiling for single-card consumer inference. At 20GB download and ~20GB VRAM for the weights, it fits in an RTX 3090 or RTX 4090 with about 4GB left for the KV cache.

Quality numbers: 72.6% on AIME 2024, 94.3% on MATH-500, 57.2% on LiveCodeBench. The MATH-500 score is within 3 percentage points of the full 671B R1 model (97.3%); the AIME 2024 gap is wider, at 72.6% against 79.8%. For coding tasks, it performs at a level that was competitive with GPT-4 at launch.

Performance on RTX 4090 (1008 GB/s bandwidth): approximately 38–42 tok/s for the 32B at Q4_K_M under Ollama. With a reasoning model, this matters more than it would for a chat model — R1’s thinking chains can run 500–2000 tokens before the actual answer appears. At 40 tok/s, a 1000-token reasoning chain takes about 25 seconds, then the answer follows immediately. Most users find this tolerable for tasks where quality matters more than latency.
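The wait before the answer is just reasoning tokens divided by throughput. A minimal sketch, assuming a steady generation rate and ignoring prompt processing time:

```python
def seconds_until_answer(reasoning_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating the reasoning chain before the answer starts."""
    return reasoning_tokens / tokens_per_second


# The 500-2000 token range quoted above, at 40 tok/s.
for tokens in (500, 1000, 2000):
    wait = seconds_until_answer(tokens, 40)
    print(f"{tokens:>4} reasoning tokens -> {wait:.1f} s")
```

At 40 tok/s the quoted range works out to 12.5, 25, and 50 seconds, so halving throughput on a slower card doubles every one of those waits.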

Performance on RTX 3090 (936 GB/s bandwidth): 28–35 tok/s. The 3090 has about 7% less memory bandwidth than the 4090 (936 vs 1008 GB/s), so it stays in the same speed class even though the measured gap is wider than bandwidth alone would predict. If you already have a 3090, the 32B is its intended workload.
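The bandwidth figures for the two cards suggest a rough ceiling on decode speed. The sketch below uses a common rule of thumb, stated here as an assumption: for a dense model at batch size 1, each generated token requires reading all the weights from VRAM once, so throughput cannot exceed bandwidth divided by weight size. Real numbers come in lower because of compute, KV cache reads, and kernel overhead.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_32B_GB = 20  # Q4_K_M download size quoted for the 32B distill

for name, bandwidth in (("RTX 4090", 1008), ("RTX 3090", 936)):
    ceiling = bandwidth_ceiling_tok_s(bandwidth, WEIGHTS_32B_GB)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

This gives ceilings of about 50.4 and 46.8 tok/s, and the measured ranges above sit below both, as expected for an upper bound.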

The 32B is not the right pick for fast-turnaround coding assistance where you want sub-2-second first-token responses. For that use case, drop to the 7B or 14B. The 32B earns its VRAM for tasks where you want the best answer you can get locally: research, complex math, architecture decisions, long-document analysis.

Tier 5: 70B — dual-GPU or Mac Studio territory

The 70B distill (based on Llama-3.3-70B-Instruct) needs more VRAM than any single consumer card offers: the Q4_K_M weights alone are 43GB. On consumer hardware, that means either dual RTX 3090s (48GB total via PCIe tensor parallelism) or Apple Silicon with 64GB+ unified memory.

Quality: 70.0% on AIME 2024 (pass@1), 94.5% on MATH-500, 57.5% on LiveCodeBench. Notably, the 70B does not beat the 32B on math despite being 2.2× larger: it trails on AIME 2024 (70.0% vs 72.6%) and leads by only 0.2 points on MATH-500 (94.5% vs 94.3%). The gains are on general reasoning, long-context coherence, and tasks that benefit from the Llama-3.3 base’s stronger general instruction following.

Performance on dual RTX 3090 (PCIe 4.0 tensor parallel): 15–22 tok/s. This is the constraint: inter-GPU communication via PCIe eats throughput you’d otherwise get from the raw 936 GB/s bandwidth of each card. For a slow-but-capable local reasoning engine, it’s acceptable. For anything interactive, the latency on a long reasoning chain (2000 tokens at 18 tok/s ≈ 111 seconds) tests patience.

The honest comparison: a single Mac Studio M4 Max (128GB unified memory) runs the 70B significantly faster and with better power efficiency than dual RTX 3090s. If the 70B is your target, the Mac Studio deserves serious consideration — see our Mac Studio M3 Ultra vs Dual RTX 4090 comparison for the full breakdown.

For most home lab setups, the 32B is the better choice: similar quality on the benchmarks that matter, faster inference, and no multi-GPU complexity.

GPU performance summary

Tokens per second at Q4_K_M quantization, single-user inference via Ollama/llama.cpp:

GPU	VRAM	7B/8B	14B	32B	70B
RTX 4090	24 GB	80–100 tok/s	~59 tok/s	38–42 tok/s	N/A (won’t fit)
RTX 3090	24 GB	70–85 tok/s	~47 tok/s	28–35 tok/s	N/A (won’t fit)
RTX 4070 Ti Super	16 GB	65–80 tok/s	~42 tok/s	N/A	N/A
RTX 4060 Ti 16GB	16 GB	45–55 tok/s	~35 tok/s	N/A	N/A
RTX 3060 / 4060	12 GB / 8 GB	40–55 tok/s	marginal (12 GB only)	N/A	N/A
Dual RTX 3090	48 GB	—	—	—	15–22 tok/s

N/A means the model weights plus KV cache won’t fit in that VRAM tier. The 14B on a 12GB card is “marginal” — the weights technically fit at 9GB, but you’ll need to limit context length to ~4K tokens or risk OOM errors.
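One way to enforce that context cap is to set it per request. The sketch below builds a call to Ollama's local REST endpoint with the `num_ctx` option; it assumes a default install listening on localhost:11434, and the helper names are illustrative. Nothing is sent until `generate` is called.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(model: str, prompt: str, num_ctx: int = 4096) -> urllib.request.Request:
    """Prepare a non-streaming generate request with a capped context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # limits how much KV cache gets allocated
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 4096) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, something like `generate("deepseek-r1:14b", "Is 221 prime?")` would run the 14B at a 4K context. Keep in mind that a long reasoning chain counts against that window too.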

What changed in R1-0528

On May 28, 2025, DeepSeek released an updated version: DeepSeek-R1-0528. Two model sizes received updates — the 8B distill and the full 671B.

The full model improvement is significant: AIME 2025 accuracy jumped from 70% to 87.5%, the average reasoning chain length increased from 12,000 to 23,000 tokens, and the model gained proper function calling and JSON output support. DeepSeek also reports reduced hallucination rates.

For local inference, the relevant update is the 8B distill: the 0528 version uses Qwen3-8B as its base model instead of Llama-3.1-8B, which improves its general instruction following and multilingual capability. Ollama serves this as deepseek-r1:8b (automatically updated). The 1.5B, 7B, 14B, 32B, and 70B distilled checkpoints were not updated in the 0528 release — those remain at the original January 2025 weights.

If you’re running the 8B, re-pull the tag with ollama pull deepseek-r1:8b to get the 0528 version. If you’re on any other distilled size, nothing changed.

How to run (Ollama is the fastest path)

Install Ollama, then:

# Start with the 7B to verify everything works (~4.7GB download)
ollama run deepseek-r1:7b

# Move up to 14B once you confirm your GPU has 12GB+ of VRAM
ollama run deepseek-r1:14b

# The 32B sweet spot for 24GB cards
ollama run deepseek-r1:32b

# 70B for dual-GPU setups (will auto-split across both cards in llama.cpp)
ollama run deepseek-r1:70b

The default quantization for every Ollama tag is Q4_K_M. If you want higher quality at the cost of more VRAM, you can pull GGUF variants directly from Hugging Face (bartowski’s repos are the most maintained) and run them via ollama create.

LM Studio also supports all six distilled sizes with its built-in model search — search “deepseek-r1” and filter by your VRAM tier. The LM Studio UI makes it easier to experiment with different quantization levels without touching the command line.
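If you script against these models, you will usually want the final answer without the reasoning chain. A minimal sketch, assuming the raw output wraps its reasoning in `<think>...</think>` tags as the R1 distills do; some front ends and newer runtime versions may strip or separate that block for you, in which case this step is unnecessary.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\n17 * 3 = 51, plus 4 is 55.\n</think>\n\nThe result is 55."
reasoning, answer = split_reasoning(raw)
print(answer)  # The result is 55.
```

Logging the reasoning separately is useful when an answer looks wrong, since the chain often shows where the model went off track.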

Honest take: which tier should you actually run?

8–12GB VRAM (RTX 3060, 4060): Run the 7B. It’s fast, it fits with room to spare, and its math benchmark (55.5% AIME 2024) puts it well above any chat model you’d run at this VRAM tier. The 8B 0528 update is also worth trying for mixed-domain work.

16GB VRAM (RTX 4060 Ti 16GB, 4070 Ti Super): The 14B is the clear answer. The quality jump from 7B to 14B (55.5% → 69.7% AIME) is larger than the jump from 14B to 32B (69.7% → 72.6%), and it costs nothing extra because you already have the VRAM. Don’t run the 7B when the 14B fits.

24GB VRAM (RTX 3090, 4090): The 32B is the right call for quality-sensitive work. Keep the 14B installed for fast iteration (coding, quick questions) and the 32B for when you need the best answer. The two will not fit in 24GB at the same time (9GB plus 20GB of weights), so Ollama unloads one to load the other, and switching costs a reload.

Dual 24GB cards or Mac Studio 64GB+: The 70B is your target, but only if the slower inference doesn’t frustrate you. Most users in this hardware tier report that the 32B gives 95% of the quality at 2× the throughput, and they default to the 32B in daily use.

If you don’t have the GPU for the tier you want, RunPod lets you rent an RTX 4090 pod by the hour ($0.34/hr on Community Cloud as of May 2026). A session of 50 R1-32B queries takes roughly 20–30 minutes of inference time at 40 tok/s, which comes to under $0.20 in cloud rental — worth it for occasional high-quality inference before committing to hardware. See our RunPod vs local GPU breakdown for the break-even math.
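The rental estimate is easy to rerun with your own numbers. In the sketch below, the 1,200 tokens per query is an assumption chosen from within the 500–2000 token reasoning range discussed earlier; the rate and throughput are the figures quoted above. It counts only generation time, not pod startup or model download.

```python
def inference_minutes(queries: int, tokens_per_query: int, tokens_per_second: float) -> float:
    """Total generation time in minutes for a batch of queries."""
    return queries * tokens_per_query / tokens_per_second / 60


def rental_cost_usd(minutes: float, hourly_rate_usd: float) -> float:
    """Cost of renting a pod for the given number of minutes."""
    return minutes / 60 * hourly_rate_usd


minutes = inference_minutes(queries=50, tokens_per_query=1200, tokens_per_second=40)
cost = rental_cost_usd(minutes, hourly_rate_usd=0.34)
print(f"{minutes:.0f} min of inference costs about ${cost:.2f}")
```

With these inputs the session takes 25 minutes and costs about $0.14, inside the range given above.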

The R1 distilled family is one of the best arguments for local inference in 2026: open weights, no usage limits, no data leaving your machine, and quality that would have been considered frontier-tier twelve months ago. Pick the size that fills your VRAM, run it tonight.

Last updated May 26, 2026. GPU prices and cloud rental rates change frequently; verify current rates before purchasing.
