Jun 28, 2026

How to Run a 70B Model on a Single 24GB GPU in 2026 (and When You Shouldn't)

By RunAIHome Team · 11 min read

local-llmgpullamaollamallama-cppvram

TL;DR: A 70B model at Q4_K_M needs ~43–45 GB of VRAM, and a single 24 GB card holds barely half of it. You can run one with partial offload, but you’ll get 6–12 tok/s instead of 30+, and the honest move for most people is a 32B-class model that fits fully. This guide shows the exact -ngl math, the KV-cache trick that buys you 4–6 extra layers, and the speed you’ll actually see.

What you’ll be able to do after this:

Calculate the correct offload layer count for your card and context length, instead of guessing
Cut KV-cache VRAM roughly in half with one flag, freeing room for more GPU layers
Decide — with numbers, not vibes — whether a 70B is worth the speed hit on your hardware

Honest take: On one 24 GB GPU, a 70B at Q4 runs but crawls (single-digit to low-teens tok/s). If you need 70B-class quality, rent a 48 GB card by the hour; if you need speed on hardware you own, run a 32B-class model that fits in 24 GB.

The core problem: the model is twice the size of your card

Here’s the wall everyone hits. A modern dense 70B like Llama 3.3 70B at Q4_K_M is about 42.5 GB of weights (bartowski’s widely-used GGUF), and once you add the KV cache, compute buffers, and CUDA overhead, you need ~43–45 GB of VRAM to run it entirely on the GPU. A single RTX 3090 or RTX 4090 gives you 24 GB. That’s roughly half of what a fully-resident 70B Q4 wants.

The full quant ladder for a 70B, so you can see where the cliffs are:

Quant	Approx. file size	Fits fully in 24 GB?	Notes
F16 (unquantized)	~141 GB	No	Datacenter only
Q8_0	~75 GB	No	Needs 2× 48 GB
Q5_K_M	~50 GB	No	Single 48–64 GB card
Q4_K_M	~42.5 GB	No	The “good quality” baseline; needs ~48 GB to be comfortable
Q3_K_M	~34 GB	No	Noticeable quality drop
Q2_K	~26 GB	Almost — spills ~2 GB	Quality degrades visibly
IQ1_M	~16.75 GB	Yes	Barely coherent; not worth it

File sizes from bartowski’s Llama-3.3-70B-Instruct-GGUF repo; VRAM-in-use runs a few GB above file size because of context and buffers.

So the only quant that fits fully on a 24 GB card is something below Q2_K — and at that point you’ve quantized away most of the reason you wanted a 70B. The realistic play is partial offload: put as many layers as fit on the GPU, run the rest on the CPU.

How partial offload actually works

A 70B GGUF has 80 transformer layers plus an output layer. With llama.cpp (and anything built on it — Ollama, LM Studio, koboldcpp), the -ngl N flag (--n-gpu-layers) decides how many of those layers live in VRAM. The rest stay in system RAM and run on the CPU.

At Q4_K_M, each of those 80 layers weighs roughly 475 MB. During every forward pass, execution bounces between GPU and CPU: the GPU-resident layers run at full speed, and the CPU layers run at whatever your system RAM bandwidth allows — which is an order of magnitude slower. That handoff is why partial offload is functional but never fast.

The math for a 24 GB card:

24 GB total minus ~2 GB for CUDA context, the OS, and compute buffers = ~22 GB usable
Reserve 2–4 GB for the KV cache (more if you want long context)
That leaves ~18 GB for weights → ~18,000 MB ÷ 475 MB ≈ 38 layers

In practice people land at 40–45 layers on a 24 GB card by trimming context length and quantizing the KV cache. That puts roughly half the model on the GPU. Community reports put a single RTX 3090 running Llama 70B Q4_K_M at about 8 tok/s with basic settings — slow, but usable for non-interactive work.

The relationship is roughly linear: with no GPU at all you’re at 2–4 tok/s (pure CPU), partial offload gets you into the 8–15 tok/s range, and a card that holds the whole thing does 30+ tok/s. If you offload too few layers you barely beat CPU; the win comes from getting as many layers onto the GPU as your VRAM physically allows.

Step 1: Find your real layer ceiling

Don’t guess -ngl. Start high and let the loader tell you the truth.

With llama.cpp directly:

./llama-cli -m Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 4096 -fa \
  -p "Explain partial GPU offload in one paragraph."

-ngl 99 means “offload everything you can.” On a 24 GB card it won’t fit, and you’ll get a CUDA out-of-memory error — that’s expected. Watch the log line that reports offloaded XX/81 layers to GPU before it dies, then back the number down. If you OOM at 81, try 45, then nudge up until it loads with a few hundred MB to spare. (If you keep hitting OOM and don’t understand why, our CUDA out-of-memory fix guide walks every cause.)

With Ollama, the equivalent is a Modelfile parameter or an environment-driven setting:

# In a Modelfile:
FROM llama3.3:70b
PARAMETER num_gpu 45
PARAMETER num_ctx 4096

num_gpu is Ollama’s name for -ngl. Set it too high and Ollama will silently spill into shared memory on Windows (which tanks speed) or OOM on Linux. Set it just below your ceiling.

One gotcha worth knowing: if your model keeps unloading and reloading between prompts, that’s a separate keep-alive issue, not an offload problem — see why Ollama keeps reloading the model.

Step 2: Quantize the KV cache to buy back layers

This is the single highest-leverage trick on a memory-starved card. The KV cache stores attention keys and values for every token in context, and at the default F16 precision it eats VRAM fast — several GB at long context. Running it at q8_0 cuts that roughly in half with minimal quality impact, which frees room for 4–6 more transformer layers on the GPU.

In Ollama, KV-cache quantization requires Flash Attention, so you set both:

# Set these where the Ollama *service* reads them (systemd, launchctl, or
# the Windows user env), not just your shell:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0

In raw llama.cpp:

./llama-cli -m model.gguf -ngl 45 -fa \
  -ctk q8_0 -ctv q8_0 -c 8192

-fa enables Flash Attention; -ctk/-ctv set the K and V cache types. Two warnings the docs bury:

Not every architecture supports Flash Attention. If you force q8_0 on an unsupported model, Ollama silently falls back to F16 — so you budget for half the cache and then OOM unexpectedly. Verify with ollama ps that the model loaded at the size you expected.
q4_0 cache cuts VRAM further but the quality loss becomes noticeable, especially on long-context reasoning. Stick with q8_0 unless you’re desperate for those last few hundred MB.

With q8_0 cache and Flash Attention on, a 24 GB card that topped out at 40 layers can often reach 44–46, and every extra layer on the GPU is a measurable speed gain.

Step 3: Keep context realistic

Context length is the other VRAM lever, and it’s the one people forget. The KV cache scales linearly with context: doubling -c from 4096 to 8192 roughly doubles cache VRAM, which costs you GPU layers. On a 24 GB card running a 70B, 4096–8192 tokens is the sweet spot. If you genuinely need 32K+ context, you’re back to needing more VRAM — there’s no free lunch.

Also make sure your system RAM can hold the half of the model that lives on the CPU. A Q4 70B needs ~42 GB total, so with ~22 GB on the GPU you need at least ~24 GB of free system RAM for the rest plus headroom — meaning 48 GB of RAM minimum, 64 GB comfortable. If you’re short, that’s a system RAM sizing problem, and the symptom is the Linux OOM killer terminating your runner mid-load.

When you shouldn’t do this at all

Here’s the part most “how to run 70B” posts skip. Partial offload works, but at 8–12 tok/s a 70B is painful for anything interactive — that’s slower than most people read, and a long agentic task can take minutes per turn. Before you commit, weigh the two genuinely better options:

Run a 32B-class model that fits fully. A dense 32B or a sparse MoE like Qwen3.6 35B-A3B fits in 24 GB at Q4 with context headroom and runs at 30+ tok/s — and on most real tasks the quality gap to a 70B is smaller than the 3–4× speed gap. This is the recommendation for the majority of single-24 GB-card owners. See our best local models by VRAM tier for picks, and for coding specifically, aicoderscope.com’s local coding model roundup tracks which ones are worth the VRAM.

Rent a 48 GB card by the hour. If you need true 70B-class output and only occasionally, a single 48 GB cloud GPU holds Llama 3.3 70B Q4 fully and runs it at 30+ tok/s with no offload gymnastics. On RunPod an A6000-class 48 GB card runs a few dimes an hour — cheaper than the electricity-plus-frustration of crawling through offload on hardware that was never sized for it. The rent-vs-buy math tips toward renting the moment your usage is bursty rather than constant.

The one case where 24 GB partial offload is the right call: you already own the card, your workload is batch/overnight (summarizing a document pile, generating training data), latency doesn’t matter, and you specifically need 70B reasoning that a 32B can’t match. For that, offload is exactly the tool.

A complete working config

Putting the levers together, here’s a known-good starting point for a 24 GB RTX 3090 or 4090 running Llama 3.3 70B Q4_K_M:

# Ollama service environment
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0

# Modelfile
FROM llama3.3:70b
PARAMETER num_gpu 45
PARAMETER num_ctx 8192

Load it, run a prompt, and check ollama ps: you want it reporting close to 100% of the layers you intended on GPU and a VRAM figure a few hundred MB under 24 GB. Expect roughly 10–14 tok/s with this config on a 3090 — slower than a 32B, fast enough for non-interactive work, and dramatically better than the 2–4 tok/s you’d get with everything on the CPU.

FAQ

Can a 24 GB GPU run a 70B model at all? Yes, via partial offload — roughly 40–46 of the 80 layers live on the GPU and the rest run on the CPU. Expect 8–14 tok/s at Q4_K_M, versus 30+ tok/s if the whole model fit. It cannot hold a Q4 70B entirely; that needs ~43–45 GB of VRAM.

What’s the best quant for a 70B on 24 GB? Q4_K_M with partial offload gives the best quality-per-effort. Q2_K (~26 GB) almost fits fully but degrades quality enough that a 32B at Q4 usually beats it. Don’t go below Q3 chasing a full GPU fit.

Does KV-cache quantization hurt quality? q8_0 has minimal measurable impact and is the recommended default — it halves cache VRAM and requires Flash Attention enabled. q4_0 saves more but introduces noticeable degradation on long-context tasks.

Is two used RTX 3090s better than offloading on one? For 70B, yes — two 24 GB cards give 48 GB total and run a Q4 70B fully on GPU at 20–30 tok/s. At ~$1,050–$1,240 each on the used market in June 2026, a dual-3090 build is the cheapest path to comfortable 70B speed. See our used RTX 3090 value analysis.

Why is my offloaded 70B so slow even with 45 layers on the GPU? The CPU-resident layers are the bottleneck, and they run at system-RAM bandwidth. Faster RAM (DDR5 over DDR4) and quantizing the KV cache to fit more layers on the GPU are the only real fixes short of more VRAM.

Sources

Last updated June 28, 2026. Prices and specs change; verify current rates before purchasing.

Recommended Gear

RTX 3090 (used, 24GB) — best value 24 GB card for offload builds; ~$1,050–$1,240 used in June 2026
RTX 4090 (24GB) — faster compute, same 24 GB ceiling, same offload limits

Was this article helpful?