Jun 29, 2026

Ollama Slow? How to Get More Tokens per Second From the GPU You Already Have (2026)

By RunAIHome Team · 9 min read

ollama · local-llm · gpu · performance · vram

TL;DR: Ollama feeling slow is almost never a “your GPU is too weak” problem — it’s usually a default that’s costing you VRAM or forcing CPU offload. Decode speed is bound by memory bandwidth, so the wins come from fitting more of the model on the GPU and stopping it from reloading. Five settings do 90% of the work.

What you’ll be able to do after this:

Read Ollama’s own logs to know why you’re slow (CPU spillover vs. cold start vs. oversized context)
Free enough VRAM with flash attention + KV-cache quantization to push more layers onto the GPU
Stop the model reloading between prompts and trim a bloated context window down to what you actually use

Honest take: If ollama ps shows anything other than 100% GPU, fixing that one thing will beat every other tweak combined. Get the whole model onto the card first; only then bother tuning num_batch and context.

First, find out why it’s slow

Before changing anything, run this while a model is loaded:

ollama ps

The PROCESSOR column is the whole game. 100% GPU means the model is fully resident in VRAM and you’re getting that card’s real speed. Anything like 48%/52% CPU/GPU means part of the model spilled into system RAM, and that is your bottleneck — CPU-resident layers run at system-memory bandwidth, an order of magnitude slower than VRAM.
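If you want to check that column from a script, here is a minimal Python sketch. It assumes the PROCESSOR cell takes one of the forms described above (100% GPU, 100% CPU, or a split such as 48%/52% CPU/GPU with the CPU share first). The helper names gpu_percent and needs_fixing are illustrative, not part of Ollama, and your version may print other layouts.

```python
import re


def gpu_percent(processor: str) -> int:
    """Return the percentage of the model resident on the GPU.

    Assumes the PROCESSOR cell looks like "100% GPU", "100% CPU",
    or "48%/52% CPU/GPU" (CPU share first, GPU share second).
    """
    text = processor.strip()

    # Split case: "<cpu>%/<gpu>% CPU/GPU"
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2))

    # Single-device case: "<n>% GPU" or "<n>% CPU"
    single = re.fullmatch(r"(\d+)%\s+(GPU|CPU)", text)
    if single:
        pct = int(single.group(1))
        return pct if single.group(2) == "GPU" else 100 - pct

    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


def needs_fixing(processor: str) -> bool:
    """True when any part of the model has spilled off the GPU."""
    return gpu_percent(processor) < 100
```

With the split from the text, gpu_percent("48%/52% CPU/GPU") returns 52, so needs_fixing reports True and the GPU-residency fixes come first.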

Why bandwidth and not clock speed? During text generation, the model reads every one of its weights from memory for each token it produces. A 7B model at Q4 is ~4.5 GB; to emit 50 tokens/sec it streams that 4.5 GB fifty times a second. That’s why an RTX 3090 at 936 GB/s does roughly 95 tok/s on a 7B model, an RTX 4090 lands around 135 tok/s, and an RTX 3060 12GB sits near 45 tok/s — the gap tracks memory bandwidth far more than core count. (For the full picture of why decode is bandwidth-bound, see why local LLMs got good in 2026.)
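The arithmetic behind that claim can be sketched in a few lines of Python. This is a back-of-envelope model, not a benchmark: it assumes every generated token streams the full set of weights once and ignores KV-cache reads and kernel overhead, which is one reason measured speeds land well below the ceiling. The function names are illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/sec if each token streams all weights once."""
    return bandwidth_gb_s / model_gb


def required_bandwidth_gb_s(model_gb: float, tok_s: float) -> float:
    """Memory traffic needed to sustain a given decode rate."""
    return model_gb * tok_s


MODEL_GB = 4.5          # 7B at Q4, figure from the text
RTX_3090_GB_S = 936.0   # RTX 3090 bandwidth, figure from the text

ceiling = decode_ceiling_tok_s(RTX_3090_GB_S, MODEL_GB)
print(f"ceiling: {ceiling:.0f} tok/s")                                     # 208
print(f"95 tok/s is {95 / ceiling:.0%} of that ceiling")                   # 46%
print(f"50 tok/s needs {required_bandwidth_gb_s(MODEL_GB, 50):.0f} GB/s")  # 225
```

The useful part is the shape of the formula: halve the bytes per token or double the bandwidth, and the ceiling doubles. Clock speed does not appear in it at all.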

So the priority order is fixed: (1) get 100% GPU, (2) keep it loaded, (3) optimize throughput. Everything below follows that order.

Fix 1: Enable flash attention (it may already be on)

Flash attention is a more memory-efficient way to compute attention: it works through the attention matrix in tiles instead of materializing it in full, which shrinks the working memory attention needs as context grows. Community testing puts the VRAM saving around 30–50% depending on context length, and that freed VRAM is exactly what lets you push spilled layers back onto the GPU.

Good news for recent setups: as of Ollama v0.30.0 (May 13, 2026), flash attention turns on automatically for modern architectures — qwen3, qwen3moe, qwen3vl, gemma3, gpt-oss, and mistral3 — when you’re on an Ampere-or-newer NVIDIA GPU (RTX 30-series and up) or an RDNA3-or-newer AMD card (RX 7000-series and up). If you run those models on that hardware, you’re already getting it.

For everything else, turn it on explicitly. Set it where the service reads it, not in your shell — this is the same trap people hit with Ollama not using the GPU:

# Linux (systemd) — the only place that sticks:
sudo systemctl edit ollama.service
# add under [Service]:
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama

On macOS use launchctl setenv OLLAMA_FLASH_ATTENTION 1 (and add it to a LaunchAgent so it survives reboot); on Windows set it as a user environment variable and quit/restart Ollama from the tray.

Fix 2: Quantize the KV cache

The KV cache stores the attention keys/values for every token in your context, and it grows with context length. Ollama’s default cache type is f16. Quantizing it to q8_0 uses roughly half the memory of f16 with a very small precision loss — and that’s often the difference between 100% GPU and a CPU spillover.

OLLAMA_KV_CACHE_TYPE=q8_0   # ~1/2 the VRAM of f16, negligible quality loss
OLLAMA_KV_CACHE_TYPE=q4_0   # ~1/4, only if you're truly tight — measurable degradation

Critical dependency: KV-cache quantization only works when flash attention is enabled. On builds where flash attention isn’t supported for your model, older Ollama versions would abort the load entirely if you’d set a quantized cache — so pair the two settings, and if a model refuses to load, suspect the cache type first. q8_0 is the safe recommendation; reach for q4_0 only on a card where the alternative is CPU offload.
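To see where "roughly half" and "roughly a quarter" come from, here is a hedged Python sketch of the usual KV-cache size estimate: keys plus values, for every layer, KV head, and context position. The model shape below is hypothetical and chosen only for illustration. The per-element sizes assume the standard ggml block layouts (32 values plus a 2-byte scale per block), and Ollama's actual allocation may add padding or overhead on top.

```python
# Bytes per cached element. q8_0 and q4_0 pack 32 values per block
# plus a 2-byte scale: 34 bytes and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, cache_type="f16"):
    """Keys + values for every layer, KV head, and context position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * num_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape, for illustration only.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for cache_type in BYTES_PER_ELEMENT:
    gib = kv_cache_bytes(num_ctx=8192, cache_type=cache_type, **SHAPE) / 2**30
    print(f"{cache_type:>5}: {gib:.2f} GiB")
```

For this shape at an 8192-token context the estimate is 1.00 GiB at f16, 0.53 GiB at q8_0, and 0.28 GiB at q4_0. Because the result scales linearly with num_ctx, the same ratios apply to a much larger absolute number at 32K.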

Fix 3: Stop the model from reloading

A common “Ollama got slow” complaint is really a cold start in disguise. By default Ollama unloads a model from VRAM after 5 minutes of inactivity (OLLAMA_KEEP_ALIVE defaults to 5m). Your next prompt then waits while it reloads the weights from disk: seconds for a 7B, over a minute for a 70B on a slow drive.

Pin the model in memory:

OLLAMA_KEEP_ALIVE=-1     # keep loaded indefinitely
# or a generous window:
OLLAMA_KEEP_ALIVE=30m

Set it on the service (same mechanism as Fix 1). If the model still reloads on every request even with this set, you’re hitting a different issue — model swapping under OLLAMA_MAX_LOADED_MODELS — covered in detail in Ollama keeps reloading the model.
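One way to confirm a cold start rather than guess: a non-streaming reply from Ollama's HTTP API includes timing fields in nanoseconds, among them load_duration, eval_count, and eval_duration. The Python sketch below turns those into a quick diagnosis. The 1-second threshold is an arbitrary assumption, and the sample values are made up for illustration, not measurements.

```python
NS_PER_S = 1_000_000_000


def summarize_timing(resp: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into readable numbers."""
    load_s = resp.get("load_duration", 0) / NS_PER_S
    eval_s = resp.get("eval_duration", 0) / NS_PER_S
    tok_s = resp.get("eval_count", 0) / eval_s if eval_s > 0 else 0.0
    return {"load_s": load_s, "decode_tok_s": tok_s}


def looks_like_cold_start(resp: dict, threshold_s: float = 1.0) -> bool:
    """Heuristic: a long load phase suggests the weights were reloaded."""
    return summarize_timing(resp)["load_s"] > threshold_s


# Illustrative values only.
cold = {"load_duration": 6_200_000_000, "eval_count": 190,
        "eval_duration": 2_000_000_000}
warm = {"load_duration": 40_000_000, "eval_count": 190,
        "eval_duration": 2_000_000_000}

print(looks_like_cold_start(cold), looks_like_cold_start(warm))
```

Both sample replies decode at the same 95 tok/s. Only the load phase differs, which is the signature of a keep-alive problem rather than a throughput problem.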

Fix 4: Right-size your context window

num_ctx defaults to 4096 tokens. Here’s the catch: the KV cache is allocated for the full num_ctx up front, whether you use it or not. If you bumped it to 32K or 128K “just in case” and your actual prompts are a few hundred tokens, you’re burning VRAM that could be holding model layers — which forces CPU offload, which is the slow you’re feeling.

Set it to what you actually need:

# per-session, from inside the interactive prompt:
ollama run qwen3
>>> /set parameter num_ctx 8192

# or globally for the server:
OLLAMA_CONTEXT_LENGTH=8192

For chat and short coding tasks, 4096–8192 is plenty. Reserve big contexts for documents you’re genuinely feeding in whole — and when you do, that’s exactly when Fixes 1 and 2 pay for themselves, because a quantized KV cache at 32K saves a lot more absolute VRAM than at 4K. (Running out of memory anyway? The full triage is in CUDA out of memory: every fix that works.)
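If your application talks to Ollama over its HTTP API instead of the CLI, the context size can also be set per request through the options field. The following is a minimal Python sketch using only the standard library. It assumes a server on the default local port (11434), and the model name and prompt are placeholders.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Body for /api/generate with a per-request context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return a single JSON object
        "options": {"num_ctx": num_ctx},  # context size for this call
    }


def generate(body: dict, url: str = OLLAMA_URL) -> dict:
    """POST the body to a running Ollama server and decode the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


body = build_request("qwen3", "Summarize this function.", num_ctx=8192)
# reply = generate(body)  # requires a running Ollama server
```

Keep in mind that switching num_ctx between requests can make the server reload the model with a new cache allocation, so pick one value per workload rather than varying it call by call.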

Fix 5: Pick the right quant, then push num_batch

Two throughput levers, in priority order.

Quantization is the bigger one. A 7B at Q4_K_M runs roughly 2× faster than the same model at Q8_0, because there are fewer bytes per weight to stream — and on a bandwidth-bound workload, fewer bytes means more tokens. Q4_K_M is the value sweet spot for most models; going below it (Q3, Q2) trades real quality for diminishing speed gains. We break the quality-vs-size tradeoff down with numbers in Q4 vs Q5 vs Q6 vs Q8 quantization.
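The "fewer bytes, more tokens" argument can be checked with a rough calculation. In this Python sketch the Q8_0 figure follows from its block layout (32 int8 weights plus a 16-bit scale, so 8.5 bits per weight), while the Q4_K_M figure is an approximate average that varies by model and should be treated as an assumption. The parameter count is a nominal 7.0 billion, and many "7B" models carry slightly more.

```python
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,     # 32 int8 weights + one 16-bit scale per block
    "Q4_K_M": 4.85,  # approximate average; varies slightly by model
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights in GB for a given quant."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def bandwidth_bound_speedup(slow_quant: str, fast_quant: str) -> float:
    """Best-case decode speedup if throughput scales with bytes streamed."""
    return BITS_PER_WEIGHT[slow_quant] / BITS_PER_WEIGHT[fast_quant]


print(f"7B at Q8_0:   {weights_gb(7, 'Q8_0'):.1f} GB")
print(f"7B at Q4_K_M: {weights_gb(7, 'Q4_K_M'):.1f} GB")
print(f"speedup bound: {bandwidth_bound_speedup('Q8_0', 'Q4_K_M'):.2f}x")
```

Under these assumptions the bound comes out near 1.75x, in line with the "roughly 2×" figure. Measured gains depend on the model and the kernels in use.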

num_batch is the fine-tune. It defaults to 512 and controls how many tokens are processed in parallel during prompt evaluation. If you have VRAM headroom after the model fits at 100% GPU, raising it speeds up prompt processing (time-to-first-token on long inputs):

# more prompt-eval throughput, more VRAM:
ollama run qwen3
>>> /set parameter num_batch 1024

The order matters: only raise num_batch once ollama ps reads 100% GPU. Spending VRAM on a bigger batch while the model is spilling to CPU makes things worse, not better.
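To judge whether a larger num_batch actually helped, compare prompt-evaluation throughput between two runs of the same long prompt. The Python sketch below reads the prompt_eval_count and prompt_eval_duration fields (nanoseconds) that Ollama's API returns with a non-streaming reply. The two sample replies are invented for illustration, and real comparisons should use a fresh prompt each time so that prompt caching does not skew the counts.

```python
def prompt_eval_tok_s(resp: dict) -> float:
    """Prompt tokens processed per second, from Ollama's timing fields."""
    ns = resp.get("prompt_eval_duration", 0)
    return resp.get("prompt_eval_count", 0) * 1e9 / ns if ns > 0 else 0.0


def relative_change(baseline: dict, candidate: dict) -> float:
    """Fractional change in prompt throughput; positive means faster."""
    base = prompt_eval_tok_s(baseline)
    if base == 0:
        raise ValueError("baseline has no prompt-eval timing")
    return (prompt_eval_tok_s(candidate) - base) / base


# Illustrative values only, not measurements.
run_512 = {"prompt_eval_count": 4000, "prompt_eval_duration": 2_000_000_000}
run_1024 = {"prompt_eval_count": 4000, "prompt_eval_duration": 1_600_000_000}

print(f"{relative_change(run_512, run_1024):+.0%}")
```

If the change is small or negative, the extra VRAM is better spent elsewhere, so drop num_batch back to its default.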

Putting it together

A solid default service config for a 24GB card running modern models:

OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_KEEP_ALIVE=30m
OLLAMA_CONTEXT_LENGTH=8192

Then per workload: pick Q4_K_M, confirm 100% GPU in ollama ps, and only then experiment with num_batch. One honest caveat on expectations — Ollama carries its Go server layer on top of llama.cpp, so it runs about 3–10% slower than raw llama.cpp numbers you’ll see quoted. That’s the cost of the convenience, and it’s not something a setting fixes.
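For the systemd route from Fix 1, the four settings listed above need to end up as Environment= lines under [Service]. Here is a small Python sketch that renders that fragment from a dictionary, so the same values can be kept in one place. The function name render_dropin is illustrative, and the output is meant to be pasted into the editor that systemctl edit opens, not written to disk by this script.

```python
SETTINGS = {
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
    "OLLAMA_KEEP_ALIVE": "30m",
    "OLLAMA_CONTEXT_LENGTH": "8192",
}


def render_dropin(settings: dict) -> str:
    """Render settings as a systemd [Service] override fragment."""
    lines = ["[Service]"]
    lines += [f'Environment="{key}={value}"' for key, value in settings.items()]
    return "\n".join(lines) + "\n"


print(render_dropin(SETTINGS), end="")
```

After saving the override, restart the service as in Fix 1 and confirm the result with ollama ps.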

The thing that won’t help: chasing CPU thread counts or clock speeds when the model is already on the GPU. Decode is bandwidth-bound. If you’ve done all five fixes and still want more tokens/sec, the lever is hardware — and the used RTX 3090 at 936 GB/s remains the best bandwidth-per-dollar pick for local inference. If you’re tuning Ollama specifically for a coding workflow, our sister site aicoderscope.com covers the editor-side setup.

FAQ

Does enabling flash attention reduce quality? No. Flash attention is mathematically equivalent to standard attention — it just computes it in a more memory-efficient order. The only quality consideration is the KV-cache quantization it enables: q8_0 is near-lossless, q4_0 is noticeable.

Why is my first message after a pause slow but the rest fast? That’s a cold-start reload. The model was unloaded after the 5-minute keep-alive timeout. Set OLLAMA_KEEP_ALIVE=-1 or 30m to pin it.

I set OLLAMA_FLASH_ATTENTION=1 and nothing changed. Why? Either you set it in a shell instead of on the service (systemd/launchctl/Windows env), or you’re on Ollama v0.30+ running a modern model where it was already on by default, so there was no change to see.

Is q4_0 KV cache worth it? Only when the alternative is CPU offload. q4_0 uses ~1/4 the VRAM of f16 but the precision loss is measurable on longer contexts. Try q8_0 first; drop to q4_0 only if q8_0 still leaves you spilling to CPU.

My model still won’t fit at 100% GPU after all this. Then it’s genuinely too big for your VRAM and you’re in partial-offload territory. See how to run a 70B model on a single 24GB GPU for the offload math, or step down to a model that fits — a Qwen3.6 35B-A3B MoE gives you 30B-class quality at 30+ tok/s on 24GB.

Last updated June 29, 2026. Ollama defaults and version behavior change between releases; verify against your installed version with ollama --version and ollama ps.

Recommended Gear

RTX 3090 — 24GB, 936 GB/s; best bandwidth-per-dollar for local inference on the used market
RTX 4090 — 24GB, fastest single-card 7B tok/s in this list (~135)
RTX 3060 12GB — entry 12GB card; ~45 tok/s on a 7B, fine for smaller models
