Jun 14, 2026

CUDA Out of Memory on Local AI? Every Fix That Works for Ollama, llama.cpp, ComfyUI, and vLLM (2026)

By RunAIHome Team · 10 min read

cuda · gpu · local-llm · troubleshooting · comfyui · vllm

TL;DR: A “CUDA out of memory” error almost always means one of three things — your context window is too long, your KV cache or batch is reserving VRAM up front, or fragmentation is wasting memory you technically have. The fastest wins: shrink context, quantize the KV cache, and set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. You rarely need a bigger GPU; you need a tighter config.

What you’ll be able to do after this guide:

Read the error line and know which allocation blew up — model weights, KV cache, or activations
Apply the right fix per engine (Ollama, llama.cpp, ComfyUI, vLLM) instead of guessing
Free 30–60% of your VRAM without downgrading the model you actually want to run

Honest take: The number-one cause is a context window you never asked for. Tools like Ollama and vLLM will happily pre-reserve KV cache for an 8K, 32K, or 128K window even when your prompt is 400 tokens. Cap the context to what you actually use and most OOMs disappear before you touch anything else.

First, read the error — it tells you which allocation failed

The canonical message looks like this:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.99 GiB of which 1.43 GiB is free.

Two numbers matter: total capacity and how much was free when it failed. If it died trying to allocate a large block while several GB were still “reserved but unallocated,” that’s fragmentation, not a true shortage — different fix. If it died with almost nothing free, you genuinely overcommitted, and you need to cut something real (context, batch, model size, or precision).
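If you want those two numbers out of a log without squinting, a small helper can pull them for you. This is a minimal sketch: the function name oom_numbers is made up for illustration, and it assumes both sizes are reported in GiB, as in the sample message above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PyTorch): extract the requested and free sizes
# from an OOM message. Assumes both figures are printed in GiB.
oom_numbers() {
  local msg="$1" requested free
  requested=$(printf '%s\n' "$msg" | sed -nE 's/.*Tried to allocate ([0-9.]+) GiB.*/\1/p' | head -n 1)
  free=$(printf '%s\n' "$msg" | sed -nE 's/.*of which ([0-9.]+) GiB is free.*/\1/p' | head -n 1)
  printf 'requested=%s free=%s\n' "$requested" "$free"
}

msg='torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB.
GPU 0 has a total capacity of 23.99 GiB of which 1.43 GiB is free.'
oom_numbers "$msg"   # prints: requested=2.00 free=1.43
```

A request larger than the free figure with little else to reclaim points at a real shortage; a request smaller than what the card still holds in reserve points at fragmentation.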

Before changing anything, watch the GPU while the job runs:

$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
memory.used [MiB], memory.total [MiB]
22310 MiB, 24576 MiB

If memory.used climbs steadily until the crash, your context or batch is the leak. If it spikes at one node (a VAE decode, a long-context prefill), that single step is the culprit and the fix is local to it.
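To tell a steady climb from a single spike after the fact, it can help to log the samples and summarise them. The sketch below assumes a log of "used, total" pairs in MiB, which is what nvidia-smi emits with the csv,noheader,nounits format; vram_summary is a hypothetical helper name, and the sample values are invented.

```shell
#!/usr/bin/env bash
# Capture samples on a machine with an NVIDIA GPU (one "used, total" pair per second, in MiB):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits -l 1 > vram.csv

# Hypothetical helper: summarise "used, total" lines read from stdin.
vram_summary() {
  awk -F', *' '
    NR == 1 { first = $1 + 0 }
    { last = $1 + 0; if (last > peak) peak = last }
    END { printf "first=%d peak=%d last=%d\n", first, peak, last }
  '
}

# Invented sample: flat usage with one sharp spike
printf '9000, 24576\n9100, 24576\n21800, 24576\n9200, 24576\n' | vram_summary
# prints: first=9000 peak=21800 last=9200
```

A peak far above both the first and last samples suggests a one-step spike; a last sample that sits at the peak after a long run suggests steady growth.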

Where VRAM actually goes

Three buckets compete for the same card:

Model weights — fixed once loaded. An 8B model at Q4 is ~4.5 GB; at FP16 it’s ~16 GB.
KV cache — grows with context length and concurrent requests. This is the silent killer.
Activations / working buffers — transient, but a 4K-resolution VAE decode in ComfyUI can momentarily need several GB.
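The weight figures above follow from simple arithmetic: parameters times bits per weight, divided by eight. Here is a rough sketch of that estimate. The helper name weights_gb is hypothetical, and 4.5 bits per weight for Q4 is an assumption (it matches a layout that stores a 16-bit scale per 32 four-bit weights; other Q4 variants run slightly higher).

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate weight size in GB.
# Arguments: parameters in billions, bits per weight multiplied by 10 (45 means 4.5 bits).
weights_gb() {
  local params_b="$1" bits_x10="$2"
  local tenths=$(( params_b * bits_x10 / 8 ))   # size in tenths of a GB, rounded down
  printf '%d.%d\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

weights_gb 8 45     # 8B at ~4.5 bits per weight  -> 4.5
weights_gb 8 160    # 8B at FP16                  -> 16.0
weights_gb 70 45    # 70B at ~4.5 bits per weight -> 39.3
```

This covers bucket #1 only; the KV cache and working buffers come on top.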

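The KV cache is where most people lose. Its size is 2 × layers × KV heads × head dimension × bytes per element × tokens, so it grows linearly with context and with every concurrent request. Cutting context from 8192 to 2048 tokens saves roughly 0.8 GB with an f16 cache on a grouped-query 8B model such as Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128); on the 70B model of the same family (80 layers) the same cut frees about 2 GB per sequence, and older models without grouped-query attention free several times more. That’s free VRAM with zero quality loss as long as your prompts genuinely fit the smaller window. If you don’t understand quant levels yet, the quantization explainer and the Q4 vs Q5 vs Q6 vs Q8 quality breakdown are worth a detour: they decide bucket #1’s size.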
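To put numbers on the KV cache for your own model, the standard sizing formula is easy to script. This is a sketch under stated assumptions: kv_cache_mib is a hypothetical helper, the result covers a single sequence, and the example shape (32 layers, 8 KV heads, head dimension 128, 2-byte f16 elements) is a typical grouped-query 8B layout rather than a universal one, so check your model's config.

```shell
#!/usr/bin/env bash
# Hypothetical helper: KV-cache size in MiB for one sequence.
# size = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * tokens
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" bytes_per_elem="$4" tokens="$5"
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1048576 ))
}

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, f16 cache (2 bytes per element)
kv_cache_mib 32 8 128 2 8192   # -> 1024
kv_cache_mib 32 8 128 2 2048   # -> 256
```

For this assumed shape the cut from 8192 to 2048 tokens frees 768 MiB per sequence; models with more layers or more KV heads free proportionally more.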

Fix it in Ollama

Ollama’s OOMs are almost always a context or KV-cache problem, because it defaults to a generous context for many models.

1. Cap the context. Set it per run or bake it into a Modelfile:

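# one-off, inside an interactive session
$ ollama run llama3.1:8b
>>> /set parameter num_ctx 2048

# permanent, via Modelfile (the model name llama3.1-8b-2k is just an example)
FROM llama3.1:8b
PARAMETER num_ctx 2048

$ ollama create llama3.1-8b-2k -f Modelfile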

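2. Enable Flash Attention. It computes attention without materializing the full score matrix, which trims working memory at long contexts with no quality degradation, and Ollama requires it before it will quantize the KV cache: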

$ export OLLAMA_FLASH_ATTENTION=1

3. Quantize the KV cache. With Flash Attention on, set the cache type. q8_0 roughly halves the cache for a negligible quality hit; q4_0 cuts it to roughly a quarter, with some loss on very long contexts:

$ export OLLAMA_KV_CACHE_TYPE=q8_0

Flash Attention plus a q8_0 cache together let you push context lengths roughly 2× higher before you run out of memory.
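The cache-type savings can be estimated from how many bytes each format spends per block of 32 values. The sketch below assumes the GGML block layouts (64 bytes at f16, 34 at q8_0, 18 at q4_0) and the same illustrative model shape as a typical grouped-query 8B (32 layers, 8 KV heads, head dimension 128); kv_cache_mib_typed is a hypothetical helper, not an Ollama command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: KV-cache size in MiB for one sequence at a given cache type.
# Assumes GGML block layouts: 32 values take 64 bytes at f16, 34 at q8_0, 18 at q4_0.
kv_cache_mib_typed() {
  local type="$1" layers="$2" kv_heads="$3" head_dim="$4" tokens="$5" block_bytes
  case "$type" in
    f16)  block_bytes=64 ;;
    q8_0) block_bytes=34 ;;
    q4_0) block_bytes=18 ;;
    *) echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  echo $(( 2 * layers * kv_heads * head_dim * tokens / 32 * block_bytes / 1048576 ))
}

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, 8192-token context
for t in f16 q8_0 q4_0; do
  printf '%s: %s MiB\n' "$t" "$(kv_cache_mib_typed "$t" 32 8 128 8192)"
done
# prints:
# f16: 1024 MiB
# q8_0: 544 MiB
# q4_0: 288 MiB
```

Under these assumptions q8_0 lands a little over half of f16 and q4_0 a little over a quarter, which is where the "roughly 2× the context" rule of thumb for q8_0 comes from.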

One trap worth knowing: KV-cache quantization only applies to architectures on Ollama’s allowlist. Force q8_0 on an unsupported architecture and the server silently falls back to f16 — so you set the flag, see no savings, and still OOM. If quantizing the cache changes nothing, that’s why; check your model’s architecture support before assuming the flag is broken.

If Ollama still spills to CPU or refuses the GPU entirely, that’s a different symptom — see Ollama not using your GPU, which covers the driver and passthrough side.

Fix it in llama.cpp

llama.cpp gives you the most direct controls. The main levers, in rough order of impact:

# -c          context size, the biggest lever
# -ngl        GPU layers; lower this to keep some layers on the CPU
# -fa         flash attention (some newer builds expect a value such as "-fa on"; check --help)
# -ctk / -ctv quantize the K and V cache
$ ./llama-server -m model.gguf \
    -c 2048 \
    -ngl 28 \
    -fa \
    -ctk q8_0 -ctv q8_0

-ngl (number of GPU layers) is your safety valve: a model that won’t fully fit can run partially on the GPU and partially on the CPU. You lose speed for every layer that lands on the CPU, but it runs. Drop -ngl by 4–8 at a time until it loads, then check nvidia-smi for the headroom you have left. If you’re routinely offloading half the model, your VRAM tier is the real constraint — the how much VRAM for Llama models guide maps model size to the card you actually need, and system RAM matters once you’re offloading.
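Rather than guessing the first -ngl value, you can start from a crude estimate and step down from there. This is only a starting-point sketch, not how llama.cpp decides anything: ngl_estimate is a hypothetical helper that assumes every layer is the same size, ignores the embedding and output tensors, and takes the amount to hold back for KV cache and buffers as a number you supply.

```shell
#!/usr/bin/env bash
# Hypothetical starting-point estimate for -ngl.
# Assumes equal-sized layers and ignores embeddings and the output layer.
# Arguments: free VRAM (MiB), model file size (MiB), layer count,
#            MiB to hold back for KV cache and working buffers.
ngl_estimate() {
  local free_mib="$1" model_mib="$2" n_layers="$3" reserve_mib="$4"
  local per_layer=$(( model_mib / n_layers ))
  local fit=$(( (free_mib - reserve_mib) / per_layer ))
  (( fit < 0 )) && fit=0
  (( fit > n_layers )) && fit=$n_layers
  echo "$fit"
}

# Invented example: 24 GiB free, 40000 MiB model with 80 layers, 4 GiB held back
ngl_estimate 24576 40000 80 4096   # -> 40
```

Treat the result as the first value to try, then confirm the real headroom with nvidia-smi as described above.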

Fix it in ComfyUI (and Stable Diffusion / Flux)

Image and video models OOM differently: weights are smaller, but a single VAE decode or a high-resolution latent can spike VRAM hard at one node.

Launch flags are the first move. Add them to your run script:

$ python main.py --lowvram
# or, on 12GB or less:
$ python main.py --lowvram --force-fp16

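--lowvram splits the diffusion model so that only part of it sits in VRAM at a time, trading generation speed for a lower peak. --medvram, which many guides mention, is an AUTOMATIC1111-style Stable Diffusion WebUI flag rather than a ComfyUI one: there it moves model components to system RAM when they’re idle, cutting peak VRAM by roughly 30–40% at the cost of 10–20% slower generation, with --lowvram as the more aggressive step beyond it. For Flux specifically, set the Load Diffusion Model node’s weight_dtype to fp8_e4m3fn, which roughly halves model VRAM.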

Move the VAE off the GPU. The decode step is a common spike. Running it on the CPU costs a few seconds but saves several hundred MB to a couple of GB at the exact moment you tend to crash:

$ python main.py --lowvram --cpu-vae

Free memory between runs. ComfyUI can hold the previous model in VRAM. Drop a “Free Model and Clip Memory” node after generation, or install a memory-management pack, so back-to-back workflows don’t stack. If you’re chasing speed and VRAM together on RTX 40/50 cards, the ComfyUI NVFP4 guide covers the format that does both.

Fix it in vLLM

vLLM is the trickiest because it pre-reserves KV-cache blocks up front for throughput. With max-model-len=32768 and max-num-seqs=256, even a 7B model’s KV cache can balloon past 20 GB — before a single real request arrives.

# --max-model-len           lock to your real prompt+gen length
# --gpu-memory-utilization  raise carefully; lower on shared hosts
# --kv-cache-dtype fp8      Ampere or newer
# --enforce-eager           skips CUDA-graph pre-allocation
$ vllm serve mymodel \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --enforce-eager

The single most effective change is --max-model-len: set it to the longest prompt plus generation you actually serve, not the model’s theoretical maximum. vLLM has to fit at least one full-length sequence into the KV-cache pool it carves out at startup, so a window far larger than your real prompts can fail the launch outright or leave room for fewer concurrent requests. After that, --enforce-eager reclaims the memory CUDA graphs pre-allocate, --kv-cache-dtype fp8 halves cache size on Ampere or newer, and --cpu-offload-gb 4 buys headroom if you’re still short.

On gpu-memory-utilization: the default 0.90 is aggressive for a 24 GB workstation card or any shared host — start at 0.85 there. On a dedicated card, bumping to 0.95 sometimes gives vLLM just enough slack to complete the allocation rather than dying mid-startup. It cuts both ways, so change it 0.05 at a time. For when vLLM is worth this complexity over Ollama at all, see vLLM vs Ollama.
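A rough way to see how --max-model-len and --gpu-memory-utilization interact is to estimate how many full-length sequences the KV-cache budget can hold. The sketch below is back-of-envelope only: vllm_full_length_seqs is a hypothetical helper, it ignores activations, CUDA graphs, and other overhead, and the example figures (15 GiB of weights, 128 KiB of f16 KV cache per token) are assumptions you should replace with your model's numbers.

```shell
#!/usr/bin/env bash
# Hypothetical back-of-envelope budget; ignores activations, CUDA graphs, and other overhead.
# Arguments: total VRAM (MiB), gpu-memory-utilization (percent), weights (MiB),
#            KV cache per token (KiB), max-model-len (tokens).
vllm_full_length_seqs() {
  local total_mib="$1" util_pct="$2" weights_mib="$3" kib_per_token="$4" max_len="$5"
  local pool_mib=$(( total_mib * util_pct / 100 - weights_mib ))
  (( pool_mib < 0 )) && pool_mib=0
  local tokens=$(( pool_mib * 1024 / kib_per_token ))
  echo $(( tokens / max_len ))
}

# Assumed: 24 GiB card at 0.90, 15 GiB of weights, 128 KiB of KV cache per token
vllm_full_length_seqs 24576 90 15360 128 32768   # -> 1
vllm_full_length_seqs 24576 90 15360 128 4096    # -> 13
```

A result of 0 means the configuration cannot hold even one full-length sequence under these assumptions, which is the point at which lowering --max-model-len or the cache precision matters more than nudging utilization.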

The fragmentation fix that applies everywhere PyTorch runs

If the error says plenty of memory is “reserved but unallocated” — you have free VRAM but the allocation still fails — that’s fragmentation, not a true shortage. Without expandable segments, PyTorch’s allocator calls cudaMalloc for each segment, and each one is an independent block that can never merge with another. Set this before launching (ComfyUI, vLLM, any PyTorch app):

$ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

It tells the allocator to reserve a large virtual address space and back it with CUDA’s virtual-memory APIs, so blocks can merge instead of stranding gaps. On long-running servers and image pipelines that allocate and free repeatedly, this alone resolves a surprising share of “I have memory but it still OOMs” cases.
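If PYTORCH_CUDA_ALLOC_CONF is already set in your environment, a plain export overwrites whatever was there. The allocator config is a comma-separated list of key:value options, so one way to add the flag without losing the rest is a small merge step. This is a sketch; with_expandable_segments is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add expandable_segments:True to an existing allocator config
# without discarding other comma-separated options.
with_expandable_segments() {
  local current="$1"
  case ",$current," in
    *,expandable_segments:*) echo "$current" ;;             # already present: leave as set
    ,,) echo "expandable_segments:True" ;;                  # nothing set yet
    *) echo "$current,expandable_segments:True" ;;          # append to existing options
  esac
}

# Apply before launching the PyTorch app from this shell
export PYTORCH_CUDA_ALLOC_CONF="$(with_expandable_segments "${PYTORCH_CUDA_ALLOC_CONF:-}")"
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Note that the helper leaves an existing expandable_segments entry untouched, even if it is set to False, so check the echoed value before you launch.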

When the fix is “buy or rent,” not “tune”

Sometimes the model genuinely doesn’t fit. A 70B model at Q4 needs ~40 GB before any context — no flag rescues that on a 16 GB card. At that point your honest options are a bigger card or cloud:

More VRAM, used: a used RTX 3090 (24 GB) remains the value pick for fitting larger models locally — see the 3090 value analysis.
New mid-range: the RTX 5060 Ti 16GB handles most 8B–14B work comfortably.
Rent for the occasional big job: spinning up a 48 GB or 80 GB instance on RunPod costs a few dollars an hour and beats buying a card you’ll use twice a month. The rent vs buy math shows where the line is.

If your OOM is in a coding workflow specifically, the model-and-tool pairing matters as much as VRAM — our sister site aicoderscope.com covers the local-coding-assistant side.

FAQ

Why does it OOM at the same point every time, mid-generation? That’s a per-step spike, not steady growth — usually a VAE decode (image models) or a long-context prefill (LLMs). Move the VAE to CPU (--cpu-vae) or cut context/resolution for that step specifically.

I have free VRAM in nvidia-smi but still get OOM. Why? Fragmentation. Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The free memory exists but isn’t contiguous enough for the block being requested.

Does KV-cache quantization hurt quality? q8_0 is effectively lossless for most use. q4_0 is fine for short and medium contexts but can degrade coherence on very long ones. Start at q8_0; only drop to q4_0 if you still need the room.

Will reducing context make my model dumber? Only if your actual prompts exceed the new window. Capping context at 2048 when you only ever send 600 tokens costs you nothing and frees real VRAM.

Restarting fixes it temporarily — is that a real fix? No. Restarting clears accumulated fragmentation and leftover models, but it’ll return. Add the expandable-segments flag and free models between runs to fix it properly.

Last updated June 14, 2026. Flags and defaults change between releases; verify against your installed version of each tool.

