Jun 10, 2026

Ollama Not Using GPU? Fix CPU-Only Inference on Windows, WSL2, and Linux (2026)

By RunAIHome Team · 12 min read

Tags: ollama, gpu, local-llm, troubleshooting, cuda

TL;DR: If Ollama feels slow, run ollama ps — a “100% CPU” line means your GPU isn’t being used at all, and a CPU/GPU split means the model is too big for your VRAM. Most cases come down to drivers, a WSL2/Docker passthrough gap, or VRAM overflow. The speed gap is real: ~42 tok/s on an RTX 3060 versus 8–14 tok/s CPU-only for Llama 3.1 8B.

What you’ll be able to do after this guide:

Confirm in 30 seconds whether Ollama is on your GPU, your CPU, or split between both
Fix the seven causes that account for nearly every “Ollama won’t use my GPU” report in 2026
Read the Ollama server log to find the one line that tells you what actually happened

Honest take: 80% of these reports are one of two things — you installed Ollama before the NVIDIA driver was working, or your model simply doesn’t fit in VRAM and is spilling to system RAM. Check ollama ps first; it tells you which camp you’re in before you change a single setting.

Step 1: Confirm the problem (don’t guess)

Before touching drivers or reinstalling anything, find out what Ollama is actually doing. Load a model and run ollama ps:

$ ollama run llama3.1:8b "hi" 
$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000    6.7 GB    100% GPU     4 minutes from now

That PROCESSOR column is the whole diagnosis:

100% GPU — working as intended. If it’s still slow, your model/quant or context is the bottleneck, not GPU detection.
100% CPU — Ollama isn’t seeing your GPU at all. This is a driver, passthrough, or unsupported-card problem.
58% / 42% CPU/GPU (a split) — Ollama found the GPU but the model doesn’t fully fit in VRAM, so layers spilled to system RAM. The GPU is fine; you’re out of VRAM.
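
If you’d rather script that check than eyeball it, here’s a minimal sketch. It assumes the PROCESSOR text looks like the examples above (100% GPU, 100% CPU, or a split ending in CPU/GPU). classify_processor is our own helper name, not an Ollama command.

```shell
# Classify the PROCESSOR column of `ollama ps`.
# Reads `ollama ps` output on stdin and prints gpu, cpu, split, or none.
classify_processor() {
  awk 'NR > 1 {                       # skip the header row
         if ($0 ~ /CPU\/GPU/)      { print "split"; found = 1; exit }
         else if ($0 ~ /100% GPU/) { print "gpu";   found = 1; exit }
         else if ($0 ~ /100% CPU/) { print "cpu";   found = 1; exit }
       }
       END { if (!found) print "none" }'   # no model loaded
}

# Demo on the sample output shown earlier in this step.
sample='NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000    6.7 GB    100% GPU     4 minutes from now'
printf '%s\n' "$sample" | classify_processor   # prints: gpu
```

On a live machine, pipe the real thing in: ollama ps | classify_processor. Only the first loaded model is reported.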

Cross-check with the GPU itself while a prompt is generating:

$ nvidia-smi

If nvidia-smi prints a table and you see an ollama process using VRAM during generation, the GPU is being used. If nvidia-smi returns command not found or NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver, your driver is the problem; jump to Cause 1.

Here’s the diagnostic flow in one table:

`ollama ps` shows	`nvidia-smi` shows	What’s wrong	Go to
100% CPU	driver error / not found	Driver missing or broken	Cause 1
100% CPU	works fine on host, fails in WSL/Docker	Passthrough not configured	Cause 2 / 4
100% CPU	works, GPU is old	Compute capability too low	Cause 5
CPU/GPU split	GPU present, VRAM full	Model bigger than VRAM	Cause 3
100% GPU on wrong card	both GPUs listed	Ollama picked the wrong GPU	Cause 6

Cause 1: Drivers missing, outdated, or installed after Ollama

This is the single most common cause. Ollama detects GPU libraries at install time and at server start, so the order of operations matters.

The version floor in 2026: Ollama supports NVIDIA GPUs with compute capability 5.0+ and driver 531 or newer. Older Maxwell/Pascal cards (compute capability 5.0–6.2, e.g. a GTX 1060) need driver 570 or newer. If your driver is below that, Ollama silently falls back to CPU.

Check your driver:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
576.52
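
To compare that number against the floor without doing it in your head, something like the sketch below works. It only looks at the major version (the part before the first dot), and the 531 and 570 floors are the ones quoted in this guide. driver_meets_floor is a hypothetical helper name.

```shell
# Compare a driver version string against a minimum major version.
# $1 = version as printed by nvidia-smi (e.g. 576.52), $2 = floor (e.g. 531)
driver_meets_floor() {
  major=${1%%.*}            # keep everything before the first dot
  [ "$major" -ge "$2" ]
}

if driver_meets_floor "576.52" 531; then
  echo "driver OK"
else
  echo "driver too old"
fi
```

To feed it your real driver, swap the literal for "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)". Use 570 as the second argument if you’re on one of the older cards mentioned above.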

Then fix in this order:

Install/update the NVIDIA driver first. On Windows, grab the latest Game Ready or Studio driver. On Linux, install the proprietary driver (e.g. sudo ubuntu-drivers install on Ubuntu) and reboot.
Verify nvidia-smi works before going further.
Reinstall Ollama after the driver is healthy. If you installed Ollama before the driver worked, its server never registered CUDA support. On Linux: curl -fsSL https://ollama.com/install.sh | sh. On Windows, reinstall the app. Then run sudo systemctl restart ollama (Linux) or restart the app (Windows).

The classic trap: people install Ollama on a fresh machine, then install GPU drivers, then wonder why it’s on CPU. Reinstall Ollama last.

Cause 2: WSL2 passthrough (the Windows + Linux gotcha)

Running Ollama inside WSL2 on Windows is its own special case, and the fix is counterintuitive.

Do not install a Linux NVIDIA driver inside WSL2. The Windows host driver is automatically projected into WSL2 as libcuda.so. Installing a Linux driver on top of that breaks the stub and sends you straight to CPU. This is the mistake that generates the most WSL2 bug reports.

The working setup:

Update the Windows NVIDIA driver (must be 470.76 or later for CUDA-in-WSL2; in practice use a current driver). Windows 11, or Windows 10 21H2+, is required.
Confirm you’re on WSL2, not WSL1:
```
# in PowerShell
wsl -l -v
```
The VERSION column must say 2. WSL1 has no GPU passthrough at all.
Inside WSL2, verify the stub is visible:
```
$ nvidia-smi
```
If that works inside WSL but Ollama still shows CPU, reinstall Ollama inside WSL after confirming nvidia-smi works.
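
You can also sanity-check the WSL version from inside the distro by looking at the kernel string. This is a rough sketch, assuming WSL2 kernels identify themselves with “WSL2” or “microsoft-standard” and WSL1 kernels with a capitalized “Microsoft”. That matches the kernels we’ve seen, but treat it as a heuristic. wsl_flavor is our own name.

```shell
# Guess the environment from a kernel version string (contents of /proc/version).
# Assumption: WSL2 kernels contain "WSL2" or "microsoft-standard";
# WSL1 kernels contain "Microsoft" without those markers.
wsl_flavor() {
  case "$1" in
    *WSL2*|*microsoft-standard*) echo "wsl2" ;;
    *Microsoft*|*microsoft*)     echo "wsl1" ;;
    *)                           echo "native" ;;
  esac
}

# Check the machine you are on (prints "native" if /proc/version is unreadable).
wsl_flavor "$(cat /proc/version 2>/dev/null)"
```

If it says wsl2, the projected driver libraries should be under /usr/lib/wsl/lib; listing that directory for libcuda.so is a quick way to confirm the passthrough is actually there.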

Cause 3: The model is bigger than your VRAM

If ollama ps shows a CPU/GPU split, nothing is broken — you’re out of VRAM, and Ollama is doing exactly what it’s designed to do: offload the layers that fit to the GPU and run the rest on CPU. That CPU portion is what tanks your tokens/sec.

A rough VRAM budget: a Q4_K_M quant needs about 0.6 GB per billion parameters, plus 1–2 GB for the KV cache at modest context. So Llama 3.1 8B Q4_K_M wants ~6–7 GB, which is why it fits cleanly on an 8GB card; a 14B Q4 wants ~10 GB; a 32B Q4 wants ~20 GB and will split on anything under a 24GB card.
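
Here’s that rule of thumb as a tiny calculator, so you can plug in your own model size. It is only the 0.6 GB-per-billion estimate from this guide plus the 1–2 GB KV cache allowance, not a measurement, and both function names are ours.

```shell
# Rough VRAM estimate for a Q4_K_M model, using this guide's rule of thumb:
# 0.6 GB per billion parameters plus 1-2 GB of KV cache at modest context.
# $1 = parameter count in billions; prints "low-high" in GB.
estimate_vram_gb() {
  awk -v b="$1" 'BEGIN { w = 0.6 * b; printf "%.1f-%.1f\n", w + 1, w + 2 }'
}

# Succeeds if the high end of the estimate fits in the card.
# $1 = parameters in billions, $2 = card VRAM in GB
fits_in_vram() {
  awk -v b="$1" -v v="$2" 'BEGIN { exit !((0.6 * b + 2) <= v) }'
}

estimate_vram_gb 8    # prints: 5.8-6.8
estimate_vram_gb 14   # prints: 9.4-10.4
estimate_vram_gb 32   # prints: 20.2-21.2

if fits_in_vram 32 16; then echo "fits"; else echo "will split"; fi   # prints: will split
```

A larger context window or a heavier quant pushes the real number up, so leave yourself some headroom.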

Fixes, in order of preference:

Use a smaller quant. Drop from Q6_K to Q4_K_M, or pull the :8b instead of :14b tag. See our quantization explainer for what you actually lose (less than people think — Q4_K_M is the sweet spot).
Shrink the context window. A huge num_ctx eats VRAM through the KV cache. If you set OLLAMA_CONTEXT_LENGTH or num_ctx to 32768 “just in case,” that alone can force a split. Drop to 4096 or 8192.
Unload other models. Ollama keeps recently-used models resident. Run ollama stop <model> to free VRAM, or set OLLAMA_MAX_LOADED_MODELS=1.
Get more VRAM. If a 32B model is your daily driver, a 24GB card is the real answer — see our VRAM-by-model guide and check whether you actually have enough system RAM for the spillover, too.
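
One catch with the environment variables above: if Ollama runs as a systemd service, variables exported in your own shell never reach it. A sketch of generating a service drop-in instead, assuming the Linux installer’s service is named ollama (as in the restart command earlier). make_override is a hypothetical helper, and the values are just the ones suggested in this list.

```shell
# Print a systemd drop-in that sets environment variables for the Ollama service.
# Each argument is one NAME=value pair.
make_override() {
  printf '[Service]\n'
  for pair in "$@"; do
    printf 'Environment="%s"\n' "$pair"
  done
}

# Preview the drop-in for a smaller context and a single resident model.
make_override OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_MAX_LOADED_MODELS=1
```

Paste the printed text into the editor that sudo systemctl edit ollama opens, then run sudo systemctl daemon-reload and sudo systemctl restart ollama so the server picks it up.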

The classic error here is the hard out-of-memory case:

Error: CUDA error: out of memory

Fix: lower the context (num_ctx 2048), use a smaller quant, or stop other loaded models — then retry.
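
For a one-off retry you can set num_ctx per request instead of server-wide. A minimal sketch, assuming a local server on the default port 11434 (the same port the Docker example in this guide publishes). generate_body is our own helper and does no JSON escaping, so keep the prompt free of quotes and backslashes.

```shell
# Build a JSON body for Ollama's /api/generate endpoint with an explicit num_ctx.
# $1 = model tag, $2 = prompt (plain text, no quotes or backslashes), $3 = num_ctx
generate_body() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' "$1" "$2" "$3"
}

# Preview the request body.
generate_body llama3.1:8b hi 2048

# To actually send it to a running server:
#   generate_body llama3.1:8b hi 2048 | curl -s http://localhost:11434/api/generate -d @-
```

Inside an interactive ollama run session, /set parameter num_ctx 2048 does the same job for that session.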

Cause 4: Docker without GPU access

A container does not get GPU access by default. If you run Ollama in Docker and it’s on CPU, the host setup is incomplete.

On the host (or inside your WSL2 distro if that’s where Docker lives):

# install the toolkit, then register the runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Then launch the container with the GPU flag:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

The two things people forget: installing nvidia-container-toolkit on the host, and the --gpus=all flag itself. Miss either and the container quietly runs on CPU.
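
A quick way to see whether the runtime registration step took effect is to look for nvidia on the Runtimes line of docker info. The sketch below assumes that line looks like “Runtimes: ... nvidia ...”, which is what we see on current Docker. has_nvidia_runtime is our own name.

```shell
# Report whether `docker info` output lists an nvidia runtime.
# Reads the text on stdin; succeeds if found, fails otherwise.
has_nvidia_runtime() {
  grep -i '^ *Runtimes:' | grep -qi 'nvidia'
}

# Demo on a sample Runtimes line (illustrative, not captured from your machine).
sample=' Runtimes: io.containerd.runc.v2 nvidia runc'
if printf '%s\n' "$sample" | has_nvidia_runtime; then
  echo "nvidia runtime registered"
else
  echo "nvidia runtime missing"
fi
```

Run it for real with docker info | has_nvidia_runtime. For an end-to-end check, docker run --rm --gpus=all ubuntu nvidia-smi should print the same GPU table you get on the host.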

Cause 5: Your GPU is too old (compute capability)

Ollama requires compute capability 5.0 or higher: Maxwell (GTX 900 series) and newer. A Kepler-era card (most of the GTX 700 series, compute capability 3.0–3.5) will never get GPU acceleration in current Ollama, no matter what driver you install. Datacenter Tesla K80s fall in the same bucket.

Check your card’s compute capability against NVIDIA’s CUDA GPUs list. If it’s below 5.0, your only paths are a newer GPU or cloud rental (see the rental note in the “your hardware can’t do this” section near the end). This is rare on home builds but common when someone tries to repurpose an ancient mining or server card.

Cause 6: Multi-GPU — Ollama picked the wrong one

With multiple NVIDIA GPUs, Ollama uses them in the order CUDA enumerates them, which isn’t always the card you want (e.g. it grabs a small secondary card instead of your main one). Pin it explicitly:

# use only GPU index 1
CUDA_VISIBLE_DEVICES=1 ollama serve

Find the indices with nvidia-smi -L. For AMD cards the equivalent variable is HIP_VISIBLE_DEVICES. If you’re seriously considering two cards, our multi-GPU NVLink vs PCIe piece covers when a second card actually helps (and when it doesn’t).
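
CUDA_VISIBLE_DEVICES also accepts a GPU UUID, which avoids surprises when CUDA’s numbering doesn’t match the order nvidia-smi prints. Here’s a sketch that pulls the UUID for a named card out of nvidia-smi -L output. It assumes the usual “GPU N: name (UUID: ...)” line format, the UUIDs in the sample are made up, and gpu_uuid_for is our own helper.

```shell
# Print the UUID of the first GPU whose name contains the pattern in $1.
# Reads `nvidia-smi -L` output on stdin.
gpu_uuid_for() {
  grep -i -- "$1" | sed -n 's/.*(UUID: \(.*\))$/\1/p' | head -n 1
}

# Demo with placeholder UUIDs (not real devices).
sample='GPU 0: NVIDIA GeForce GT 1030 (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-11111111-1111-1111-1111-111111111111)'
printf '%s\n' "$sample" | gpu_uuid_for "RTX 3090"
# prints: GPU-11111111-1111-1111-1111-111111111111
```

On a real box that becomes CUDA_VISIBLE_DEVICES="$(nvidia-smi -L | gpu_uuid_for "RTX 3090")" ollama serve.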

Cause 7: AMD GPUs and ROCm

AMD cards use ROCm instead of CUDA, and the support list is narrower. If your Radeon shows 100% CPU, ROCm either isn’t installed or your card isn’t on the supported list. Use HIP_VISIBLE_DEVICES to select the card, and confirm ROCm sees it with rocminfo. Our ROCm 7.2 on Ubuntu setup guide walks through the full install — AMD on Linux is workable in 2026, but it’s more setup than NVIDIA.

Read the log — it tells you exactly what happened

When you’re stuck, stop guessing and read the GPU discovery log. Enable verbose logging by setting OLLAMA_DEBUG=1, restart the server, and load a model.

Log locations:

Linux (systemd): journalctl -u ollama --no-pager | tail -100
Linux (manual run): ollama serve prints its log to the terminal it was started in
macOS: ~/.ollama/logs/server.log
Windows: open Explorer to %LOCALAPPDATA%\Ollama — the current log is server.log, older ones are server-1.log, etc.

Search the output for lines mentioning gpu, cuda, library, or inference compute. A healthy boot shows something like inference compute ... library=cuda ... total="24.0 GiB". If you see no compatible GPUs were discovered or it never mentions CUDA, you’ve confirmed a detection failure and can go back to Cause 1, 2, or 5 with confidence instead of trial and error.
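
If the log is long, a small filter saves scrolling. This sketch keys off the two phrases quoted above and reports whichever appeared last, so a fresh restart wins over older entries. The sample line is an abbreviated illustration built from those phrases, not a verbatim log entry, and gpu_log_verdict is our own name. On an AMD card you’d look for library=rocm instead.

```shell
# Summarize GPU discovery from an Ollama server log read on stdin.
# Prints detected, not-detected, or unknown; the most recent match wins.
gpu_log_verdict() {
  awk '/no compatible GPUs were discovered/   { v = "not-detected" }
       /inference compute/ && /library=cuda/  { v = "detected" }
       END { print (v == "" ? "unknown" : v) }'
}

# Demo on an abbreviated, illustrative log line.
sample='msg="inference compute" library=cuda total="24.0 GiB"'
printf '%s\n' "$sample" | gpu_log_verdict   # prints: detected
```

On Linux with systemd, run journalctl -u ollama --no-pager | gpu_log_verdict.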

One Linux-specific quirk worth knowing: after a driver update, the UVM kernel module can get into a bad state and Ollama drops to CPU until you reload it:

sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm

When the honest answer is “your hardware can’t do this”

Sometimes nothing is misconfigured — your GPU just isn’t big enough for the model you want, and the CPU/GPU split is the math working against you. If you’re trying to run a 70B model on a 12GB card, no setting fixes that.

Two realistic options: drop to a model that fits (a 14B or a MoE like Qwen3-30B-A3B runs well on 16–24GB), or rent a bigger GPU by the hour. For occasional large-model work, RunPod rents 24GB and 48GB cards cheaply enough that buying a second GPU rarely pays off unless you’re running them daily — we did the full break-even math in RunPod vs local GPU.

For the broader picture of which model fits which card, our Ollama vs LM Studio vs llama.cpp comparison and the RTX 5060 Ti Ollama benchmarks give you real tokens/sec to set expectations. If you’re coding with these models, our sister site aicoderscope.com covers the editor integrations, and aifoss.dev tracks the open-source tooling side.

FAQ

How do I know if Ollama is using my GPU? Run ollama ps while a model is loaded. The PROCESSOR column shows 100% GPU, 100% CPU, or a split. Confirm with nvidia-smi during generation — you should see an Ollama process holding VRAM.

Why is Ollama using CPU even though I have an NVIDIA GPU? Almost always the driver (too old or installed after Ollama), a WSL2/Docker passthrough gap, or the model not fitting in VRAM. Reinstall Ollama after the driver works, and check ollama ps for a CPU/GPU split.

How much faster is GPU than CPU in Ollama? For Llama 3.1 8B, an RTX 3060 does about 42 tok/s versus roughly 8–14 tok/s CPU-only — a 3–5× difference. The gap widens with larger models.

Does Ollama need CUDA installed separately? No. Ollama bundles the CUDA runtime it needs. You only have to install the NVIDIA driver (531 or newer; see the version floor in Cause 1). Don’t install a separate CUDA toolkit just for Ollama.

What’s the minimum GPU for Ollama? Compute capability 5.0 (Maxwell / GTX 900 series) and up. For useful speeds on 7–8B models, 8GB of VRAM is the practical floor; a 12GB RTX 3060 clears it with room left for context.

My GPU works but it’s still slow — now what? If ollama ps says 100% GPU, detection is fine. Check your quant (use Q4_K_M), reduce context length, and confirm the model fits without a split. A correctly-loaded 8B model should clear 30 tok/s on most modern cards.
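
To put a number on “slow,” ollama run accepts a --verbose flag that prints timing stats after the response. A rough sketch for extracting the generation speed, assuming the stats include a line that starts with “eval rate:” followed by a tokens/s figure. The sample values are made up for the demo, and tokens_per_sec is our own helper.

```shell
# Pull the generation speed out of `ollama run --verbose` stats read on stdin.
# Matches the "eval rate:" line only, not "prompt eval rate:".
tokens_per_sec() {
  awk '/^eval rate:/ { print $3 }'
}

# Demo on made-up stats in the assumed format.
stats='prompt eval rate:     310.50 tokens/s
eval rate:            42.00 tokens/s'
printf '%s\n' "$stats" | tokens_per_sec   # prints: 42.00
```

Against a real model: ollama run llama3.1:8b --verbose "hi" 2>&1 | tokens_per_sec. The 2>&1 is there so the stats reach the pipe whichever stream they are written to.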

Last updated June 10, 2026. Driver versions and Ollama behavior change between releases; verify against the official docs for your installed version.

Recommended Gear

RTX 3060 12GB — the practical entry point for GPU-accelerated Ollama; ~42 tok/s on Llama 3.1 8B.
RTX 3090 24GB — the used-market pick when 8–16GB keeps forcing CPU/GPU splits on bigger models.
