WSL 3 GPU Passthrough for Local AI on Windows in 2026: Near-Native Ollama, llama.cpp, and PyTorch

TL;DR: WSL 3, previewed at Microsoft Build 2026, swaps the heavy Hyper-V backend for a paravirtualized machine that, by Microsoft’s numbers, lets Linux apps use the GPU and NPU within 3-5% of bare-metal Linux speed. If you already run Ollama in WSL 2 on an NVIDIA card, the practical gain is small, because WSL 2 was already within ~5%. The real story is NPU passthrough, and that ships Intel/Qualcomm-only at launch.

What you’ll be able to do after this guide:

  • Run Ollama, llama.cpp, and PyTorch inside Linux on Windows with full GPU acceleration and no separate Linux driver install.
  • Understand whether WSL 3 is worth chasing on the Insider channel, or whether your current WSL 2 setup is already fast enough.
  • Avoid the single most common mistake that drops you to CPU-only inference (and cuts your tokens/sec to roughly a tenth).

Honest take: For an NVIDIA GPU owner, WSL 2 today already runs Ollama within 5% of native — WSL 3 is a nice-to-have, not a reason to flash an Insider build. The people who should actually care are Copilot+ laptop owners who finally get NPU passthrough.

What Microsoft actually announced

At Build 2026 (June 2, 2026), Microsoft previewed WSL 3. The headline change is architectural: it replaces the Hyper-V VM backend that WSL 2 has used since 2020 with a lighter paravirtualized machine, and routes GPU and NPU access through DirectML 2.0. Microsoft’s claim is that PyTorch, CUDA, and JAX workloads inside WSL 3 run within 3-5% of bare-metal Linux speed.

That 3-5% number is the one to anchor on, because it reframes the whole pitch. WSL 2’s GPU passthrough was never the bottleneck people assumed it was — for GPU-accelerated inference, WSL 2 already lands within roughly 5% of native Windows Ollama. So for a discrete NVIDIA GPU, WSL 3 is closing a gap that was already small. The genuinely new capability is NPU passthrough, which WSL 2 never had at all.

WSL 3 is available now through the Windows Insiders program and will roll out via Windows Update later, the same way WSL 2 updates have always shipped.

NPU passthrough is the real change — and it’s not for everyone yet

NPU passthrough at launch is limited to Copilot+ class silicon:

  • Qualcomm Snapdragon X Elite / X Elite 2 — Hexagon NPU
  • Intel Meteor Lake / Lunar Lake — Core Ultra NPU

AMD Ryzen AI support is deferred to a later date. The minimum bar for NPU passthrough is a 40 TOPS NPU, which matches the Copilot+ hardware floor. Machines below that, or with no qualifying NPU, still get the GPU improvements — they just don’t get NPU access.

Alongside WSL 3, Microsoft shipped DirectML 2.0, which adds better use of AMD’s XDNA 2 architecture, brings Intel Core Ultra Series 3 (50 TOPS) support, and tunes the Phi Silica model across AMD, Intel, and Qualcomm NPUs. The XDNA 2 work in DirectML 2.0 is the hint that AMD NPU passthrough in WSL is a “when,” not an “if.”

One reality check before you get excited about NPU inference: an NPU is not a shortcut to GPU-class tokens/sec. Decode throughput on local LLMs is bound by memory bandwidth, not raw TOPS, which is why Copilot+ laptops post single-digit-to-low-double-digit tokens/sec on 8B models while a discrete card clears 30+. We covered exactly why in NPU vs Discrete GPU for Local LLMs — read it before assuming the NPU in your new laptop replaces a GPU.
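To put rough numbers on the bandwidth argument: each generated token has to read more or less every weight once, so tokens/sec can't exceed memory bandwidth divided by model size. The sketch below is back-of-envelope only. The helper name decode_ceiling is made up for illustration, and the bandwidth and model-size figures are assumed ballpark values, not measurements from this article.

```shell
# decode_ceiling BANDWIDTH_GBS MODEL_GB
# Upper bound on decode tokens/sec: memory bandwidth / bytes read per token.
# (Hypothetical helper; real throughput lands well below this ceiling.)
decode_ceiling() {
    awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

# Assumed, illustrative inputs:
#   936 GB/s  roughly an RTX 3090's memory bandwidth
#   135 GB/s  roughly a Copilot+ laptop's LPDDR5X bandwidth
#   4.9 GB    roughly an 8B model at 4-bit quantization
decode_ceiling 936 4.9   # discrete GPU ceiling -> 191.0
decode_ceiling 135 4.9   # laptop ceiling       -> 27.6
```

The ratio matters more than the absolute values: with about 7× less bandwidth, the laptop's ceiling is about 7× lower no matter how many TOPS the NPU advertises.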

Does the 3-5% number matter for your hardware?

Here’s the comparison that actually decides whether WSL 3 is worth chasing:

| | WSL 2 today | WSL 3 (Insider) | Bare-metal Linux |
|---|---|---|---|
| NVIDIA GPU (CUDA) | ~5% slower than native | within 3-5% | baseline |
| NPU access | none | Intel/Qualcomm only | vendor-dependent |
| Setup effort | mature, well-documented | preview, expect rough edges | dual-boot or separate machine |
| Best for | anyone with an NVIDIA card now | Copilot+ laptop NPU users | absolute max throughput |

If you own a discrete NVIDIA GPU, you are in the top row, and the difference between “~5% slower” and “3-5% slower” is inside the noise of run-to-run variance. There is no compelling reason to move to an Insider build for inference speed alone. Keep your stable WSL 2 setup.

If you own a Copilot+ laptop with a qualifying Intel or Qualcomm NPU, WSL 3 is the first time you can drive that NPU from Linux tooling. That’s the upgrade worth the Insider risk — with the caveat above about what NPU throughput actually looks like.

For context on what GPU throughput you’re protecting with that “within 5%” figure: a correctly loaded 8B model clears about 95 tok/s on a used RTX 3090 and roughly 104 tok/s on an RTX 4090 under llama.cpp. On those figures the 3090 lands about 9% behind the 4090. A 5% WSL tax on top of either is about 5 tokens/sec: real, but not workflow-changing. (llama.cpp itself runs about 3-10% faster than Ollama on NVIDIA GPUs, so your engine choice matters more than your virtualization layer.)
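If you'd rather check this against your own card than trust quoted figures, the arithmetic is short. Ollama's /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds), and tokens/sec is their ratio. The sketch below uses two made-up helper names, tok_per_sec and overhead_cost, and feeds them the article's numbers rather than live measurements.

```shell
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS
# tokens/sec = eval_count / eval_duration * 1e9 (duration is in nanoseconds).
tok_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / ns * 1e9 }'
}

# overhead_cost TOK_PER_SEC PERCENT
# Tokens/sec given up to an overhead of PERCENT.
overhead_cost() {
    awk -v t="$1" -v p="$2" 'BEGIN { printf "%.2f\n", t * p / 100 }'
}

tok_per_sec 950 10000000000   # 950 tokens in 10 s   -> 95.0
overhead_cost 95 5            # 5% of 95 tok/s       -> 4.75
overhead_cost 104 5           # 5% of 104 tok/s      -> 5.20
```

For a quick reading without the API, ollama run with the --verbose flag prints an eval rate after each response.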

Setting it up: WSL 2 today, WSL 3 on Insider

The setup flow is nearly identical between WSL 2 and WSL 3 — the passthrough plumbing changed underneath, but the user-facing commands didn’t. This is the path that works on a stable Windows 11 machine right now with WSL 2, and the same steps apply once WSL 3 reaches your channel.

1. Install WSL and a distro

From an elevated PowerShell:

wsl --install -d Ubuntu-24.04
wsl --update
wsl --status

wsl --update pulls the latest kernel. On the Insider channel with WSL 3 available, the same --update is what flips you onto the new backend; check wsl --version afterward to confirm.
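wsl --version runs on the Windows side. From inside the distro, a rough equivalent is to look at the kernel release string. This is a minimal sketch with a made-up helper name, is_wsl_kernel. WSL 2 kernels carry "microsoft" in the release string; whether WSL 3 keeps that tag is an assumption.

```shell
# is_wsl_kernel RELEASE
# Succeeds when a kernel release string (as printed by `uname -r`) looks like
# a WSL kernel, e.g. 5.15.153.1-microsoft-standard-WSL2.
# Assumption: WSL 3 kernels keep the "microsoft" tag.
is_wsl_kernel() {
    case "$1" in
        *[Mm]icrosoft*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_wsl_kernel "$(uname -r)"; then
    echo "running inside WSL: $(uname -r)"
else
    echo "not a WSL kernel: $(uname -r)"
fi
```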

2. Install the Windows GPU driver — and ONLY the Windows driver

This is the step that trips up almost everyone. The CUDA libraries are exposed inside Linux automatically through /usr/lib/wsl/lib/. You do not install a Linux NVIDIA driver inside the distro. Installing a separate Linux driver inside WSL is the most common way people break passthrough and silently fall back to CPU.

Install the normal NVIDIA Windows driver (a recent Game Ready or Studio driver is fine — WSL-specific CUDA drivers are no longer required), then verify from inside WSL:

nvidia-smi

If you see your GPU and its VRAM, passthrough is live. If nvidia-smi is missing or errors, you either skipped the Windows driver or installed a Linux driver on top of it.
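For a check that goes a little beyond eyeballing nvidia-smi, something like the sketch below could work. The helper names wsl_cuda_lib_present and free_vram_mib are made up for illustration. The nvidia-smi query flags and the PyTorch call in the comments are real, but they are left commented out so the block stays harmless on a machine without a GPU.

```shell
# wsl_cuda_lib_present [DIR]
# Succeeds if DIR (default /usr/lib/wsl/lib) holds a libcuda.so* file, i.e.
# the Windows driver's CUDA libraries are mapped into the distro.
wsl_cuda_lib_present() {
    dir="${1:-/usr/lib/wsl/lib}"
    for f in "$dir"/libcuda.so*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# free_vram_mib
# Reads "name, memory.total, memory.used" lines (MiB, no units) on stdin
# and prints free MiB per GPU.
free_vram_mib() {
    awk -F', *' '{ printf "%s: %d MiB free\n", $1, $2 - $3 }'
}

# On a real machine (not run here):
#   wsl_cuda_lib_present && echo "WSL CUDA libraries mapped"
#   nvidia-smi --query-gpu=name,memory.total,memory.used \
#       --format=csv,noheader,nounits | free_vram_mib
#   python3 -c 'import torch; print(torch.cuda.is_available())'
```

The last commented line is the PyTorch-side version of the same question: it should print True once passthrough is live and a CUDA build of PyTorch is installed.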

3. Install the CUDA toolkit (only if you compile)

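For Ollama you don’t need the full toolkit; its bundled runtime is enough. If you compile llama.cpp or build PyTorch extensions, install the CUDA toolkit from NVIDIA’s WSL-Ubuntu repository, which does not ship a display driver. The keyring package below registers that repository. Install a cuda-toolkit-12-x package rather than the cuda or cuda-drivers meta-packages, which try to pull in a Linux driver: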

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-6
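After the toolkit lands, nvcc usually isn't on your PATH yet, and llama.cpp needs its CUDA backend switched on at configure time. The sketch below assumes the default /usr/local/cuda-12.6 install prefix and a local llama.cpp checkout. The helper names add_cuda_to_path and build_llama_cpp_cuda are made up; GGML_CUDA is the CMake option current llama.cpp uses for its CUDA backend.

```shell
# add_cuda_to_path [CUDA_HOME]
# Prepends the toolkit's bin/ and lib64/ to PATH and LD_LIBRARY_PATH.
# Assumes the default prefix /usr/local/cuda-12.6 unless one is passed in.
add_cuda_to_path() {
    cuda_home="${1:-/usr/local/cuda-12.6}"
    export PATH="$cuda_home/bin${PATH:+:$PATH}"
    export LD_LIBRARY_PATH="$cuda_home/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# build_llama_cpp_cuda SRC_DIR
# Configures and builds a llama.cpp checkout with the CUDA backend enabled.
build_llama_cpp_cuda() {
    cmake -S "$1" -B "$1/build" -DGGML_CUDA=ON &&
    cmake --build "$1/build" --config Release -j
}

# Typical use (not run here):
#   add_cuda_to_path && nvcc --version
#   git clone https://github.com/ggml-org/llama.cpp
#   build_llama_cpp_cuda llama.cpp
```

Putting the add_cuda_to_path call in ~/.bashrc saves redoing it in every new shell.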

4. Run Ollama and confirm it’s on the GPU

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b

In a second terminal, while a prompt is generating:

ollama ps

Expected output on a working GPU setup:

NAME           ID            SIZE     PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000  6.2 GB   100% GPU     4 minutes from now

100% GPU is the goal. If you see 100% CPU or a CPU/GPU split, the model didn’t fully offload. A correctly offloaded 8B model should clear 30 tok/s on any modern card and much more on a 3090/4090.
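If you want that check in a script rather than by eye, the PROCESSOR column is easy to classify. This is a minimal sketch that assumes the column layout shown above (NAME, ID, a two-word SIZE, a two-word PROCESSOR); ollama_offload is a made-up helper name, and the layout could change between Ollama releases.

```shell
# ollama_offload
# Reads `ollama ps` output on stdin and prints "<model> gpu|cpu|split" per row.
# Fields: $1 NAME, $2 ID, $3 $4 SIZE, $5 $6 PROCESSOR (e.g. "100% GPU").
ollama_offload() {
    awk 'NR > 1 && NF >= 6 {
        if ($5 == "100%" && $6 == "GPU")      state = "gpu"
        else if ($5 == "100%" && $6 == "CPU") state = "cpu"
        else                                  state = "split"
        print $1, state
    }'
}

# On a live system (not run here):
#   ollama ps | ollama_offload
```

Anything other than gpu means at least part of the model is sitting in system RAM.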

The error you’ll actually hit, and the fix

The single most common failure isn’t exotic — it’s the GPU disappearing from WSL after a Windows update or a sleep cycle, leaving Ollama on CPU at a tenth of the speed. The symptom in ollama ps:

NAME           ID            SIZE     PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000  6.2 GB   100% CPU     4 minutes from now

Walk it in this order:

  1. nvidia-smi inside WSL returns nothing. Your passthrough is down. Confirm the Windows NVIDIA driver is current, then run wsl --shutdown from PowerShell and reopen the distro. This re-initializes the GPU device mapping more often than any other single fix.
  2. nvidia-smi works but Ollama still says CPU. The model is bigger than free VRAM and Ollama spilled to system RAM. Check VRAM headroom and either pick a smaller quant or cap context. A 14B model at Q4 needs roughly 8-9GB of VRAM; if your card is full, drop to an 8B or reduce num_ctx.
  3. You installed a Linux NVIDIA driver inside WSL. Remove it (sudo apt-get remove --purge '^nvidia-.*') and rely solely on the Windows driver plus /usr/lib/wsl/lib/.
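The walk order above can be sketched as a small decision helper. Everything here is illustrative: triage and has_linux_nvidia_driver are made-up names, the package-name pattern is a heuristic for Ubuntu's driver packages, and the answers are fed in by hand rather than probed from the system.

```shell
# has_linux_nvidia_driver
# Reads installed package names on stdin (one per line) and succeeds if any
# looks like an Ubuntu NVIDIA driver package. Heuristic pattern, not exhaustive.
has_linux_nvidia_driver() {
    grep -Eq '^nvidia-driver-[0-9]'
}

# triage SMI_OK PROCESSOR LINUX_DRIVER
#   SMI_OK        yes|no         did nvidia-smi inside WSL list the GPU?
#   PROCESSOR     gpu|cpu|split  PROCESSOR state from `ollama ps`
#   LINUX_DRIVER  yes|no         is a Linux NVIDIA driver installed in the distro?
# Prints every applicable step, in the same order as the list above.
triage() {
    found=0
    if [ "$1" != "yes" ]; then
        echo "1: update the Windows NVIDIA driver, then wsl --shutdown and reopen the distro"
        found=1
    fi
    if [ "$1" = "yes" ] && [ "$2" != "gpu" ]; then
        echo "2: model exceeds free VRAM; pick a smaller quant or reduce num_ctx"
        found=1
    fi
    if [ "$3" = "yes" ]; then
        echo "3: purge the Linux NVIDIA driver and rely on /usr/lib/wsl/lib/"
        found=1
    fi
    [ "$found" -eq 1 ] || echo "ok: nothing to fix"
}

# Hand-fed example: nvidia-smi works, Ollama reports CPU, no Linux driver.
triage yes cpu no
```

On a live system the third answer could come from piping dpkg-query -W -f='${Package}\n' into has_linux_nvidia_driver.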

If you’re stuck on CPU-only inference more broadly — across Windows, WSL, and native Linux — our dedicated walkthrough covers every case: Ollama Not Using GPU? Fix CPU-Only Inference.
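For the second case in the list above (the model is bigger than free VRAM), one way to cap context is a derived model with a smaller num_ctx. This is a sketch, not the only route: write_ctx_modelfile is a made-up helper, the 4096 value and the llama3.1-8b-ctx4k name are arbitrary examples, and the ollama commands are left commented out.

```shell
# write_ctx_modelfile BASE_MODEL NUM_CTX OUT_FILE
# Writes an Ollama Modelfile that inherits BASE_MODEL and caps the context window.
write_ctx_modelfile() {
    printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Typical use (not run here; the derived model name is an arbitrary example):
#   write_ctx_modelfile llama3.1:8b 4096 Modelfile
#   ollama create llama3.1-8b-ctx4k -f Modelfile
#   ollama run llama3.1-8b-ctx4k
#   ollama ps        # PROCESSOR should now read 100% GPU if the model fits
```

A smaller context shrinks the KV cache, which is often the part that tips a nearly fitting model over the VRAM limit.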

Does WSL 3 finally make Windows a first-class home lab OS?

Mostly, yes — but WSL 2 already got Windows 80% of the way there for NVIDIA owners, and few people noticed. The honest framing: WSL has been a viable local-AI host for a couple of years, the ~5% GPU tax was always overstated as a blocker, and WSL 3 trims that tax to noise while adding NPU access that most desktop builders won’t use.

The case for staying on Windows + WSL instead of dual-booting Linux is the same as it’s been: you keep your games, your drivers, and your day-to-day apps, and you get a real Linux environment for the AI stack one command away. The case for bare-metal Linux is unchanged too — if you’re chasing the absolute last few percent of throughput on a dedicated inference box, or running multi-GPU setups where every driver quirk matters, native Linux still wins. For a multi-card rig, see our multi-GPU NVLink vs PCIe guide.

No GPU at all, or a laptop that can’t fit one? This is the one case where the answer isn’t WSL. Renting a cloud GPU by the hour is cheaper than a discrete card until you’re running inference daily for months — RunPod gives you a real Linux box with a 4090 or better and skips the virtualization layer entirely. If your local AI work is mostly coding-assistant usage, our sister site has a breakdown of the best AI coding tools that pair with a local or cloud model.

FAQ

Is WSL 3 available on stable Windows 11 yet? No. As of June 2026 it’s a preview on the Windows Insiders program, with a Windows Update rollout to follow. WSL 2 remains the stable path.

Will WSL 3 make my NVIDIA GPU faster? Marginally. WSL 2 already runs GPU inference within about 5% of native; WSL 3 closes that to 3-5%. The difference is inside run-to-run variance — don’t switch to an Insider build for this alone.

Can I use my laptop’s NPU for LLMs in WSL 3? Only on Copilot+ machines with Intel Meteor Lake/Lunar Lake or Qualcomm Snapdragon X Elite silicon at launch (40 TOPS minimum). AMD Ryzen AI is deferred. And note that NPU tokens/sec on LLMs is far below a discrete GPU because decode is memory-bandwidth-bound.

Do I need to install CUDA drivers inside WSL? No — and you shouldn’t. Install only the Windows NVIDIA driver. The Linux-side CUDA libraries are exposed automatically via /usr/lib/wsl/lib/. Installing a Linux driver inside WSL breaks passthrough.

Is WSL fast enough for serious local LLM work, or should I dual-boot Linux? For single-GPU NVIDIA inference, WSL is fine — the overhead is a few percent. Dual-boot or a dedicated Linux box only pays off for multi-GPU rigs or when you need the absolute maximum throughput on a 24/7 inference server.

Sources

Last updated June 15, 2026. WSL 3 is a preview; features and supported silicon may change before general availability. Verify current Insider build details before relying on them.

  • RTX 3090: used 24GB card, ~95 tok/s on an 8B model, the value pick for a WSL home lab.
  • RTX 4090: ~104 tok/s on 8B, about 9% faster than the 3090, for when you need the headroom.