Jun 13, 2026

NPU vs Discrete GPU for Local LLMs in 2026: Why Computex Laptops Lose on Tokens/Second Despite the TOPS Claims

By RunAIHome Team · 10 min read

Tags: npu, gpu, local-llm, rtx-3090, snapdragon, ryzen-ai, hardware, 2026

TL;DR: NPUs are marketed on TOPS — 45 to 80 of them on 2026 laptops — but local LLM decode is bottlenecked by memory bandwidth, not compute, so a Snapdragon X Elite NPU manages roughly 10 tokens/sec on an 8B model while a used RTX 3090 does ~95 on a 7B. The NPU wins exactly one fight: running small models for hours on battery without melting your laptop. For everything else, the discrete GPU is still the answer.

	Laptop NPU (Snapdragon / Ryzen AI / Intel)	Integrated GPU (Strix Halo iGPU)	Used RTX 3090 tower
Best for	Battery-bound small-model inference	Mid-size models, unified memory	Best-value tokens/sec for models under 24GB
8B model speed	~10–20 tok/s	~48–61 tok/s	~95 tok/s (7B)
Memory bandwidth	~120–256 GB/s (shared)	256 GB/s (~215 real)	936 GB/s
Power draw	~35W	~80W	~285W
The catch	Software support barely exists	NPU sits idle, iGPU does the work	Plugged in, not portable

Honest take: Buy an NPU laptop because you want a quiet, cool, all-day machine that happens to run a 3B model offline — not because the 50-TOPS sticker implies it competes with a GPU. It doesn’t, and the gap is physics, not drivers.

Every AI PC at Computex 2026 led with the same number: TOPS. Trillions of operations per second. Qualcomm’s Snapdragon X2 Elite Hexagon NPU hits 80 TOPS. AMD’s Ryzen AI Max “Strix Halo” XDNA 2 NPU does 50. Intel’s Panther Lake pushes its NPU to 50. Microsoft baked a 40-TOPS floor into the Copilot+ PC spec, and as of early 2026 over half of new laptops ship an NPU that clears it.

If you run local LLMs, none of those numbers tell you what you actually want to know: how fast does it generate tokens? And when you measure that, the marketing story falls apart. The NPU is not a small GPU. It’s a different tool, good at a narrow job, and local text generation mostly isn’t that job.

Why TOPS doesn’t predict tokens/sec

LLM inference has two phases, and they stress hardware differently.

Prefill (reading your prompt) is compute-bound. The model processes every input token in parallel, doing huge matrix multiplies. This is where TOPS matters, and where NPUs actually look respectable.

Decode (writing the response) is memory-bound. To generate each new token, the hardware has to stream the entire model’s weights out of memory — once per token. An 8B model at 4-bit quantization is roughly 4.5–5 GB. To produce 20 tokens per second, you must read those 5 GB twenty times a second: ~100 GB/s of sustained bandwidth just to keep up, before any overhead.

That’s the whole game. Decode speed tops out at roughly memory bandwidth ÷ model size. Compute capacity (TOPS) is almost irrelevant once you have enough of it, because the arithmetic units spend most of their time waiting for weights to arrive.
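As a rough sketch of that rule of thumb, the arithmetic can be written out in Python. The function names are illustrative, the 5 GB figure is the upper end of the 4.5–5 GB range quoted above, and the bandwidth values are the ones from the comparison table at the top. The result is a ceiling that ignores KV-cache reads and runtime overhead, not a prediction of measured speed.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token streams all weights once."""
    return bandwidth_gb_s / model_size_gb


def required_bandwidth_gb_s(target_tok_s: float, model_size_gb: float) -> float:
    """Sustained bandwidth needed to hit a target token rate."""
    return target_tok_s * model_size_gb


MODEL_GB = 5.0  # 8B model at 4-bit, upper end of the quoted 4.5-5 GB range

# Bandwidth needed for 20 tok/s on the 5 GB model
print(required_bandwidth_gb_s(20, MODEL_GB))   # 100.0

# Ceilings for the low end of laptop shared memory and for an RTX 3090
print(decode_ceiling_tok_s(120, MODEL_GB))     # 24.0
print(decode_ceiling_tok_s(936, MODEL_GB))     # 187.2
```

Measured numbers land below these ceilings; the point of the sketch is only that the ceiling moves with GB/s and model size, not with TOPS.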

So the right spec to compare isn’t TOPS. It’s GB/s.

A laptop NPU shares system memory: LPDDR5x at roughly 120–256 GB/s, and it competes with the CPU and iGPU for that bandwidth.
A used RTX 3090 has 936 GB/s of dedicated GDDR6X (its theoretical peak), which is why it remains the value king for local AI.

That’s a 4–8× bandwidth advantage for the GPU before you account for software maturity. The token-rate gap below isn’t a mystery; it’s that ratio showing up exactly where the math says it will.
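The quoted ratio follows directly from the two bandwidth figures. A minimal check, using only the numbers above (the variable and function names are illustrative):

```python
RTX_3090_GB_S = 936.0      # dedicated GDDR6X, theoretical peak
LAPTOP_LOW_GB_S = 120.0    # low end of shared LPDDR5x
LAPTOP_HIGH_GB_S = 256.0   # high end of shared LPDDR5x


def bandwidth_advantage(gpu_gb_s: float, shared_gb_s: float) -> float:
    """How many times more bandwidth the GPU has than the shared-memory laptop."""
    return gpu_gb_s / shared_gb_s


vs_fastest_laptop = bandwidth_advantage(RTX_3090_GB_S, LAPTOP_HIGH_GB_S)
vs_slowest_laptop = bandwidth_advantage(RTX_3090_GB_S, LAPTOP_LOW_GB_S)

print(f"GPU advantage: {vs_fastest_laptop:.1f}x to {vs_slowest_laptop:.1f}x")
# GPU advantage: 3.7x to 7.8x
```

This treats the laptop figure as fully available to the NPU. Because it shares that bandwidth with the CPU and iGPU, the practical gap can be wider.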

The actual benchmarks

Here’s what real measurements show, not spec sheets.

Snapdragon X Elite (45-TOPS Hexagon NPU). Running Llama-SEA-LION 8B (w4a16) through Qualcomm’s QNN/Genie runtime (the native NPU path, not a fallback), the device produces about 10 tokens/sec. Smaller ~4B models manage a few tokens/sec more. Qualcomm’s own marketing claim of “30 tokens/sec” applies only to smaller, specifically optimized models under ideal conditions. And here’s the catch that bites everyone: Ollama, llama.cpp, LM Studio, and text-generation-webui all run CPU-only on these ARM chips. To touch the NPU you need NexaSDK, AnythingLLM’s bundled QNN engine, or hand-built QNN context binaries. The mainstream local-LLM stack ignores the NPU entirely.

AMD Ryzen AI Max+ 395 “Strix Halo” (50-TOPS XDNA 2 NPU). In a benchmark of all three compute units on this chip, the ranking was GPU fastest, CPU second, NPU slowest. The integrated GPU hits roughly 48–61 tokens/sec on 8B-class models (Phi-3.5 at the top end). As of mid-2026, the 50-TOPS NPU is largely unused for general LLM inference: AMD’s Lemonade SDK targets it for a short list of specific models, but the iGPU beats it for almost everything. You bought 50 TOPS of NPU and the part doing the work is the GPU sitting next to it.

Intel Lunar Lake NPU. Around 18–20 tokens/sec on typical LLM tasks — better than Snapdragon’s NPU path, still a fraction of a discrete card.

Smartphone reference point. Llama 3 8B (4-bit) runs at about 14.9 tokens/sec inside a phone power envelope, and sustained-load testing shows the throttling problem clearly: an iPhone 16 Pro peaks at 40.35 tok/s but settles to 22.56 tok/s — a 44% drop once the chip heats up. NPUs in thin laptops face the same thermal wall.
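The throttling figure is simple to reproduce from the two iPhone numbers. A small sketch (the function name is illustrative):

```python
def throttle_drop_pct(peak_tok_s: float, sustained_tok_s: float) -> float:
    """Percentage of peak throughput lost once the chip settles under heat."""
    return (peak_tok_s - sustained_tok_s) / peak_tok_s * 100


drop = throttle_drop_pct(peak_tok_s=40.35, sustained_tok_s=22.56)
print(f"{drop:.0f}% drop")  # 44% drop
```

For any session longer than a few minutes, the sustained figure is the one to compare across devices, not the peak.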

Used RTX 3090 (no NPU at all). Roughly 95 tokens/sec on a 7B model at Q4_K_M in llama.cpp, and it’ll hold that rate indefinitely because it has a heatsink the size of the entire laptop. For comparison, an RTX 4090 does about 127 tok/s on 8B.

Line them up and the “AI PC” framing collapses:

Hardware	Native LLM path	8B-class tokens/sec	Sustained?
Snapdragon X Elite NPU	QNN/Genie only	~10	Thermal-limited
Intel Lunar Lake NPU	OpenVINO	~18–20	Thermal-limited
Strix Halo iGPU	ROCm/Vulkan	~48–61	Better (desktop)
Used RTX 3090	CUDA, everything	~95 (7B)	Yes
RTX 4090	CUDA, everything	~127	Yes

Where the NPU actually wins: tokens per watt

This isn’t a hit piece on NPUs. They win a real fight — just not the one the box advertises.

Power efficiency is where the silicon was designed to shine. Measured head-to-head, an NPU consistently runs at less than half the GPU’s power for comparable small-model work — on the order of 35W vs 75W in laptop testing, with 40–60% energy savings cited across multiple sources. A discrete RTX 3090 pulls around 285W under load. You cannot put a 285W card in a fanless ultrabook and expect to close the lid.
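As a sanity check, the 35W vs 75W figures can be turned into an energy saving, under the simplifying assumption that both parts run the same job for the same length of time (the function name is illustrative):

```python
def energy_savings_pct(low_watts: float, high_watts: float) -> float:
    """Energy saved by the lower-power part over the same wall-clock time."""
    return (1 - low_watts / high_watts) * 100


npu_vs_laptop_gpu = energy_savings_pct(35, 75)
print(f"{npu_vs_laptop_gpu:.0f}% less energy")  # 53% less energy
```

That lands inside the 40–60% range cited above. If the NPU takes longer to finish the same work, the real saving per task shrinks accordingly, so this is an upper-end estimate rather than a measurement.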

So the NPU’s job is efficient, always-available, offline inference of small models:

A 3B model summarizing your emails on battery while you’re on a plane.
Live transcription or a writing assistant that runs for hours without spinning a fan or draining the battery in 40 minutes.
Background OS features (the actual reason Microsoft mandated the NPU) — Recall, Studio Effects, live captions — that need to be cheap enough to leave on permanently.

If your use case is “run a small model occasionally, all day, unplugged,” the NPU laptop is genuinely the right tool, and a 285W tower is the wrong one. That’s the narrow, legitimate win. It’s about tokens per watt, not tokens per second.

So what should you actually buy?

You want the fastest local LLM experience, period. Get a used RTX 3090. At ~$1,050 on eBay in June 2026 (averages have hovered between $966 and $1,189 this month), its 936 GB/s and 24GB of VRAM still beat every similarly priced consumer alternative on tokens/sec for models that fit in 24GB, and CUDA means every tool just works. We’ve made this case before in our used RTX 3090 value breakdown, and the Computex hardware didn’t change it.

You need models bigger than 24GB and don’t want a multi-GPU tower. A Strix Halo box (Ryzen AI Max+ 395, 128GB unified) is the move — but understand you’re buying it for the 128GB capacity, not the 50-TOPS NPU. The iGPU does the inference, and 70B-class models run at single-digit tokens/sec because of bandwidth. See our Strix Halo deep dive for the full numbers.
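A back-of-envelope check of the single-digit claim, as a sketch. It assumes a 70B model quantized to exactly 4 bits per weight, which understates the real file size (the 8B example earlier comes out above 0.5 bytes per parameter), so the ceilings below are optimistic. The bandwidth values are the 256 GB/s spec and ~215 GB/s real figures from the table at the top; the function name is illustrative.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only size in GB; ignores KV cache and quantization overhead."""
    return params_billion * bits_per_weight / 8


size_gb = quantized_size_gb(70, 4)   # 35.0 GB of weights
ceiling_spec = 256 / size_gb         # using the spec-sheet bandwidth
ceiling_real = 215 / size_gb         # using the measured bandwidth

print(f"{size_gb:.0f} GB -> {ceiling_real:.1f} to {ceiling_spec:.1f} tok/s ceiling")
# 35 GB -> 6.1 to 7.3 tok/s ceiling
```

Even the optimistic ceiling stays under 10 tokens/sec, which is why the capacity is the selling point here and the speed is not.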

You want a laptop that runs small models offline, all day, silently. Now the NPU laptop makes sense: Snapdragon X2 Elite, Lunar Lake, or Panther Lake. Just calibrate expectations to ~10–20 tok/s on an 8B model through the correct runtime, and budget time to get past the CPU-only default in the popular tools. On AMD hardware, our AMD Lemonade guide walks through a consumer stack that actually targets the NPU today.

You don’t want to own hardware at all. For occasional heavy jobs — fine-tuning, 70B inference, batch work — renting a cloud GPU is cheaper than buying. RunPod rents an RTX 4090 or larger by the hour, and our rent-vs-buy math shows the crossover point.
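The shape of that crossover calculation is simple enough to sketch. The hourly rate below is a hypothetical placeholder, not a quoted RunPod price, and the sketch ignores electricity, resale value, and the rest of the tower; plug in current numbers before drawing conclusions.

```python
def break_even_hours(purchase_usd: float, rental_usd_per_hour: float) -> float:
    """Hours of rented GPU time that cost as much as buying the card outright."""
    if rental_usd_per_hour <= 0:
        raise ValueError("rental rate must be positive")
    return purchase_usd / rental_usd_per_hour


USED_3090_USD = 1050.0            # eBay average quoted in this article
HYPOTHETICAL_RATE_USD_HR = 0.50   # placeholder only; check the provider's current rate

hours = break_even_hours(USED_3090_USD, HYPOTHETICAL_RATE_USD_HR)
print(f"Break-even after {hours:.0f} rented hours")  # Break-even after 2100 rented hours
```

If your expected usage is well below the break-even figure, renting is the cheaper path; well above it, buying is.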

FAQ

Does the NPU help at all when I run Ollama or LM Studio on a Copilot+ laptop? No. As of mid-2026, Ollama, llama.cpp, LM Studio, and text-generation-webui all run CPU-only on Snapdragon ARM chips — they don’t touch the Hexagon NPU. You need a QNN-aware runtime (NexaSDK, AnythingLLM’s bundled engine) to use it, and even then you’re limited to models that have been converted to QNN context binaries.

If decode is memory-bound, why do vendors even advertise TOPS? Because TOPS is a big, simple number, and it’s genuinely relevant for prefill (processing long prompts) and for non-LLM AI features like image upscaling, background blur, and OCR — the workloads the NPU was designed for. It just doesn’t predict token generation speed, which is what local-LLM users care about.

Will NPU software catch up and close the gap? Software maturity will help — better runtimes could push NPUs closer to their bandwidth ceiling. But the ceiling itself is the shared system memory bandwidth (~120–256 GB/s), which is fixed by the laptop’s RAM, not the NPU. No driver update gives an LPDDR5x laptop the 936 GB/s of a 3090’s GDDR6X. The gap narrows; it doesn’t disappear.

Is a 50-TOPS NPU pointless then? Not at all — it’s the difference between AI features draining your battery and AI features you can leave on all day. It’s just the wrong metric for “how fast does my chatbot type.” Judge an NPU on tokens per watt and always-on convenience, not peak tokens per second.

What about the NPU + GPU “hybrid” systems I’ve read about? Those exist in research and data-center contexts (a GPU-NPU hybrid showed ~3.9× efficiency gains on Llama 3 8B vs an H100 for long-context work), but they’re not what’s in a consumer Copilot+ laptop. On your machine, the NPU and iGPU don’t cooperate on a single LLM request today.


Last updated June 13, 2026. Prices and specs change; verify current rates before purchasing.
