Why Local LLMs Got Good in 2026: Multi-Token Prediction, Speculative Decoding, and the MoE Efficiency Leap

Tags: local-llm, moe, speculative-decoding, multi-token-prediction, inference, 2026

TL;DR: Local models didn’t just get bigger in 2026 — they got faster at the same quality. Three techniques did the heavy lifting: multi-token prediction (~1.8× throughput, lossless), speculative decoding (1.5–3× on consumer GPUs), and sparse MoE routing (35B of weights, only 3B active per token). Together they put GPT-4-class output on a single 24GB GPU at usable speeds.

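|                   | Multi-token prediction        | Speculative decoding                        | Sparse MoE                                                 |
|-------------------|-------------------------------|---------------------------------------------|------------------------------------------------------------|
| What it speeds up | Decode (built into the model) | Decode (runtime trick, any model)           | Decode + memory pressure                                   |
| Typical gain      | ~1.8× (85–90% accept rate)    | 1.5–3× depending on accept rate             | 3B active vs 35B total → ~30B-dense feel at 3B-dense speed |
| The catch         | Model must be trained with it | Needs a good small draft model + extra VRAM | Still needs VRAM for all weights                           |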

Honest take: None of these are magic — they all trade extra compute or memory for fewer sequential steps. But stacked together on a model like Qwen3.6 35B-A3B, they’re the reason a used RTX 3090 now does everyday coding and writing that needed a cloud API call 18 months ago.


The thing that actually changed

Ask anyone who ran local models in 2024 and they’ll tell you the same story: a 7B model was fast but dumb, and a 70B model was smart but unusably slow on consumer hardware. You picked your poison. The honest answer to “should I run this locally or just call an API?” was usually “call the API.”

That tradeoff broke in 2026, and it broke for a specific, technical reason. The models people run at home — Qwen3.6, Gemma 4, GPT-OSS, Nemotron-Cascade — aren’t just bigger or better-trained than their predecessors. They’re architected so the expensive part of generation (decoding one token at a time) costs far less per token than the raw parameter count suggests.

There are three distinct techniques doing this, and they’re often confused because they all attack the same bottleneck. This article separates them, shows what each one actually buys you, and explains why the combination — not any single one — is what closed the gap to cloud.

If you just want to know which model to pull for your GPU, the open-source LLM shootout has the picks by VRAM tier. This article is the why behind those picks.


The bottleneck: why decoding is slow in the first place

Every autoregressive LLM generates text one token at a time. To produce token N, the model runs a full forward pass using tokens 1 through N−1 as context. Then it appends token N and runs another full forward pass for token N+1. The passes are strictly sequential — you can’t start computing token N+1 until you know token N.

Here’s the part that surprises people: that forward pass is memory-bandwidth-bound, not compute-bound. For a single user generating one token, the GPU spends most of its time reading the model’s weights out of VRAM, not doing math. A dense 32B model at Q4 has roughly 18GB of weights, and the GPU has to stream a large fraction of those through its memory bus for every single token.

That’s why a used RTX 3090, with 936 GB/s of memory bandwidth, does roughly 95 tok/s on a 7B model but only a fraction of that on a dense 32B: the architecture is the same, but there are more weights to read each step. The short version, which the VRAM-tier guide leans on throughout: decode speed tracks bandwidth ÷ active-weight-size far more than it tracks raw TFLOPS.
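A rough sketch of that rule of thumb: the ceiling on single-stream decode speed is memory bandwidth divided by the bytes read per token. The 4.5 bits-per-weight figure is an assumption backed out of the "32B at Q4 is roughly 18GB" number above, and the function names are illustrative. Real throughput lands below this ceiling because of KV-cache reads and runtime overhead.

```python
def weights_gb(params_billions, bits_per_weight=4.5):
    """Approximate weight size in GB (4.5 bits/weight is an assumed Q4 average)."""
    return params_billions * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s, params_billions_read_per_token):
    """Upper bound on tokens/sec if every token streams these weights exactly once."""
    return bandwidth_gb_s / weights_gb(params_billions_read_per_token)


RTX_3090_BANDWIDTH_GB_S = 936  # figure quoted in the text

for size in (7, 32):
    ceiling = decode_ceiling_tok_s(RTX_3090_BANDWIDTH_GB_S, size)
    print(f"dense {size}B: {weights_gb(size):.1f} GB of weights, ceiling ~{ceiling:.0f} tok/s")
```

The point of the estimate is the ratio, not the absolute numbers: more than four times the weights per token means less than a quarter of the ceiling on the same card.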

All three 2026 techniques are different answers to one question: how do we produce more tokens without reading all the weights that many times in sequence?


Technique 1: Sparse MoE — read fewer weights per token

Mixture-of-Experts is the most consequential of the three, and it’s the easiest to understand once you frame it around bandwidth.

A dense 32B model reads ~32B parameters’ worth of weights for every token. A Mixture-of-Experts model splits most of its parameters into “expert” sub-networks and adds a small router that picks which experts to use for each token. The model still stores all the weights, but it only reads the active ones per token.
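To make the routing idea concrete, here is a toy top-k router in plain Python. It is a sketch, not the routing scheme of any particular model: real routers are learned layers operating on hidden-state vectors, and the scalar "experts" and hand-written gate scores below are made up for illustration.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, gate_logits, k=2):
    """Run only the routed experts and mix their outputs by router weight."""
    routes = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    used = [i for i, _ in routes]
    return output, used


# Four toy experts; each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

output, used = moe_layer(1.0, experts, gate_logits=[0.1, 2.0, -1.0, 1.5], k=2)
print(f"experts touched: {used} of {len(experts)}, output = {output:.3f}")
```

All four experts exist in memory, but only the two routed ones are ever called for this token, which is the bandwidth saving in miniature.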

Two of the most-run local models in 2026 are built this way:

  • Qwen3.6 35B-A3B: 35 billion total parameters, but only ~3 billion active per token.
  • Gemma 4 26B-A4B: 26 billion total, ~4 billion active per token.

The “A3B” / “A4B” suffix literally means “active 3 billion / active 4 billion.” That naming convention is itself a sign of how central this idea became.

The payoff shows up directly in benchmarks. On an RTX 4090, a dense 32B model at Q4 lands near 60 tok/s, while a ~30B MoE model with 3B active runs around 110 tok/s at 32K context — nearly double, from a model with more total parameters. Reported numbers vary by runtime and context length (one careful Q4_K_M measurement put Qwen3.5 35B-A3B at ~78 tok/s decode on a 4090), but the direction is consistent: you get the throughput of a ~3B model with the knowledge capacity of a much larger one.

The catch — and it’s a real one — is VRAM. MoE saves bandwidth, not capacity. You still have to hold all 35B parameters in memory, so Qwen3.6 35B-A3B needs a 24GB card just like a dense model of that size would. MoE makes the smart model fast; it doesn’t make it fit on less. That distinction trips up a lot of buyers, which is why the VRAM-tier guide leads with total size, not active size.
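The capacity-versus-bandwidth split can be put in numbers with a small sketch. It assumes roughly 4.5 bits per weight for a Q4 quantization and ignores KV cache and activations, so the "fits" check is optimistic; the function names are illustrative.

```python
def weights_gb(params_billions, bits_per_weight=4.5):
    """Approximate weight size in GB (4.5 bits/weight is an assumed Q4 average)."""
    return params_billions * bits_per_weight / 8


def moe_footprint(total_billions, active_billions, vram_gb):
    """Storage is set by total parameters; per-token reads by active parameters."""
    stored = weights_gb(total_billions)
    read = weights_gb(active_billions)
    return {"stored_gb": stored, "read_per_token_gb": read, "fits": stored < vram_gb}


# A 35B-total / 3B-active model on a 24GB card and on a 16GB card.
for vram in (24, 16):
    print(vram, moe_footprint(35, 3, vram))
```

The per-token read shrinks with the active count, but the stored size does not, so the 16GB case fails regardless of how few parameters are active.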


Technique 2: Speculative decoding — guess ahead, verify in parallel

Speculative decoding is a pure runtime trick. It doesn’t change the model’s weights or quality at all — it changes how you run inference.

The idea: pair your big “target” model with a small, fast “draft” model. The draft model cheaply generates a short run of candidate tokens — say the next 4. Then the target model does one forward pass that verifies all 4 candidates at once (verification is parallel; generation is not). Every candidate the target agrees with is accepted for free; the first disagreement is corrected, and you start the next round from there.

The crucial property is that the output matches what the target model would have produced alone: token-for-token under greedy decoding, and drawn from the same distribution when sampling. Speculative decoding is lossless in that sense; tuning how many tokens the draft proposes changes speed only, never the output quality. That’s what separates it from quantization or distillation, which trade quality for speed.
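The draft-then-verify loop can be sketched with toy "models" that map a token sequence to the next token. This covers the greedy case only, and everything here is illustrative: the two toy functions are invented, and a real runtime scores all drafted positions in one batched forward pass rather than calling the target once per position as this sketch does.

```python
def greedy_decode(target, prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of verify rounds)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    verify_rounds = 0
    while len(seq) < goal:
        # Draft proposes k tokens, each conditioned on its own earlier guesses.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # Verify: stands in for a single batched target pass over all k positions.
        verify_rounds += 1
        for i, token in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if token != expected:
                seq.extend(proposal[:i])  # keep the agreed prefix
                seq.append(expected)      # replace the first wrong guess
                break
        else:
            seq.extend(proposal)
            seq.append(target(seq))       # every guess accepted: take one bonus token
    return seq[:goal], verify_rounds


# Toy models: the draft agrees with the target except at every third position.
def target(seq):
    return (seq[-1] * 3 + len(seq)) % 11


def draft(seq):
    guess = target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 11


tokens, rounds = speculative_decode(target, draft, [1], n_new=20)
assert tokens == greedy_decode(target, [1], n_new=20)
print(f"20 tokens in {rounds} verify rounds instead of 20 sequential target passes")
```

The assertion is the lossless property in miniature: whatever the draft guesses, the accepted sequence is exactly what the target alone would have produced.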

Real-world gains:

  • General reports put speculative decoding at 2–3× speedup with no quality change.
  • In llama.cpp specifically, users see 1.5×–3× tokens/sec depending on how often the draft model’s guesses are accepted.
  • NVIDIA has demonstrated up to 3.6× throughput on H200-class hardware with tuned draft models.

The acceptance rate is everything. If your draft model agrees with the target 80% of the time, most of your speculative tokens stick and you get a big speedup. If the draft is poorly matched and only agrees 30% of the time, you’ve paid for draft passes that get thrown away and you might even go slower. This is why picking a draft model from the same family (e.g. a 0.5B Qwen drafting for a 32B Qwen) matters — they make similar predictions.
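A back-of-the-envelope model shows why acceptance rate dominates. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the draft cost ratio of 0.2 is a made-up value chosen for illustration rather than a measurement.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per target pass, assuming each drafted token is
    accepted independently with probability accept_rate (a simplification)."""
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Tokens per round divided by the round's cost in target-pass units.
    draft_cost_ratio: hypothetical cost of one draft step relative to one target pass."""
    round_cost = 1 + draft_len * draft_cost_ratio
    return expected_tokens_per_pass(accept_rate, draft_len) / round_cost


for rate in (0.8, 0.3):
    speedup = estimated_speedup(rate, draft_len=4, draft_cost_ratio=0.2)
    print(f"accept rate {rate:.0%}: estimated speedup {speedup:.2f}x")
```

Under these assumed costs the 80% case comes out well ahead of plain decoding, while the 30% case drops below 1×, which is the "might even go slower" scenario.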

By late 2025 this moved from research curiosity to production default: vLLM and TensorRT-LLM ship native support, and llama-server (the backbone of many local setups) supports several implementations. The cost is extra VRAM for the draft model and some tuning of the speculative length, a small price for a 1.5–3× speedup.


Technique 3: Multi-token prediction — bake the drafting into the model

Multi-token prediction (MTP) is the technique people most often confuse with speculative decoding, because the mechanism at inference time looks similar. The difference is where the “drafting” comes from.

In plain speculative decoding, the draft is a separate small model bolted on at runtime. MTP, popularized by DeepSeek-V3, trains the model itself to predict several future tokens from a single hidden state. The model carries lightweight MTP “heads” that propose the next token or two, and the main model verifies them in parallel — self-drafting, no second model required.

The numbers DeepSeek reported are why this caught on:

  • The second-token prediction is accepted 85–90% of the time across generation topics.
  • That high acceptance translates to roughly 1.8× decode TPS in their inference serving.
  • Independent serving stacks (SGLang on AMD Instinct, for example) measured 1.25×–2.11× speedups depending on the workload.

Because the drafting heads are trained jointly with the model, their guesses match the target distribution far better than a generic small draft model would — hence the high accept rate. And like speculative decoding, MTP is lossless: the verification step guarantees the output matches greedy decoding from the base model.
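The link between the accept rate and the throughput figure is simple arithmetic. As a sketch, assume a single extra head that drafts one token per step (matching the "second-token" figure quoted above) and ignore the cost of running the head:

```python
def mtp_tokens_per_step(second_token_accept_rate):
    """One guaranteed token plus one drafted token that sticks with the given probability."""
    return 1 + second_token_accept_rate


for rate in (0.85, 0.90):
    print(f"accept rate {rate:.0%}: ~{mtp_tokens_per_step(rate):.2f} tokens per step")
```

That gives an idealized ceiling slightly above the reported ~1.8×, with the difference plausibly going to overhead the sketch leaves out.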

MTP’s downside is that you can’t add it to a model after the fact — it has to be trained in. That’s fine when you’re pulling DeepSeek or another MTP-trained model, but it means the technique only helps for models whose authors chose to include it. Tooling is catching up: llama.cpp added MTP support in 2026, so GGUF builds of MTP-trained models can finally use the heads instead of stripping them out.


Why the combination is the real story

Each technique alone is a respectable speedup. Stacked, they compound, and that compounding is what actually moved local inference past the “good enough to replace an API” line.

Walk through a concrete 24GB build running an MoE model with the other two techniques layered on:

  1. MoE drops your per-token weight reads from ~32B to ~3B. Using the RTX 4090 figures above, a model that would run at about 60 tok/s dense now runs ~110 tok/s.
  2. Speculative decoding or MTP then multiplies that number by accepting multiple tokens per verification pass — call it a conservative 1.6× in practice.
  3. The net result is a smart, ~30B-knowledge model generating well above 150 tok/s on a single consumer card.
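The arithmetic behind those three steps, as a sketch. It treats the MoE gain and the speculative multiplier as independent factors, which is an assumption; real gains may not compose this cleanly.

```python
def stacked_tok_s(moe_tok_s, speculative_multiplier):
    """Combine the two gains as independent multipliers (an assumption)."""
    return moe_tok_s * speculative_multiplier


DENSE_TOK_S = 60          # dense 32B at Q4, figure from the text
MOE_TOK_S = 110           # ~30B MoE with 3B active, figure from the text
SPEC_MULTIPLIER = 1.6     # the "conservative" multiplier from step 2

combined = stacked_tok_s(MOE_TOK_S, SPEC_MULTIPLIER)
print(f"~{combined:.0f} tok/s, about {combined / DENSE_TOK_S:.1f}x the dense baseline")
```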

For comparison, GPT-OSS 20B (another MoE, ~3.6B active) already hits 225 tok/s on an RTX 4090 at 8K context — see the GPT-OSS hardware guide for the full benchmark table. Nemotron-Cascade 2, with 30B total / 3B active, does 187 tok/s on an RTX 3090. These aren’t toy 7B models — they’re models that hold their own against cloud APIs on coding and writing, running faster than you can read.

The human-factor number that makes this matter: comfortable reading speed is roughly 7–10 tok/s. Anything above ~30 tok/s feels instant. So a local model at 150+ tok/s isn’t just “fast enough” — it’s leaving headroom for agentic workflows that generate far more tokens than a human reads, like multi-step tool use or code refactoring across many files.


What this does not fix

It’s worth being honest about the limits, because the hype around “local is solved now” overshoots.

Prefill (prompt processing) is a different bottleneck. All three techniques speed up decode, that is, generating new tokens. Reading a huge prompt (a long codebase, a 128K-token document) is compute-bound: speculative decoding and MTP only apply to generation, and MoE’s bandwidth savings are not what limits prefill. Time-to-first-token on a big context is still the real wait on consumer hardware.

VRAM capacity is still the wall. MoE made smart models fast, not small. If a model’s total weights don’t fit in your VRAM, none of this helps — you’re spilling to system RAM and watching speeds collapse. The quantization guide is still where the “does it fit?” question gets answered.

Speculative gains are workload-dependent. Predictable text (boilerplate code, structured output) gets high accept rates and big speedups. Highly creative or unusual text gets lower accept rates and smaller gains. The “1.8×” and “2–3×” figures are averages, not guarantees.

The biggest models still don’t run locally. The frontier MoE models (Llama 4 Maverick, DeepSeek V4, MiniMax M3) are 400B+ total parameters. Sparse activation makes them servable but doesn’t make them fit on a 24GB card. For those, the API is still the answer.


FAQ

Is speculative decoding the same as multi-token prediction? No. Speculative decoding uses a separate small draft model at runtime and works with almost any target model. MTP trains the drafting ability into the model itself via extra prediction heads. Both verify candidates in parallel and both are lossless, but MTP only works for models trained with it, while speculative decoding can be added to any model if you have a compatible draft model.

Does MoE make models use less VRAM? No — this is the most common misconception. MoE reduces the bandwidth read per token (only active experts are read), which boosts speed. But all experts must be stored in VRAM, so a 35B MoE needs the same memory as a 35B dense model. It makes smart models fast, not small.

Are these speedups lossless or do they hurt quality? Speculative decoding and MTP are lossless — the verification step guarantees output identical to running the base model alone. MoE is an architectural choice made at training time, so there’s no quality “loss” at inference; the model is what it is. Quantization (a separate topic) is the technique that actually trades quality for size.

Which local model benefits from all three? MoE models trained with MTP heads — DeepSeek’s family is the clearest example — get all three when run on a stack that supports MTP and speculative decoding (recent vLLM, llama.cpp, SGLang). Pure-MoE models like Qwen3.6 35B-A3B get MoE’s bandwidth win plus speculative decoding if you add a draft model.

Do I need a new GPU to use these? No. Speculative decoding and MoE both run on existing cards; a used RTX 3090 handles them fine. The gains come from software and model architecture, not new silicon. If anything, these techniques are why a GPU that launched in 2020 is still viable for serious local AI in 2026.

What about renting a GPU to test bigger MoE models first? If you want to try a 100B+ MoE before committing to hardware, renting by the hour on RunPod is cheaper than buying multiple cards just to find out the model is overkill for your use case.


The bottom line

Local LLMs got good in 2026 not because of one breakthrough but because three independent efficiency techniques matured at once and started shipping in the models and runtimes home-labbers actually use. MoE cut the weights read per token. Speculative decoding and multi-token prediction let the GPU produce several verified tokens per pass instead of one. Stacked on a model like Qwen3.6 35B-A3B or GPT-OSS 20B, they turn a single 24GB consumer card into something that generates GPT-4-class text faster than you can read it.

The wall that’s left is VRAM capacity and prefill on long contexts — real limits, but narrower ones than the “fast or smart, pick one” tradeoff that defined local AI two years ago. For which specific model to run on your hardware, start with the VRAM-tier guide and the open-source shootout.

  • RTX 3090 — the 24GB used-market value pick that runs MoE models and speculative decoding without a new purchase.

Last updated June 18, 2026. Benchmarks vary by runtime, quantization, and context length; verify current numbers for your exact setup before buying hardware.
