Jun 11, 2026

EXO Framework in 2026: Can You Pool RTX 3090s to Beat a DGX Spark? The Honest Distributed-Inference Reality

By RunAIHome Team · 11 min read

distributed-inferencelocal-llmgpurtx-3090apple-siliconexo

TL;DR: EXO turns a pile of computers into one big memory pool so you can load models that won’t fit on a single card. That’s real and useful. But the viral claim — “three used RTX 3090s beat a $3,999 DGX Spark at 3× the throughput” — does not survive contact with the data. EXO’s GPU support is strongest on Apple Silicon; on NVIDIA it’s a community fork, and distributed inference adds capacity, not speed.

	EXO on Apple Silicon	EXO on NVIDIA (exo-cuda fork)	DGX Spark (single box)
Best for	Pooling Mac unified memory to run 400B–671B models	Experimenters who already own multiple NVIDIA cards	One-box 120B inference with zero networking
Cost (June 2026)	4× M3 Ultra Mac Studio = $40k+	3× used RTX 3090 ≈ $3,200–$3,600 + platform	$3,999–$4,699
Real throughput	DeepSeek V3.1 671B: 32.5 tok/s on 4 nodes	Lower than a single native-CUDA GPU (tinygrad backend)	gpt-oss-120b: ~38.5 tok/s
The catch	Costs more than most home labs will ever spend	GPU support not in the official build	Capped at ~128GB; no expansion

Honest take: If you want to run a model that doesn’t fit on one card, EXO is a genuinely clever tool — and it’s at its best clustering Macs. If your real goal is more tokens per second on NVIDIA hardware, pool your 3090s with llama.cpp or vLLM instead, and don’t expect distributed inference to multiply your speed.

What EXO actually is

EXO (from exo-explore) is an open-source framework that connects multiple devices on your network into a single inference cluster. Instead of needing one GPU big enough to hold an entire model, EXO splits the model into shards and spreads those shards across whatever hardware you have — Mac Studios, desktops, laptops, even phones — pooling their memory into one virtual machine. It exposes an OpenAI-compatible API, so tools like Open WebUI and Continue talk to it without modification.

The pitch writes itself: you’re “one GPU short” of running a 70B model, so instead of buying a $2,000 card, you network two machines you already own and run it across both. The project is actively maintained — the latest release is v1.0.71 (April 23, 2026), built on 2,300+ commits.

That part is all true. The problem is what happens when the marketing turns “you can run bigger models” into “you can beat a dedicated AI box on speed for cheap.” Those are different claims, and only one of them holds up.

The viral claim, line by line

The version of this story making the rounds on r/LocalLLaMA goes something like: “Three used RTX 3090s — 48GB pooled — beat a $3,999 DGX Spark at 3× the throughput, 124 tok/s vs 38.5 tok/s on a 120B model.” Let’s take it apart.

“48GB pooled” from three RTX 3090s. Three 24GB cards is 72GB, not 48GB. 48GB is two cards. Small thing, but it tells you the number wasn’t checked.

“$2,400 for three used 3090s.” Not anymore. The 2026 memory shortage that doubled DDR5 and SSD prices dragged used GPU prices up with it. As of June 9, 2026, used RTX 3090 listings on eBay average $1,050–$1,210, with a typical range of roughly $900–$1,500 across hundreds of completed sales. Three of them is $3,150–$3,630 — before you add a motherboard with three usable PCIe slots, a 1,500W power supply to feed roughly 1,050W of GPU draw, and risers or a mining frame to physically fit them. The all-in cost lands above a DGX Spark, not below it.

“3× the throughput, 124 tok/s.” This is the claim that matters, and it’s where the whole thing falls apart. Distributed inference does not work the way the number implies.

Why distributed inference adds capacity, not speed

Here’s the part the clickbait skips. There are two ways to split a model across devices:

Pipeline parallelism assigns each device a contiguous block of layers. A token flows through device 1’s layers, then gets passed over the network to device 2, and so on. For a single request, only one device is busy at a time — the others wait. You get the memory of all of them, but roughly the speed of one, minus network overhead.
Tensor parallelism splits each layer across devices so they work simultaneously. This can speed things up, but it hammers the interconnect with synchronization traffic on every layer. EXO’s own numbers put tensor parallelism at “up to 1.8× on 2 devices and 3.2× on 4 devices” — and that’s a best case on a fast link, not a guarantee.

EXO’s own published benchmarks make the point better than any argument. On a 4-node cluster of M3 Ultra Mac Studios connected with RDMA over Thunderbolt 5 — one of the fastest consumer interconnects that exists — Jeff Geerling measured DeepSeek V3.1 671B at:

Nodes	DeepSeek V3.1 671B (8-bit)
1 node	21.1 tok/s
2 nodes	27.8 tok/s
4 nodes	32.5 tok/s

Four nodes deliver 1.5× the throughput of one — not 4×, not even 2×. And that’s on Thunderbolt 5 RDMA, which Apple says cuts inter-device latency by ~99% versus regular networking. The reason four Macs help at all here is that a single 256GB node can barely hold 671B in 8-bit; the cluster’s real job is fitting the model, and the modest speed bump is a bonus.

Now imagine doing that over gigabit or even 10GbE Ethernet between three desktops, with EXO’s NVIDIA path running on a less-optimized backend. The idea that this configuration produces 124 tok/s — more than 3× what a purpose-built, tightly-integrated DGX Spark does on a similar-sized model — isn’t supported by anything EXO has published. The honest expectation for a 3× 3090 EXO cluster is throughput in the neighborhood of a single 3090, with the upside being the 72GB of combined VRAM.

The NVIDIA asterisk nobody mentions

There’s a bigger problem for the 3090 fantasy: EXO’s official builds don’t run on NVIDIA GPUs.

Per EXO’s own README, on Linux the framework currently “runs on CPU,” with GPU support listed as under development. Its first-class GPU backend is MLX — Apple Silicon. That’s why every headline EXO benchmark is a stack of Mac Studios, not a rack of 3090s.

NVIDIA acceleration exists only through a community fork, exo-cuda by developer Scottcjn, which restores CUDA inference via the tinygrad backend (tinygrad was removed from mainline EXO during the v1 rewrite). It’s been confirmed working on older data-center cards like the Tesla V100 and M40. It’s a legitimate project, but it’s a fork — you’re installing unofficial code, and tinygrad’s CUDA kernels are not as optimized as the native kernels in vLLM or llama.cpp. EXO’s NVIDIA throughput is therefore lower than running the same model on a single GPU that has enough VRAM. The framework’s advantage on NVIDIA is strictly about enabling models that won’t otherwise fit.

If you want to actually pool VRAM across NVIDIA cards today, you don’t need EXO at all.

What you should use to pool NVIDIA GPUs instead

For multiple cards in one box — the realistic home-lab setup — the mature tools are llama.cpp and vLLM.

llama.cpp splits a model’s layers across GPUs with the --tensor-split flag, putting some layers on each card. A 70B Q4 model needing ~40GB simply spreads across two 24GB cards. It’s pipeline-style, so single-request speed is roughly that of one card, but it’s rock-solid and supports mixed hardware. llama.cpp also has an RPC mode for spreading across separate machines on a LAN, much like EXO — but with native CUDA kernels. (For the deeper trade-offs between layer-splitting and NVLink, see our multi-GPU NVLink vs PCIe guide.)

vLLM is the choice when you want throughput from multiple cards, not just capacity. Its tensor-parallel implementation keeps all GPUs working at once and is built for batched, concurrent serving. Multi-3090 vLLM setups have been reported in the 250–350 tok/s range on batched requests — but that’s aggregate throughput across many simultaneous prompts, not the speed of a single conversation. We break down when each engine wins in vLLM vs Ollama.

For reference, a single RTX 3090 on a 70B model at Q2 manages around 10 tok/s — slow, because 70B is too big to be comfortable on 24GB. That’s exactly the situation where pooling a second or third card earns its keep: not to go faster, but to load a bigger or higher-quality quant.

So when does EXO actually make sense?

EXO is the right tool in a narrow, real set of cases:

You own multiple Apple Silicon Macs. This is EXO’s home turf. With Thunderbolt 5 RDMA between M3/M4 Ultra Studios, you can pool hundreds of gigabytes of unified memory and run 235B–671B models that no single consumer machine can hold. It’s expensive (a 4× M3 Ultra cluster runs north of $40,000), but for that money it’s a quiet, power-efficient way to run frontier-class open weights locally.
You have a pile of mismatched devices and a model that won’t fit anywhere. EXO’s heterogeneous sharding genuinely lets a desktop GPU, a laptop, and a Mac cooperate. If the alternative is “can’t run it at all,” a slow run beats no run.
You’re experimenting. Distributed local inference is a fascinating space, and EXO is one of the most approachable ways to learn it. Just go in knowing the official NVIDIA story is “use the fork.”

What EXO is not is a cheat code to turn cheap used GPUs into a DGX Spark killer. For a single user who wants the best tokens-per-second-per-dollar on one model, a single capable card — or a DGX Spark, or a used 3090 for models that fit in 24GB — beats a distributed cluster on both speed and simplicity.

Don’t have the hardware to test this?

If you want to benchmark a 120B-class model before committing to a multi-GPU build, renting is cheaper than buying three cards to find out distributed inference isn’t faster. A few hours on a cloud GPU through RunPod costs less than the shipping on a used 3090 and lets you measure real numbers on your actual model and quant. We covered the rent-vs-buy break-even in detail elsewhere on the site — for a one-off “does this even work” test, the cloud wins.

FAQ

Does EXO support NVIDIA GPUs in 2026? Not in the official builds — on Linux, mainline EXO runs on CPU, with GPU support under development. NVIDIA CUDA acceleration is available through the community exo-cuda fork (tinygrad backend), confirmed on cards like the Tesla V100. For native CUDA multi-GPU, use llama.cpp or vLLM instead.

Will three RTX 3090s beat a DGX Spark with EXO? No. The cost math no longer favors it (three used 3090s run $3,200+ in June 2026, around or above a DGX Spark), and distributed inference doesn’t multiply single-request speed. Expect throughput near a single 3090, with the benefit being 72GB of pooled VRAM rather than 24GB.

Does pooling GPUs make inference faster? Generally no, for a single request. Pipeline parallelism gives you more memory at roughly one card’s speed. Tensor parallelism (vLLM, or EXO at best case) can speed things up — up to ~3.2× on four devices — but only with a fast interconnect, and mostly for batched serving rather than a single chat.

What’s the cheapest way to run a 70B model locally? Two used 24GB cards (or one 24GB card at Q2/Q3 if you tolerate ~10 tok/s). Pooling with llama.cpp’s --tensor-split is the simplest path. See our used RTX 3090 value analysis.

Is EXO worth it for an all-Mac setup? Yes, if you already own multiple Apple Silicon Macs and need to run models larger than any single one can hold. That’s the scenario EXO is built and benchmarked for.

Sources

Last updated June 11, 2026. Prices and specs change; verify current rates before purchasing.

Recommended Gear

RTX 3090 (used, 24GB) — still the value pick for pooling VRAM on NVIDIA via llama.cpp or vLLM.

Was this article helpful?