Multi-GPU for Local AI in 2026: NVLink vs PCIe, and When Two Cards Actually Help

The VRAM ceiling is what forces most people toward multi-GPU territory. Llama 3.3 70B at Q4_K_M quantization needs roughly 43 GB of GPU memory. No single consumer card clears that bar—the RTX 4090 maxes at 24 GB, the RTX 5090 at 32 GB, and everything in between falls short. Two cards change the math.

But “two cards” means very different things depending on how they communicate. NVLink gives GPUs a direct high-bandwidth wire between them. PCIe routes the same traffic through your CPU’s memory controller. In 2026, the distinction matters more than it used to—because NVLink has quietly disappeared from almost every consumer GPU on the market.

Here’s what the bandwidth numbers actually mean for inference, what modern frameworks do with multiple GPUs, and when the extra complexity is worth it.

NVLink is NVIDIA’s peer-to-peer GPU interconnect. Instead of routing inter-GPU data through the CPU memory bus—as PCIe does—NVLink provides a direct path with its own dedicated bandwidth pool.

On the RTX 3090, NVLink 3.0 delivers 112.5 GB/s of aggregate bidirectional bandwidth between two cards. You need a physical NVLink bridge—a short PCB bar that clips across both GPUs. These sell for $40–$80 used on eBay.

The problem: NVLink was removed starting with the Ada Lovelace generation (RTX 40-series) and doesn’t return on Blackwell consumer cards.

| GPU | NVLink | Notes |
|---|---|---|
| RTX 3090 | Yes: NVLink 3.0, 112.5 GB/s | Only the base 3090, not the 3090 Ti |
| RTX 3090 Ti | No | Connector physically removed |
| RTX 4070 / 4080 / 4090 | No | Entire Ada consumer lineup |
| RTX 5080 / 5090 | No | Consumer Blackwell lineup |
| RTX PRO 6000 Blackwell | No (PCIe 5.0 x16 only) | Workstation card, ~$6,000+ |

NVIDIA CEO Jensen Huang confirmed the RTX 4090 removal was intentional: the freed die area and I/O went toward Ada’s AI processing hardware. The RTX PRO 6000 Blackwell workstation card does not bring NVLink back either; NVIDIA’s published specifications list PCIe 5.0 x16 as its only interconnect. NVLink 5 at 1,800 GB/s is a data-center Blackwell feature, and at $6,000+ even the workstation card is outside the home lab conversation.

For anyone building a multi-GPU local AI setup in 2026, the RTX 3090 (specifically the non-Ti variant) is the only consumer card where NVLink is an option.

The RTX 3090 Ti is the most common trap here: it looks like the obvious upgrade, but NVIDIA removed the NVLink connector. Only the base RTX 3090 supports it.

PCIe inter-GPU bandwidth: what you actually get

Without NVLink, two GPUs communicate over PCIe—which means routing through the CPU’s memory controller. The bandwidth depends on the PCIe generation and slot width:

| Interconnect | Bandwidth (per direction) | Common hardware |
|---|---|---|
| PCIe 3.0 x16 | 16 GB/s | Pre-2021 boards |
| PCIe 4.0 x16 | 32 GB/s | Mainstream Z590/B550 and newer |
| PCIe 5.0 x16 | 64 GB/s | Z890/X870 (2024+ platforms) |
| NVLink 3.0 (RTX 3090) | 56.25 GB/s | Bridged 3090 pair |

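PCIe 4.0, the most common current standard, provides 32 GB/s per direction against NVLink 3.0’s 56.25 GB/s, so NVLink carries roughly 1.75× the bandwidth. A PCIe 5.0 x16 link (Z890 or X870) would close the gap at 64 GB/s, but only for cards that support it: the RTX 3090 and 4090 are PCIe 4.0 devices and negotiate 4.0 speeds in any slot, while the RTX 50-series supports PCIe 5.0.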
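The per-direction figures in the table follow from each generation's per-lane signalling rate and the 128b/130b line encoding. Here is a minimal Python sketch of that arithmetic. The function name and the x8 comparison are illustrative additions, not something from a vendor tool; x8 is included because many consumer boards drop to x8/x8 when two slots are populated, which is worth checking in the board manual.

```python
# Per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130       # 128b/130b line encoding (PCIe 3.0 and later)
NVLINK3_PER_DIR = 56.25    # GB/s per direction, bridged RTX 3090 pair


def pcie_gb_per_s(gen: int, lanes: int = 16) -> float:
    """Raw bandwidth per direction in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING * lanes / 8  # divide by 8: bits to bytes


if __name__ == "__main__":
    for gen in (3, 4, 5):
        for lanes in (16, 8):
            bw = pcie_gb_per_s(gen, lanes)
            ratio = NVLINK3_PER_DIR / bw
            print(f"PCIe {gen}.0 x{lanes:<2}: {bw:5.2f} GB/s  (NVLink 3.0 is {ratio:.2f}x)")
```

The table's 16, 32, and 64 GB/s entries are these values rounded up; an x8 link halves each of them.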

Whether that bandwidth difference matters for inference depends entirely on which parallelism strategy the framework uses.

Two strategies, two bandwidth profiles

There are two fundamentally different ways to distribute a model across GPUs, and they have very different interconnect demands.

Pipeline parallelism (layer split)

GPU 0 handles the first half of the model’s transformer layers; GPU 1 handles the second half. At the layer boundary, the activation tensor transfers from one GPU to the other: a few megabytes when a prompt batch is being processed on a typical 70B model, and only tens of kilobytes per generated token.

This transfer happens once per token at each layer boundary. The bandwidth demand is low: even PCIe 3.0 handles it without becoming a bottleneck. llama.cpp uses this approach by default (--split-mode layer); the --tensor-split flag, despite its name, only sets the proportion of the model each GPU receives.

The cost is efficiency: GPU 0 sits idle while GPU 1 processes its half, and vice versa. You get the combined VRAM of both cards, but autoregressive token generation is essentially sequential between the two GPUs.
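To put a number on how small the layer-boundary transfer is, here is a rough Python sketch. It assumes float16 activations and the hidden size of 8,192 from the published Llama 3.3 70B configuration; the function names are made up for illustration, and link latency is ignored, so the times are lower bounds.

```python
HIDDEN_SIZE = 8192   # Llama 3.3 70B hidden dimension (assumed from the model config)
BYTES_FP16 = 2       # assumption: activations cross the boundary as float16
PCIE3_GB_PER_S = 16.0  # PCIe 3.0 x16, per direction


def boundary_bytes(tokens: int) -> int:
    """Size of the activation tensor handed from GPU 0 to GPU 1."""
    return tokens * HIDDEN_SIZE * BYTES_FP16


def transfer_us(num_bytes: int, gb_per_s: float) -> float:
    """Bandwidth-only transfer time in microseconds (latency not modelled)."""
    return num_bytes / (gb_per_s * 1e9) * 1e6


if __name__ == "__main__":
    for label, tokens in (("one generated token", 1), ("512-token prompt batch", 512)):
        size = boundary_bytes(tokens)
        print(f"{label}: {size / 1024:.0f} KiB, "
              f"{transfer_us(size, PCIE3_GB_PER_S):.1f} us over PCIe 3.0 x16")
```

Under these assumptions a generated token moves 16 KiB across the boundary, which takes about a microsecond even on the slowest link in the table.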

Tensor parallelism (every-layer split)

Both GPUs process every transformer layer simultaneously. Each holds half the weight matrices, computes in parallel, then synchronizes partial results via an all-reduce after each layer.

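Llama 3.3 70B has 80 transformer layers, so every forward pass involves at least 80 all-reduce synchronizations. How much each one carries depends on the phase: during prompt processing the payload is the whole batch’s activations, 8–32 MB at float16 for chunks of 512–2,048 tokens, while during token-by-token generation it shrinks to about 16 KB per sequence and link latency matters more than raw bandwidth. This is where the interconnect matters.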

At NVLink 3.0 speeds (56.25 GB/s per direction), those synchronizations clear quickly. Over PCIe 4.0 (32 GB/s per direction), the link starts saturating at longer context lengths. At 4k context, typical for most interactive use, dual RTX 4090s over PCIe 4.0 with tensor parallelism achieve roughly 85–90% of the throughput an NVLink-class link would allow (necessarily an estimate, since the 4090 cannot be bridged). At 32k+ context, the penalty grows.
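A rough way to see how the all-reduce traffic scales is to total the bytes per forward pass and divide by the link speed. The Python sketch below assumes one all-reduce per layer as described above, float16 activations, a hidden size of 8,192, and a two-GPU all-reduce in which each card sends about one payload's worth of data. Real NCCL behaviour, overlap with compute, and link latency are not modelled, and the function names are illustrative.

```python
LAYERS = 80
HIDDEN_SIZE = 8192   # assumed from the Llama 3.3 70B model config
BYTES_FP16 = 2
LINKS_GB_PER_S = {"NVLink 3.0": 56.25, "PCIe 4.0 x16": 32.0}  # per direction


def allreduce_payload_bytes(tokens: int) -> int:
    """Activation bytes synchronized in one all-reduce for a batch of tokens."""
    return tokens * HIDDEN_SIZE * BYTES_FP16


def sync_ms_per_pass(tokens: int, gb_per_s: float, per_layer: int = 1) -> float:
    """Bandwidth-only time spent in all-reduces for one forward pass, in ms."""
    total_bytes = allreduce_payload_bytes(tokens) * LAYERS * per_layer
    return total_bytes / (gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    for tokens in (1, 512, 2048):
        mib = allreduce_payload_bytes(tokens) / 2**20
        times = ", ".join(f"{name}: {sync_ms_per_pass(tokens, bw):.2f} ms"
                          for name, bw in LINKS_GB_PER_S.items())
        print(f"{tokens:>4} tokens -> {mib:6.3f} MiB per all-reduce; {times}")
```

With these assumptions, the 8–32 MB payloads correspond to batches of 512–2,048 tokens, and a 512-token pass spends about 12 ms synchronizing over NVLink 3.0 versus about 21 ms over PCIe 4.0. Pass per_layer=2 to model implementations that synchronize after both the attention and MLP blocks.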

llama.cpp multi-GPU setup in 2026

llama.cpp has two distinct paths for multi-GPU.

Layer split (default, widely supported):

llama-server \
  --model Meta-Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  --tensor-split 0.5,0.5 \
  --n-gpu-layers 999 \
  --ctx-size 4096 \
  --port 8080

The --tensor-split 0.5,0.5 divides layers evenly across both GPUs. Adjust the ratio if cards have different VRAM—e.g., 0.6,0.4 for a 24GB + 16GB pair. Ollama uses this same mechanism automatically when multiple CUDA GPUs are detected.
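For mismatched cards, the ratio can be derived from each card's VRAM. A small Python sketch, with a helper name invented for illustration, that produces the comma-separated string to pass to --tensor-split:

```python
def tensor_split(vram_gb: list[float]) -> str:
    """Return a --tensor-split value proportional to each GPU's VRAM."""
    total = sum(vram_gb)
    if total <= 0:
        raise ValueError("at least one GPU must report VRAM")
    return ",".join(f"{v / total:.2f}" for v in vram_gb)


if __name__ == "__main__":
    print(tensor_split([24, 24]))  # matched pair
    print(tensor_split([24, 16]))  # 24 GB + 16 GB pair
```

This is only a starting point: VRAM already in use by a display or another process on one card may call for shifting the ratio away from it.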

True tensor parallelism (merged April 2026, build b8738+):

Mainline llama.cpp gained real tensor parallelism in April 2026 via build b8738, using NCCL (NVIDIA) or RCCL (AMD) for topology-aware communication. It auto-detects NVLink vs PCIe and adjusts synchronization accordingly.

# Build with NCCL tensor parallel support
cmake .. \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_DMMV=OFF \
  -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128
make -j$(nproc)

# Launch with tensor parallelism across 2 GPUs
./llama-server --model model.gguf --tp 2 -ngl 999 --ctx-size 4096

Benchmarks from the PR show 3–4× gains over layer split—but that headline applies primarily to MoE architectures (Qwen3 MoE, Llama 4 MoE), where expert routing creates uneven layer utilization that hurts pipeline parallelism. For dense models like Llama 3.3 70B, the improvement over layer split is more modest: 1.3–1.6× in typical benchmarks.

vLLM: the high-concurrency path

vLLM handles multi-GPU via tensor parallelism as the default:

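# --quantization awq expects an AWQ-quantized checkpoint; the base
# meta-llama/Llama-3.3-70B-Instruct weights are unquantized and too large for 48 GB.
# Set AWQ_MODEL to the AWQ repo or local path you use.
vllm serve "$AWQ_MODEL" \
  --tensor-parallel-size 2 \
  --quantization awq \
  --dtype float16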

vLLM probes the NCCL topology on startup and automatically uses NVLink-optimized all-reduce on bridged cards, falling back to PCIe-routed NCCL otherwise. For single-user local inference, both paths produce comparable results. For multi-user serving (8+ concurrent requests), pipeline parallelism via --pipeline-parallel-size 2 often outperforms tensor parallel on PCIe—it avoids the all-reduce overhead entirely, with each GPU acting as an independent pipeline stage.

The full concurrency breakdown is in vLLM vs Ollama in 2026: When Each One Wins.

Real performance: what to expect

Llama 3.3 70B Q4_K_M needs ~43 GB of VRAM—here’s what the main consumer configurations look like:

| Configuration | Total VRAM | Llama 3.3 70B Q4 tok/s | Power draw |
|---|---|---|---|
| Single RTX 4090 | 24 GB | ~8–10 tok/s (Q2 only, quality loss) | 450W |
| Single RTX 5090 | 32 GB | ~15–18 tok/s (Q3 max) | 575W |
| Dual RTX 3090 NVLink | 48 GB | 15–20 tok/s | ~700W combined |
| Dual RTX 3090 PCIe 4.0 | 48 GB | ~10–14 tok/s (est.) | ~700W combined |
| Dual RTX 4090 PCIe 4.0 | 48 GB | ~28–40 tok/s (est.) | ~900W combined |

The dual RTX 3090 NVLink numbers are the best-benchmarked: 15–20 tok/s for Llama 3.3 70B Q4_K_M in llama.cpp layer-split mode, confirmed across multiple community setups. NVLink makes a meaningful difference here over PCIe. It does not merge the two cards into one memory pool (llama.cpp still assigns layers to each GPU), but the direct peer-to-peer path avoids the round trip through the host, and the higher inter-GPU bandwidth helps at longer contexts.

The dual RTX 4090 PCIe estimate is derived from the per-card memory bandwidth advantage (1,008 GB/s on the 4090 vs 936 GB/s on the 3090). That edge is only about 8%, so treat the range as loose: published benchmarks for this exact configuration vary by software stack. If you’re evaluating this setup, budget for the lower end of that range with PCIe 4.0.
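A quick sanity check on any of these figures is the memory-bandwidth ceiling: generating one token requires streaming every weight once, so tokens per second cannot exceed bandwidth divided by model size. The Python sketch below uses the 43 GB and per-card bandwidth figures from this article; the function name and the two-mode simplification are assumptions made for illustration, and real throughput lands below these ceilings.

```python
MODEL_GB = 43.0  # Llama 3.3 70B Q4_K_M, figure used in this article
MEM_BW_GB_PER_S = {"RTX 3090": 936.0, "RTX 4090": 1008.0}


def decode_ceiling_tok_s(model_gb: float, mem_bw_gb_s: float, gpus_active: int = 1) -> float:
    """Bandwidth-bound upper limit on tokens per second.

    gpus_active=1 models layer split (cards take turns, so the full model is
    streamed at one card's bandwidth); gpus_active=2 models ideal tensor
    parallelism (both cards stream their half at the same time).
    """
    return mem_bw_gb_s * gpus_active / model_gb


if __name__ == "__main__":
    for card, bw in MEM_BW_GB_PER_S.items():
        layer = decode_ceiling_tok_s(MODEL_GB, bw)
        tensor = decode_ceiling_tok_s(MODEL_GB, bw, gpus_active=2)
        print(f"dual {card}: layer split <= {layer:.1f} tok/s, "
              f"ideal tensor parallel <= {tensor:.1f} tok/s")
```

Under these assumptions the layer-split ceiling is about 21.8 tok/s for a 3090 pair and 23.4 tok/s for a 4090 pair, consistent with the 15–20 tok/s measured on bridged 3090s. The ideal tensor-parallel ceilings are roughly double, at about 43.5 and 46.9 tok/s.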

For 30B-range models, multi-GPU provides diminishing returns:

| Config | Qwen3 32B Q4 tok/s | Notes |
|---|---|---|
| Single RTX 4090 | 35–40 tok/s | Fits in 24 GB, no inter-GPU overhead |
| Single RTX 5090 | 50–60 tok/s | 32 GB gives substantial headroom |
| Dual RTX 3090 NVLink | ~38–46 tok/s | Overkill on VRAM, minor throughput gain |

A single RTX 4090 handles Qwen3 32B Q4 cleanly within its 24 GB. Adding a second GPU for a 30B model increases power consumption by ~350W for a marginal speed bump. The full VRAM picture across quantization levels is in How Much VRAM Do You Need for Llama Models.

What multi-GPU actually costs you

The benchmark numbers are the easy part.

Thermal density. Two RTX 3090s each run at 350W TDP. In a standard ATX mid-tower, even well-ventilated cards end up sharing hot exhaust air—particularly when open-air designs stack their cooling zones. Most people with 24/7 dual-GPU inference setups end up either undervolting to reduce heat (which drops power draw to ~280W per card with minimal performance loss) or moving to a server chassis with proper linear airflow.

PCIe slot requirements. Two full-size cards in adjacent x16 slots end up with one slot of clearance or less between them, so the first card’s exhaust blows directly into the second card’s intake. A board that leaves two empty slots between the cards fixes this; check the spacing before buying a board for a dual-GPU build.

Power supply. Two RTX 3090s or 4090s need a 1,000W+ PSU with the right connector count. The RTX 4090 uses a 16-pin 600W connector; two of them plus CPU and storage easily pushes 1,000–1,200W system draw. Our PSU sizing guide for AI workstations has the exact calculation.
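As a back-of-the-envelope version of that sizing, the Python sketch below adds up component draw and applies headroom. The GPU TDPs are the ones quoted in this article; the 150W CPU figure, the 100W allowance for everything else, and the 20% headroom are placeholder assumptions to replace with your own parts, and the function name is illustrative.

```python
def psu_estimate(gpu_tdp_w: int, gpu_count: int = 2, cpu_w: int = 150,
                 other_w: int = 100, headroom: float = 0.2) -> tuple[int, int]:
    """Return (peak system load, suggested PSU rating) in watts.

    cpu_w, other_w, and headroom are assumptions, not measured values.
    """
    load = gpu_tdp_w * gpu_count + cpu_w + other_w
    return load, round(load * (1 + headroom))


if __name__ == "__main__":
    for card, tdp in (("RTX 3090", 350), ("RTX 4090", 450)):
        load, rating = psu_estimate(tdp)
        print(f"dual {card}: ~{load}W load, ~{rating}W PSU with 20% headroom")
```

With those placeholder numbers a dual RTX 4090 build lands at about 1,150W of load, inside the 1,000–1,200W range above. Power-limiting the cards lowers gpu_tdp_w and the resulting rating accordingly.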

Motherboard compatibility for NVLink. The NVLink bridge requires two full-size PCIe slots at exactly two-slot or three-slot spacing. Not all ATX boards accommodate this. X570 and B550 boards vary—verify the physical slot layout against the bridge dimensions before purchasing. The bridge itself comes in 2-slot and 3-slot variants.

Software setup time. Layer split in Ollama or llama.cpp is plug-and-play—Ollama detects both GPUs and distributes automatically. Tensor parallelism in the April 2026 llama.cpp build still has known issues with some ROCm combinations and certain GGUF quantization formats. Expect to spend an afternoon debugging if you deviate from the standard CUDA + GGUF path.

The single-device alternative worth considering

For 70B inference specifically, a Mac Studio M3 Ultra with 96 GB unified memory is worth putting in the comparison. It runs Llama 3.3 70B at 4-bit via MLX with one box, one power cable, and zero inter-GPU configuration. Expect throughput in the same range as dual RTX 3090 NVLink rather than well above it: the M3 Ultra’s 819 GB/s memory bandwidth caps a ~43 GB dense model at roughly 19 tok/s.

The trade-off is real: no CUDA ecosystem, image generation (Flux, SDXL) runs 3–5× slower than on an NVIDIA card, and you’re locked into Apple’s hardware cadence. For text-only inference workflows, it’s a legitimate single-device competitor to dual-GPU. We covered the full hardware comparison in Mac Studio M3 Ultra vs Dual RTX 4090.

For occasional 70B jobs without the permanent hardware commitment, RunPod community pods offer dual RTX 4090 instances at ~$0.54/hr. That’s useful for testing a 70B model’s behavior before deciding whether the build is worth it.

Honest take: who should actually go multi-GPU

Build a dual-GPU system if:

  • Your primary model is 70B or larger at Q4 quality (43+ GB VRAM required)
  • You’re running multi-user inference—the extra VRAM dramatically expands the KV cache you can maintain per user, which matters at 8+ concurrent sessions
  • You already own one RTX 3090 or 4090 and can add a second one for under $700 (used 3090) or under $2,400 (used 4090)
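The KV cache point in the list above can be estimated directly. The Python sketch below assumes Llama 3.3 70B's published attention layout (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and a float16 cache. The 1 GB reserve for buffers is a guess, and the function names are illustrative.

```python
LAYERS = 80
KV_HEADS = 8      # grouped-query attention: 8 key/value heads (assumed from model config)
HEAD_DIM = 128
BYTES_FP16 = 2    # assumption: unquantized float16 KV cache


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 2 = keys + values


def max_cached_tokens(total_vram_gb: float, weights_gb: float, reserve_gb: float = 1.0) -> int:
    """Tokens of context that fit in the VRAM left after weights and a reserve."""
    free_bytes = (total_vram_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token()))


if __name__ == "__main__":
    per_token_kib = kv_bytes_per_token() / 1024
    total = max_cached_tokens(48, 43)
    print(f"{per_token_kib:.0f} KiB per token; {total} tokens fit on a 48 GB pair")
    print(f"shared across 8 sessions: {total // 8} tokens each")
```

Each token costs 320 KiB under these assumptions, so the space left after 43 GB of weights on a 48 GB pair holds roughly 12,000 tokens in total. Frameworks that quantize the KV cache stretch that figure further.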

Stick with single GPU if:

  • Your primary model is 34B or smaller: a single RTX 4090 runs Qwen3 32B and similar 30B-class models cleanly with no offload
  • You’re primarily doing image generation (Flux, SDXL)—these don’t parallelize efficiently across consumer GPUs
  • You’re on an older motherboard with PCIe 3.0 slots—the bandwidth penalty on tensor parallelism is significant enough that the second card delivers less than expected

If you go dual, prefer:

  • Dual RTX 4090 (PCIe) over dual RTX 3090 (NVLink or not): higher per-card memory bandwidth wins on total throughput for inference, and the 3090’s NVLink bridge ($40–$80 used) doesn’t make up the difference in tok/s
  • That said, dual RTX 3090 NVLink at ~$1,400 total (two used 3090s + bridge) is still the cheapest path to a verified 15–20 tok/s on 70B Q4—the used RTX 3090 value case still holds if your budget is tight

NVLink as a consumer feature is essentially finished. The RTX 3090 pair is the last consumer configuration where it exists, and within three to four years those cards will age out of relevance as mainstream models push past what 48 GB can hold. Plan your multi-GPU build around PCIe parallelism—it performs well enough that the interconnect is not your bottleneck for inference workloads.

Last updated May 21, 2026. GPU prices and availability shift weekly; verify current listings before purchasing.

