Multi-GPU for Local AI in 2026: NVLink vs PCIe and When a Second Card Actually Helps

If you are researching multi-GPU setups for local AI and NVLink keeps coming up, here is the short version first: among current and recent consumer GPUs, NVLink is only available on the RTX 3090. The RTX 4090, RTX 5090, and every other Ada or Blackwell GeForce card do not support it. If you are running any of those, you are doing multi-GPU over PCIe whether you know it or not.

That is not necessarily a problem. But it does change what you should expect, how you should configure your software, and whether adding a second card is worth it at all. This guide covers all three questions with verified numbers.

NVLink is NVIDIA’s proprietary high-bandwidth GPU-to-GPU interconnect. On data center hardware it provides extraordinary bandwidth: 600 GB/s on A100s, 900 GB/s on H100s. On consumer hardware the story is much simpler: NVIDIA last shipped NVLink on a consumer GPU in the Ampere generation (2020–2021), then removed it entirely.

Here is the full consumer NVLink support table:

| GPU | Architecture | NVLink support | Bandwidth |
|---|---|---|---|
| RTX 2080 Ti | Turing | Yes (NVLink 2.0) | 100 GB/s |
| RTX 3090 | Ampere | Yes (NVLink 3.0) | 112.5 GB/s |
| RTX 3090 Ti | Ampere | No | n/a |
| RTX 4070 Ti / 4080 / 4090 | Ada Lovelace | No | n/a |
| RTX 5060 Ti / 5070 / 5080 / 5090 | Blackwell | No | n/a |
| RTX PRO 6000 Blackwell | Blackwell (workstation) | Yes (NVLink 5.0) | 1,800 GB/s |

The RTX 3090 Ti, announced the same generation, did not include the NVLink connector — making the base RTX 3090 the last consumer card with it. The RTX 4090 dropped NVLink entirely; NVIDIA stated it used the freed space for additional AI processing circuitry. The RTX 5090 and the rest of the 50-series continue that pattern.

What this means practically: if you want NVLink in a home lab, your only realistic option is a pair of used RTX 3090s with an NVLink bridge. Everything else is PCIe.

The bandwidth reality

To understand what this costs in performance, here are the numbers:

| Interconnect | Bandwidth (bidirectional) | Typical home-lab hardware |
|---|---|---|
| PCIe 4.0 x16 | 64 GB/s | Most AMD and Intel desktop platforms |
| PCIe 5.0 x16 | 128 GB/s | Z790, X670E, AM5 with Gen 5 slot |
| NVLink 3.0 (RTX 3090 pair) | 112.5 GB/s | RTX 3090 + NVLink bridge |
| NVLink 3.0 (A100 pair) | 600 GB/s | Data center, out of home-lab budget |
| NVLink 4.0 (H100 pair) | 900 GB/s | Data center |

One important detail for dual-GPU desktop builds: when you install two cards in a typical consumer motherboard, each card gets x8 PCIe lanes rather than x16, because the CPU’s PCIe lanes are split between slots. On PCIe 4.0, x8 = 32 GB/s bidirectional. On PCIe 5.0, x8 = 64 GB/s bidirectional.
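
The lane arithmetic above can be sketched in a few lines of Python. This is a rough calculator, not a measurement tool: the per-lane figures are the rounded values implied by the table (2 GB/s per lane per direction for Gen 4, 4 GB/s for Gen 5), and the function name is just an illustrative choice.

```python
# Approximate bandwidth per lane, per direction, in GB/s.
# Rounded figures implied by the article's table (Gen 4 x16 = 64 GB/s bidirectional).
PER_LANE_GBPS = {4: 2.0, 5: 4.0}


def pcie_bidirectional_gbps(gen: int, lanes: int) -> float:
    """Return the approximate bidirectional bandwidth of a PCIe link in GB/s."""
    if gen not in PER_LANE_GBPS:
        raise ValueError(f"unsupported PCIe generation: {gen}")
    if lanes not in (1, 2, 4, 8, 16):
        raise ValueError(f"unsupported lane count: {lanes}")
    # Two directions, each carrying lanes * per-lane bandwidth.
    return 2 * lanes * PER_LANE_GBPS[gen]


for gen, lanes in [(4, 16), (4, 8), (5, 16), (5, 8)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_bidirectional_gbps(gen, lanes):.0f} GB/s")
```

Real-world throughput lands below these theoretical figures, so treat the output as an upper bound when comparing against the 112.5 GB/s NVLink bridge.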

GPU-to-GPU communication over PCIe also routes through the CPU memory controller — data moves from GPU 0 → CPU → GPU 1 — which adds latency that direct NVLink connections avoid entirely. The RTX 3090’s NVLink bridge is a direct GPU-to-GPU connection at 112.5 GB/s with no CPU hop.

For tensor-parallel inference, where each token processed requires all-reduce operations between GPUs, that bandwidth gap translates directly into throughput. Benchmarks from a 4x RTX 3090 cluster found NVLink improves inference throughput by approximately 50% for 2-GPU tensor-parallel pairs, and around 10% for 4-GPU setups where only half of GPU pairs are bridged and the rest communicate over PCIe.

When a second GPU actually helps — and when it makes things worse

Adding a second GPU is not always an upgrade. The outcome depends entirely on the relationship between model size and your GPU’s VRAM.

Scenario 1: Model doesn’t fit on one card. If you are trying to run Llama 3.3 70B Q4 (requires ~42 GB) on a single RTX 4090 (24 GB), the model simply cannot load. A second 4090 brings you to 48 GB total and the model runs. In this case, the second card is not optional — it is a requirement.
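
The fit check in Scenario 1 is simple enough to sketch. The snippet below is a minimal estimate, assuming identical cards; `overhead_gb` is a hypothetical per-GPU reservation for KV cache and runtime buffers that you would tune for your own context length, not a figure from any benchmark.

```python
import math


def gpus_needed(model_gb: float, vram_per_gpu_gb: float, overhead_gb: float = 0.0) -> int:
    """Smallest number of identical GPUs whose pooled VRAM can hold the model.

    overhead_gb is an assumed per-GPU reservation (KV cache, buffers).
    """
    usable = vram_per_gpu_gb - overhead_gb
    if usable <= 0:
        raise ValueError("overhead leaves no usable VRAM per GPU")
    return math.ceil(model_gb / usable)


# Llama 3.3 70B Q4 (~42 GB) on 24 GB cards, ignoring overhead.
print(gpus_needed(42, 24))
# The same model with 4 GB per card reserved for context and buffers.
print(gpus_needed(42, 24, overhead_gb=4))
```

With no overhead the answer is two cards, matching the scenario above; reserving several gigabytes per card for long contexts can push the requirement higher.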

Scenario 2: Model fits on one card, you add a second anyway. This is where people get surprised. If you are running Ollama with a 14B model that fits comfortably in 24 GB of VRAM, Ollama will automatically detect your second GPU and split layers across both cards. The result, counterintuitively, is slower inference, because every token now requires PCIe data transfers between cards that were not necessary when the model lived on one GPU. Ollama’s official documentation confirms this behavior: a second GPU accelerates large models that require VRAM pooling; it hurts small models that would otherwise run fully on one card.

Scenario 3: High-concurrency serving. If you are running vLLM and serving 10+ simultaneous users, tensor parallelism across two GPUs can roughly double throughput compared to a single-GPU setup, because both GPUs work on each request in parallel. The PCIe overhead is amortized across many concurrent requests. This is the use case where PCIe multi-GPU genuinely earns its keep even without NVLink.

The decision matrix:

| Situation | Add second GPU? | Reasoning |
|---|---|---|
| 70B+ model, single GPU too small | Yes, required | VRAM pooling is the only path |
| Personal use, <14B models | No, makes it slower | PCIe overhead > compute gain |
| vLLM serving, 10+ concurrent users | Yes | Throughput scales well |
| Fine-tuning / QLoRA | Cloud instead | See cloud GPU math |
| Ollama, model fits on one card | No | Ollama adds overhead, not speed |
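
For readers who prefer the matrix as logic, here is one way to encode it. This is a simplification of the table, not a rule from any vendor: the function name, the argument names, and the order in which the checks run are all illustrative assumptions.

```python
def second_gpu_advice(fits_on_one_gpu: bool,
                      concurrent_users: int = 1,
                      fine_tuning: bool = False) -> str:
    """Return the decision-matrix answer for a given situation."""
    if fine_tuning:
        # Fine-tuning / QLoRA row: rent cloud GPUs instead of buying.
        return "cloud instead"
    if not fits_on_one_gpu:
        # The model cannot load without pooled VRAM.
        return "yes, required"
    if concurrent_users >= 10:
        # High-concurrency serving amortizes the PCIe overhead.
        return "yes"
    # Model fits on one card and load is light: a second card adds overhead.
    return "no"


print(second_gpu_advice(fits_on_one_gpu=False))
print(second_gpu_advice(fits_on_one_gpu=True, concurrent_users=12))
print(second_gpu_advice(fits_on_one_gpu=True))
```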

For home-lab users who specifically want NVLink, a dual RTX 3090 build is the only practical path. Two used RTX 3090s connected with an NVLink bridge give you:

  • 48 GB combined VRAM: enough for Llama 3.3 70B at Q4_K_M with context headroom
  • 112.5 GB/s GPU-to-GPU bandwidth: roughly 3.5× the 32 GB/s of a PCIe 4.0 x8 link
  • Approximately 50% higher throughput than the same two 3090s without NVLink in tensor-parallel configurations

Hardware required:

  • Two RTX 3090 cards (NOT 3090 Ti — that card has no NVLink connector)
  • One NVIDIA NVLink Bridge 4-slot (ASIN B08S1RYPP6 on Amazon, also available from Newegg). Originally $79 MSRP; as of May 2026, available on Amazon and eBay in the $50–80 range
  • A motherboard with two PCIe x16/x8 slots with sufficient slot spacing for the 4-slot bridge

The thermal reality: Two RTX 3090s at full inference load draw approximately 350W each, putting the combined GPU power draw at ~700W. The NVLink bridge sits between the cards and blocks airflow between them. A dual-3090 NVLink rig almost always requires aftermarket cooling: an open-air case, additional case fans directly above the GPU stack, or liquid cooling. The dual RTX 3090 cooling problem is well documented and has to be addressed. Plan the power supply accordingly: a 1200W+ PSU is prudent.
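
As a sanity check on the PSU figure, here is a back-of-the-envelope sizing sketch. The 250W platform allowance (CPU, drives, fans) and the 20% headroom are assumptions chosen for illustration, not measured values; only the 350W per-card draw comes from the paragraph above.

```python
def recommended_psu_watts(gpu_watts: list[int],
                          platform_watts: int = 250,   # assumed CPU/board/drive draw
                          headroom_pct: int = 20,      # assumed safety margin
                          step: int = 100) -> int:
    """Round the estimated load plus headroom up to the next PSU size step."""
    load = sum(gpu_watts) + platform_watts
    # Integer ceiling of load * (1 + headroom), avoiding float rounding.
    with_headroom = (load * (100 + headroom_pct) + 99) // 100
    # Round up to the next multiple of `step`.
    return ((with_headroom + step - 1) // step) * step


print(recommended_psu_watts([350, 350]))
```

Under these assumptions two 350W cards land on a 1200W unit. Transient power spikes on high-end cards can exceed the rated draw, so erring upward is reasonable.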

For more context on the RTX 3090’s value proposition individually, see Used RTX 3090 in 2026: Still the AI Value King?

Multi-GPU over PCIe: dual RTX 4090 and beyond

For the majority of multi-GPU home-lab builds in 2026 — dual RTX 4090, dual RTX 5090, any combination without NVLink — PCIe is the interconnect. Here is what to expect.

Dual RTX 4090 running Llama 3.3 70B Q4: approximately 25–30 tokens/sec generation speed with vLLM tensor parallelism. A single RTX 4090 cannot run this model at all (insufficient VRAM), so the comparison is not “25 tok/s vs 50 tok/s” but rather “25 tok/s vs not running.” That framing matters for the buying decision.

Scaling efficiency: PCIe tensor parallelism for a 70B model runs at roughly 0.70× scaling efficiency — meaning two cards produce about 70% of what ideal 2× linear scaling would predict. The remaining 30% is inter-GPU communication overhead over PCIe. This is the realistic expectation, not a problem to fix.
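
The scaling-efficiency figure converts to an effective speedup with one multiplication. The sketch below only restates that arithmetic; the function names are illustrative, and the 0.70 value is the rough figure quoted above rather than something the code measures.

```python
def effective_speedup(num_gpus: int, efficiency: float) -> float:
    """Speedup over one GPU when scaling runs at the given efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return num_gpus * efficiency


def communication_overhead(efficiency: float) -> float:
    """Fraction of ideal linear scaling lost to inter-GPU communication."""
    return 1 - efficiency


speedup = effective_speedup(2, 0.70)
print(f"2 GPUs at 0.70 efficiency: {speedup:.2f}x one card")
print(f"Lost to communication: {communication_overhead(0.70):.0%}")
```

In other words, two cards over PCIe behave like about 1.4 cards' worth of compute for this workload, which is still the difference between running a 70B model and not running it.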

PCIe lane topology matters, but how much depends on the workload. Published benchmarks show less than 2% performance difference between x16 and x8 slots for inference without inter-GPU traffic: once a model is loaded, all computation happens in VRAM and PCIe traffic is minimal. For tensor-parallel configurations, where every token triggers all-reduce operations across GPUs, x8 vs x16 shows a 20–40% throughput difference in intensive model-parallel workloads. The guidance: for dual-GPU inference, x8/x8 is fine; avoid x4, which creates a real bottleneck during inter-GPU all-reduces.

For electricity and PSU planning with high-TDP dual-GPU setups, see Power Bill Math: True Cost of Running a 24/7 AI Server at Home and PSU Sizing for AI Workstations 2026.

Software setup

llama.cpp

llama.cpp supports multi-GPU via the --tensor-split flag. The flag takes a comma-separated list of proportions matching each GPU’s share of the model:

# Equal split across two GPUs
./llama-cli -m ./model.gguf \
  --tensor-split 1,1 \
  -n 512 --prompt "Your prompt here"

# Weighted split for unequal VRAM (e.g., 24GB + 16GB)
./llama-cli -m ./model.gguf \
  --tensor-split 3,2 \
  -n 512 --prompt "Your prompt here"
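
If you would rather derive the proportions than eyeball them, a small helper can reduce VRAM sizes to the smallest whole-number ratio. This is a convenience sketch; `tensor_split_arg` is a hypothetical name, and it assumes you want the split proportional to total VRAM per card.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_gb: list[int]) -> str:
    """Reduce per-GPU VRAM sizes to the smallest integer ratio for --tensor-split."""
    if not vram_gb or any(v <= 0 for v in vram_gb):
        raise ValueError("expected a non-empty list of positive VRAM sizes")
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(v // divisor) for v in vram_gb)


# 24 GB + 16 GB reduces to the 3,2 split used in the command above.
print("--tensor-split", tensor_split_arg([24, 16]))
# Matched cards reduce to an equal split.
print("--tensor-split", tensor_split_arg([24, 24]))
```

If one card also drives a display or holds other workloads, you may want to shift the ratio away from it rather than use the raw VRAM sizes.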

To reduce inter-GPU communication overhead, --split-mode layer (the default) assigns whole layers to individual GPUs (pipeline parallelism) rather than splitting each layer across GPUs (tensor parallelism, selected with --split-mode row). Pipeline mode requires less bandwidth but has higher latency per token.

vLLM

vLLM’s tensor parallelism is a single flag:

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192

vLLM requires matched GPUs (same model, same VRAM) for tensor parallelism. It uses NCCL under the hood, which means it utilizes full NVLink bandwidth if available, or falls back to PCIe. For non-NVLink multi-GPU consumer setups, vLLM’s own documentation recommends pipeline parallelism (--pipeline-parallel-size 2) instead of tensor parallelism for lower communication overhead when cards lack a high-bandwidth interconnect.
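
One way to encode that recommendation is a small launcher helper that picks the parallelism flag based on whether the cards are bridged. This is a sketch under stated assumptions: `vllm_serve_command` and its `has_nvlink` argument are hypothetical names, and only the three flags already shown in this section are used. It builds the command but does not run it.

```python
import shlex


def vllm_serve_command(model: str, num_gpus: int, has_nvlink: bool,
                       max_model_len: int = 8192) -> list[str]:
    """Build a `vllm serve` argv, choosing tensor or pipeline parallelism."""
    # Bridged cards: tensor parallelism. PCIe-only cards: pipeline parallelism.
    flag = "--tensor-parallel-size" if has_nvlink else "--pipeline-parallel-size"
    return ["vllm", "serve", model,
            flag, str(num_gpus),
            "--max-model-len", str(max_model_len)]


cmd = vllm_serve_command("meta-llama/Llama-3.3-70B-Instruct", 2, has_nvlink=False)
print(shlex.join(cmd))
```

Treat the choice as a starting point rather than a rule: under heavy concurrent load, tensor parallelism over PCIe can still come out ahead, so it is worth benchmarking both on your own hardware.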

Ollama

Since v0.4, Ollama automatically detects all NVIDIA and AMD GPUs and distributes model layers across them, weighted by VRAM capacity. There are no flags to set. The important caveat: Ollama only does pipeline-style layer splitting, not tensor parallelism. And as noted above, if a model fits on a single GPU, forcing it onto two cards via Ollama degrades performance, because the layer-transfer overhead over PCIe costs more than the parallelism gains.

To force Ollama to use only one GPU when you have two installed, set CUDA_VISIBLE_DEVICES=0 in the environment before starting the Ollama service.
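
If you launch Ollama from a script rather than a system service, the same restriction can be applied programmatically. The sketch below assumes the `ollama` binary is on your PATH; the helper names are illustrative, and nothing is started until you call `start_ollama_on_gpu` yourself.

```python
import os
import subprocess
from typing import Optional


def single_gpu_env(gpu_index: int = 0, base_env: Optional[dict] = None) -> dict:
    """Copy an environment and restrict CUDA to a single GPU index."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def start_ollama_on_gpu(gpu_index: int = 0) -> subprocess.Popen:
    """Start `ollama serve` so it only sees the chosen GPU."""
    return subprocess.Popen(["ollama", "serve"], env=single_gpu_env(gpu_index))


# Example (not executed here): server = start_ollama_on_gpu(0)
```

When Ollama runs as a background service instead, the variable has to be set in the service's own environment, since a value exported in your shell will not reach it.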

Decision guide: who actually benefits from multi-GPU

Before spending $700–$2,400 on a second card, run through this:

| Your situation | Recommendation |
|---|---|
| Running 70B models, single GPU has insufficient VRAM | Two matched cards required; vLLM tensor parallel |
| Running 70B models, want NVLink | Dual used RTX 3090 + NVLink bridge (see RTX 3090 used market guide) |
| Running 70B models, have two RTX 4090s or 5090s | PCIe is fine; 25–30 tok/s is realistic |
| Personal use, <30B models | Single high-VRAM card better value; don’t split across GPUs |
| Multi-user production serving (10+ concurrent) | vLLM tensor parallel over PCIe is worthwhile |
| Fine-tuning / QLoRA runs | Cloud GPU wins on total cost; see cloud math |

The honest take: Most home-lab users running multi-GPU are doing it because a single card does not have enough VRAM — not because they are optimizing throughput. That is a completely valid reason. But if your models fit on a single card, the second card often slows Ollama down and only helps vLLM under concurrent load. Know which scenario you are actually in before the purchase.

NVLink is not worth chasing for new builds in 2026. The RTX 3090 is a 2020 GPU; its NVLink advantage is real but paired with older architecture, higher power draw, and the additional thermal complexity of bridging two 350W cards. If your workload needs 48 GB, a dual RTX 3090 NVLink rig is still a legitimate option — particularly at used prices ranging from ~$680 to ~$1,000 per card on eBay in May 2026 (the market fluctuates — check the used RTX 3090 guide for current data). If your workload needs more than 48 GB, you are looking at cloud time on RunPod or purpose-built workstation GPUs.

For the broader GPU selection picture, the GPU buying guide for local AI covers every budget tier from $300 to $3,000+. For choosing between local hardware and cloud rental, see RunPod vs Local GPU: When to Rent and When to Buy.

Last updated May 21, 2026. GPU prices and used-market availability fluctuate; verify current rates before purchasing.

