May 19, 2026

AMD ROCm 7.2 on Windows in 2026: Tested on RDNA 3 & 4 (Real Results)

By RunAIHome Team · 14 min read

The “AMD doesn’t work for AI” reputation has been baked into home lab advice since ROCm 5. Builders pass on Radeon cards at every price point, even when the specs look compelling, because they’ve been burned before — missing drivers, environment variable hacks, tools that refuse to start on an AMD card even when the documentation says they should.

ROCm 7.2, released January 21, 2026, changed the story enough to revisit it. The question isn’t whether AMD works at all anymore. The question is: works for whom, on which tools, and compared to what?

The short answer: AMD ROCm is genuinely usable on Linux in 2026, meaningfully limited on Windows for anything older than RDNA 4, and still roughly 1.5× behind NVIDIA’s RTX 4090 on raw inference speed. Whether that trade-off is acceptable depends on what you’re building and where you run it.

What ROCm 7.2 actually changed

ROCm 7 was AMD’s attempt to stop being a second-class citizen in local AI. Version 7.2 brought three changes that matter for home lab users:

Unified Windows and Linux release. Starting with ROCm 7.2, highlighted at CES 2026, AMD ships one release package for both platforms. Previously, the Windows and Linux SDK builds diverged in feature coverage and release timing. Unifying them signals that AMD is treating Windows as a supported platform, not an afterthought.

RDNA 4 added to the official consumer GPU list. ROCm 7.2 officially supports the Radeon RX 9070, RX 9070 XT, RX 9060 XT LP, and Radeon AI PRO R9600D. “Official” means AMD tests these in CI, they get day-0 support for new ROCm releases, and you install ROCm without environment variable overrides to get them working.

Native vLLM and llama.cpp integration. Pre-built vLLM wheels for ROCm 7.2.1 are available and maintained. AMD’s ROCm documentation now includes official llama.cpp installation instructions. These aren’t community workarounds — AMD engineers contribute directly to both projects.

What ROCm 7.2 did not do: it did not bring the full ROCm stack to RDNA 3 on Windows. The RX 7000 series (gfx1100, gfx1101, gfx1102) remains Linux-only for ROCm. On Windows, the ROCm stack officially supports only gfx1200 and gfx1201 — the RDNA 4 chips in the RX 9000 series.

GPU support in 2026: who’s in, who’s out

AMD’s ROCm compatibility matrix divides consumer cards into two distinct situations:

GPU	Architecture	Linux ROCm	Windows ROCm	Notes
RX 9070 XT / RX 9070	RDNA 4 (gfx1201)	Official	Official	Best ROCm support of any consumer card
RX 7900 XTX / 7900 XT / 7900 GRE	RDNA 3 (gfx1100)	Official	Not officially supported	Linux stable; shares gfx target with PRO W7900
RX 7800 XT / 7700 XT	RDNA 3 (gfx1101)	Supported	Not supported	Added in ROCm 7.2 for Linux
RX 7600	RDNA 3 (gfx1102)	Supported	Not supported	8 GB VRAM limits usefulness for most models
RX 6000 series	RDNA 2 (gfx1030)	Community only	No	Not on AMD’s official ROCm list
iGPUs (780M, 890M, Strix Halo)	RDNA 3/3.5	Partial	Partial	HSA_OVERRIDE_GFX_VERSION still sometimes required

The practical split is: RDNA 4 on Linux or Windows, or RDNA 3 on Linux only. If you’re on Windows and want a real ROCm stack, the RX 9070 and 9070 XT are the only consumer cards in the table above on AMD’s supported list.
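If you want that split in a form you can script against, here is a minimal sketch. The support labels are copied from the table above; the dictionary layout and the function name are our own illustration, not part of any AMD tool.

```python
# Support levels copied from the compatibility table above.
# The structure and names here are illustrative only.
ROCM_SUPPORT = {
    "RX 9070 XT": {"linux": "official", "windows": "official"},
    "RX 9070": {"linux": "official", "windows": "official"},
    "RX 7900 XTX": {"linux": "official", "windows": "not officially supported"},
    "RX 7800 XT": {"linux": "supported", "windows": "not supported"},
    "RX 7600": {"linux": "supported", "windows": "not supported"},
    "RX 6000 series": {"linux": "community only", "windows": "no"},
}

# Labels we treat as "ROCm works without workarounds".
USABLE_LEVELS = {"official", "supported"}


def rocm_usable(card: str, os_name: str) -> bool:
    """Return True if the table lists the card as supported on that OS."""
    entry = ROCM_SUPPORT.get(card)
    if entry is None:
        return False  # unknown card: assume unsupported
    return entry.get(os_name.lower(), "no") in USABLE_LEVELS


for card in ROCM_SUPPORT:
    print(card, "| linux:", rocm_usable(card, "linux"),
          "| windows:", rocm_usable(card, "windows"))
```

Treating "community only" and "partial" as unusable is a conservative choice; loosen USABLE_LEVELS if you are comfortable with workarounds.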

The RX 9070 XT launched at $599 MSRP. As of May 2026, retail prices sit between $629 and $669 depending on board partner and retailer; ASUS quietly raised prices 17.5% in April, though most other vendors haven’t followed. The RX 9070 non-XT starts around $549–$579. At that range, the 9070 XT competes directly with the RTX 4070 ($549) and the RTX 5070 ($549–$599), costing slightly more than either while offering 16 GB of VRAM against the 4070’s 12 GB.

The spec that matters most: bandwidth

LLM inference is memory-bandwidth bound. When your GPU generates a token, it’s reading model weights from VRAM, not executing complex math. Tokens per second scale with how fast you can move weights through memory.

Here’s where the main AMD consumer cards land:

GPU	VRAM	Memory Bandwidth	TDP
RX 9070 XT	16 GB GDDR6	640 GB/s	304 W
RX 9070	16 GB GDDR6	640 GB/s	220 W
RX 7900 XTX	24 GB GDDR6	960 GB/s	355 W
RX 7900 XT	20 GB GDDR6	800 GB/s	300 W
RTX 4090 (NVIDIA)	24 GB GDDR6X	1,008 GB/s	450 W
RTX 4070 Super (NVIDIA)	12 GB GDDR6X	504 GB/s	220 W

The RX 9070 XT’s 640 GB/s is a meaningful step up from the RTX 4070 Super’s 504 GB/s, though the 4070 Super runs quantized matrix-multiply kernels faster through NVIDIA’s dedicated Tensor Cores in ways raw bandwidth doesn’t capture. The RX 7900 XTX at 960 GB/s is close to the RTX 4090’s 1,008 GB/s, but again the Tensor Core advantage means NVIDIA translates that bandwidth more efficiently.
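To see why bandwidth sets the ceiling, here is a rough back-of-the-envelope sketch. It assumes every generated token reads all model weights from VRAM exactly once, and it assumes about 4.9 GB of weights for an 8B model at 4-bit quantization; both are simplifications of ours, not measured figures.

```python
# Memory bandwidth in GB/s, taken from the table above.
BANDWIDTH_GB_S = {
    "RX 9070 XT": 640,
    "RX 7900 XTX": 960,
    "RTX 4090": 1008,
    "RTX 4070 Super": 504,
}


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Assumption: ~4.9 GB of weights (roughly an 8B model at 4-bit quantization).
WEIGHTS_GB = 4.9

for gpu, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{gpu}: theoretical ceiling ~{ceiling:.0f} tok/s")
```

Real throughput lands below this ceiling because of kernel efficiency, KV cache reads, and scheduling overhead, which is where the backend and Tensor Core differences show up.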

Tool compatibility: the real picture

Before committing to AMD, map your tools against what actually works:

Tool	RDNA 4 (Linux)	RDNA 4 (Windows)	RDNA 3 (Linux)	RDNA 3 (Windows)
Ollama	Full GPU accel	Experimental Vulkan (OLLAMA_VULKAN=1)	Full GPU accel	Experimental Vulkan only
llama.cpp	ROCm official	Vulkan backend (works; sometimes faster)	ROCm official	Vulkan backend only
LM Studio	ROCm (v0.3.19+)	Vulkan / OpenCL	ROCm (Linux)	Vulkan / OpenCL
ComfyUI Desktop	ROCm v0.7.0+	Official (v0.7.0+, Jan 2026)	ROCm	No official ROCm
vLLM	ROCm wheels	Docker/Linux containers only	ROCm stable	No
PyTorch	Stable	ROCm 7.2 partial	Stable	Partial
Open WebUI	Full (via Ollama)	Full (via Ollama)	Full (via Ollama)	Full (via Ollama)

For Windows users: The only AMD card with meaningful coverage is RDNA 4. Even then, Ollama on Windows still falls back to experimental Vulkan (OLLAMA_VULKAN=1), LM Studio uses Vulkan or OpenCL, and vLLM requires Docker with Linux containers. ComfyUI Desktop added official ROCm Windows support in January 2026 (v0.7.0) — that’s real progress. But CUDA’s Windows coverage depth is not yet matched.

For Linux users: RDNA 3 and RDNA 4 both work well. The old HSA_OVERRIDE_GFX_VERSION environment variable hack — once required for RDNA 3 discrete GPUs — is no longer needed for officially supported cards under ROCm 7.x. You install ROCm, install Ollama, run your model. The setup gap that plagued AMD two years ago is largely closed on Linux.
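The Windows/Linux difference for Ollama described above can be sketched as a small launcher helper. The OLLAMA_VULKAN=1 variable comes from the compatibility table; the function name and the idea of wrapping the launch in Python are our own illustration, and the launch itself is left commented out.

```python
import os


def ollama_env(os_name: str, base_env=None) -> dict:
    """Build the environment for launching Ollama on an AMD card.

    On Windows, opt in to the experimental Vulkan path described above.
    On Linux, no extra variables are set for officially supported cards.
    """
    env = dict(os.environ if base_env is None else base_env)
    if os_name.lower() == "windows":
        env["OLLAMA_VULKAN"] = "1"
    return env


# Example launch (commented out so the sketch has no side effects):
# import subprocess
# subprocess.Popen(["ollama", "serve"], env=ollama_env("windows"))

print(ollama_env("windows", {}).get("OLLAMA_VULKAN"))
```

The helper copies the base environment rather than mutating it, so the same script can start other tools without the Vulkan flag leaking into them.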

Performance benchmarks: what you actually get

Here’s where the ROCm story gets complicated, because there are really two sub-questions: how does AMD hardware compare to NVIDIA, and which backend (ROCm or Vulkan) should you use on AMD?

RDNA 4 vs NVIDIA:

LocalScore benchmarks put the RX 9070 XT at roughly 90 tok/s on Llama 3.1 8B Q4_K_M — competitive with the RTX 4070 class, well below the RTX 4090’s 135–142 tok/s on the same model. At the 14B tier, the RX 9070 XT delivers approximately 45 tok/s vs the RTX 4090’s 90–104 tok/s.

RDNA 3 vs NVIDIA:

The RX 7900 XTX sits in an interesting position. Its 960 GB/s bandwidth is close to the RTX 4090’s 1,008 GB/s, but NVIDIA’s Tensor Cores for INT4 and FP8 quantized operations give the 4090 a practical advantage. Community llama.cpp ROCm benchmarks place the RX 7900 XTX at 75–98 tok/s on 7B–8B Q4 models on Linux with ROCm, against the RTX 4090’s 135–142 tok/s — roughly a 1.5× gap in NVIDIA’s favor.

Model tier	RX 9070 XT (ROCm/Vulkan)	RX 7900 XTX (ROCm, Linux)	RTX 4090 (CUDA)
8B Q4_K_M	~90 tok/s	~85–98 tok/s	~135–142 tok/s
14B Q4_K_M	~45 tok/s	~55–65 tok/s	~90–104 tok/s
32B Q4_K_M	~20 tok/s	~28–33 tok/s	~45–55 tok/s
70B Q4 (partial offload)	~6–9 tok/s	~12–18 tok/s	~20–26 tok/s

Sources: LocalScore (RX 9070 XT, RTX 4090), community llama.cpp ROCm benchmarks from 1337hero/rx7900xtx-llama-bench-rocm and llama.cpp ROCm discussion threads on GitHub (RX 7900 XTX). Actual performance varies by system configuration and ROCm version.
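To put a number on the gap at each tier, here is a small sketch that compares range midpoints from the table above. Using midpoints is our simplification; the 70B row is left out because partial offload depends heavily on the rest of the system.

```python
# (low, high) tok/s ranges from the benchmark table above.
# Single values are stored as (x, x).
TOK_S = {
    "8B Q4_K_M": {
        "RX 9070 XT": (90, 90),
        "RX 7900 XTX": (85, 98),
        "RTX 4090": (135, 142),
    },
    "14B Q4_K_M": {
        "RX 9070 XT": (45, 45),
        "RX 7900 XTX": (55, 65),
        "RTX 4090": (90, 104),
    },
    "32B Q4_K_M": {
        "RX 9070 XT": (20, 20),
        "RX 7900 XTX": (28, 33),
        "RTX 4090": (45, 55),
    },
}


def midpoint(value_range):
    low, high = value_range
    return (low + high) / 2


def gap_vs_4090(tier: str, gpu: str) -> float:
    """How many times faster the RTX 4090 is, comparing range midpoints."""
    return midpoint(TOK_S[tier]["RTX 4090"]) / midpoint(TOK_S[tier][gpu])


for tier in TOK_S:
    for gpu in ("RX 9070 XT", "RX 7900 XTX"):
        print(f"{tier} | {gpu}: RTX 4090 is {gap_vs_4090(tier, gpu):.2f}x faster")
```

On these numbers the RX 7900 XTX stays around 1.5× to 1.6× behind at every tier, while the RX 9070 XT’s gap widens as models grow.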

The RDNA 4 Vulkan twist

Here’s the nuance that most “ROCm is good now” coverage misses: on RDNA 4, the Vulkan backend in llama.cpp can actually outperform ROCm HIP by 14–30% for generation throughput.

The reason is a Wave32 vs Wave64 mismatch. RDNA 4 consumer GPUs execute in Wave32 (32 threads per wavefront). ROCm’s HIP backend was optimized for Wave64 execution on RDNA 3 and enterprise cards; the Wave32 implementation on RDNA 4 has known performance gaps. The Vulkan backend, which targets Wave32 directly, sidesteps this entirely.

There’s also an idle power bug in llama.cpp’s HIP backend on RDNA 4 that locks the GPU at elevated clock speeds until the process is killed. The Vulkan backend doesn’t have this issue.

Practical takeaway for RX 9070 XT owners: for llama.cpp inference on Linux or Windows, test the Vulkan backend first (a Vulkan build of llama-server, run with -ngl 99 -mg 0 to offload all layers to the first GPU). It may perform better than ROCm HIP on your specific model and quantization, and it runs cleaner. ROCm HIP remains the right call for vLLM and PyTorch-based workflows where Vulkan support doesn’t exist.
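One way to run that A/B test is to keep two llama.cpp builds side by side and launch each with identical flags. The sketch below only assembles the command lines; the binary paths and model path are hypothetical placeholders, and it assumes you have already built llama-server once with the Vulkan backend and once with HIP.

```python
import shlex


def llama_server_cmd(binary: str, model_path: str, port: int = 8080) -> list[str]:
    """Assemble a llama-server command with full GPU offload on GPU 0."""
    return [
        binary,
        "-m", model_path,   # model file to load
        "-ngl", "99",       # offload up to 99 layers to the GPU
        "-mg", "0",         # use the first GPU as the main device
        "--port", str(port),
    ]


# Hypothetical paths: adjust to wherever your two builds and model live.
MODEL = "models/example-8b-q4_k_m.gguf"
BUILDS = {
    "vulkan": "./build-vulkan/bin/llama-server",
    "hip": "./build-hip/bin/llama-server",
}

for name, binary in BUILDS.items():
    print(f"{name}: {shlex.join(llama_server_cmd(binary, MODEL))}")
```

Keeping every flag identical means any throughput difference you measure comes from the backend rather than the configuration.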

What still doesn’t work cleanly

The remaining friction points in May 2026:

RDNA 3 on Windows is effectively unsupported for the ROCm stack. You can use the experimental Vulkan path in Ollama or run llama.cpp with the Vulkan backend, but you’re outside AMD’s official support scope. If you buy an RX 7900 XTX for a Windows machine expecting ROCm to work like CUDA does, you’ll be disappointed.

Custom ComfyUI nodes on AMD still lag CUDA. ComfyUI Desktop added official ROCm support for Windows in January 2026, which is a genuine milestone. But custom nodes — ControlNet, IP-Adapter, some advanced samplers — often implement CUDA-specific paths first. AMD compatibility follows months later, if at all. If you’re deep in custom workflows, CUDA is safer.

vLLM in production on AMD requires Docker on Linux. Pre-built ROCm 7.2.1 wheels are available, but multi-GPU setups and production deployments lean on Docker with AMD’s ROCm nightly images. On Windows, vLLM isn’t a viable option without WSL2.

Flash attention 2 with paged attention — used by vLLM for efficient KV cache management — was added to ROCm later than the CUDA path and can require specific PyTorch + ROCm version matching. It works, but it’s not the no-friction experience CUDA users have.

If you’re evaluating AMD for local AI but don’t have hardware yet, RunPod lets you rent RTX 4090 or AMD MI300X instances while you sort out the hardware decision — useful for benchmarking your specific workload before buying.

Who should actually consider AMD in 2026

Buy RDNA 4 (RX 9070 XT) if:

You’re on Linux as your primary OS, or on Windows and specifically want AMD
Your budget is $600–$700 and you want 16 GB of VRAM (the RTX 4070 Super at $599 only gives you 12 GB)
Your workload is 8B–14B models and 90 tok/s feels fast enough
You’re OK with Vulkan as a fallback on Windows while ROCm matures

Buy RDNA 3 (RX 7900 XTX, used) if:

You’re on Linux
You want 24 GB of VRAM for 30B+ models without paying RTX 4090 prices
You can find used 7900 XTX cards at competitive prices — completed eBay listings fluctuate; verify before buying
You’re comfortable with Linux GPU driver setup and won’t need Windows ROCm support

Stick with NVIDIA if:

You’re primarily on Windows
ComfyUI custom nodes are a significant part of your workflow
You want vLLM in production beyond single-GPU setups
You don’t want to think about whether your toolchain supports your GPU backend

Consider NVIDIA over AMD on a sub-$500 budget: the RTX 4060 Ti 16GB at $449 has better Windows compatibility than any AMD card at that price, despite losing on raw bandwidth to the used RX 7900 XT.
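The VRAM figures in these recommendations follow from simple arithmetic. Here is a rough sketch, assuming about 4.5 bits per weight for a 4-bit quantization and 2 GB of headroom for KV cache and runtime buffers; both numbers are our assumptions, and real usage varies with context length and quantization format.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, vram_gb: float,
                 bits_per_weight: float = 4.5,   # assumed 4-bit-class quant
                 overhead_gb: float = 2.0) -> bool:  # assumed KV cache + buffers
    """Return True if weights plus overhead fit entirely in VRAM."""
    needed = weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


for params in (8, 14, 32, 70):
    for vram in (12, 16, 24):
        verdict = "fits" if fits_in_vram(params, vram) else "needs offload"
        print(f"{params}B on {vram} GB: {verdict}")
```

Under these assumptions a 32B model fits on a 24 GB card but not a 16 GB one, and 70B needs partial offload even at 24 GB.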

Honest take

AMD ROCm in 2026 is a workable choice — not a compromise you make by accident, but a deliberate decision with a real profile. The person for whom it makes sense runs Linux, cares about VRAM capacity per dollar, and isn’t deep in custom ComfyUI workflows.

For that person, the setup friction that defined AMD AI a couple of years ago is genuinely gone on Linux. You install ROCm 7.2, install Ollama, run your model. RDNA 3 and RDNA 4 both work without the environment variable gymnastics that used to be standard. That’s real progress.

The performance trade-off is also real. At 90 tok/s on an 8B model, the RX 9070 XT is plenty fast for interactive chat and coding assistance. It’s not the 135 tok/s you’d get from an RTX 4090, but the RTX 4090 costs $1,600–$2,000 used and the RX 9070 XT costs $629. The two cards aren’t in the same price class.
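Dividing throughput by price makes the comparison concrete. This is a minimal sketch using only the figures quoted in this section; it ignores power draw, resale value, and the rest of the build.

```python
def tok_s_per_dollar(tok_s: float, price_usd: float) -> float:
    """Throughput per dollar of purchase price."""
    return tok_s / price_usd


# Figures quoted above: 8B-model throughput and current prices.
rx_9070_xt = tok_s_per_dollar(90, 629)
rtx_4090_cheap = tok_s_per_dollar(135, 1600)   # low end of the used range
rtx_4090_pricey = tok_s_per_dollar(135, 2000)  # high end of the used range

print(f"RX 9070 XT: {rx_9070_xt:.3f} tok/s per dollar")
print(f"RTX 4090:   {rtx_4090_pricey:.3f} to {rtx_4090_cheap:.3f} tok/s per dollar")
print(f"Ratio:      {rx_9070_xt / rtx_4090_cheap:.1f}x to "
      f"{rx_9070_xt / rtx_4090_pricey:.1f}x in the RX 9070 XT's favor")
```

On an 8B model the cheaper card comes out ahead per dollar; the RTX 4090’s case rests on absolute speed and its 24 GB of VRAM.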

Where AMD loses consistently: Windows, heavy image generation, and any workload where CUDA-specific library optimizations (flash attention, INT4 Tensor Core matmuls) move the needle. On Windows specifically, the gap between what AMD promises and what actually works in your toolchain is still frustrating enough to cost you real time.

The “finally usable” verdict is conditional. On Linux with RDNA 3 or RDNA 4: yes, finally. On Windows with RDNA 4: mostly, for a narrowing set of tools. On Windows with RDNA 3: not yet, unless you’re comfortable with Vulkan workarounds as the primary path.

For context on full GPU selection including NVIDIA alternatives at every budget, see the GPU buying guide. For the head-to-head between the 20 GB RDNA 3 and 16 GB CUDA options at similar price points, the RTX 4060 Ti 16GB vs RX 7900 XT comparison has the decision matrix. If you’re evaluating vLLM specifically (it behaves differently on AMD vs NVIDIA at scale), the vLLM vs Ollama concurrency breakdown covers how the multi-user picture changes things.

Last updated May 19, 2026. Prices and specs change; verify current rates before purchasing.
