Jun 4, 2026

ROCm 7.2 on Ubuntu 24.04 for Local LLMs in 2026: Full Setup Guide for AMD GPUs

By RunAIHome Team · 13 min read

Tags: amd, rocm, ubuntu, local-llm, rdna3, rdna4, gpu, ollama, tutorial

TL;DR: ROCm 7.2.3 (released May 4, 2026) is the stable Ubuntu path for AMD GPU inference — RDNA 3 setup is rock-solid in under 20 minutes, RDNA 4 works with one Docker workaround for a known gfx1201 bug. AMD delivers 85–92% of equivalent NVIDIA throughput at a lower price point on Ubuntu.

What you’ll be able to do after this guide:

Install ROCm 7.2.3 on Ubuntu 24.04 LTS and run ollama serve with full GPU acceleration in under 20 minutes
Build llama.cpp with the HIP backend for maximum throughput on RDNA 3 and RDNA 4
Identify and work around the gfx1201 rocBLASLt crash that kills model loads on the RX 9070 XT

Honest take: On Ubuntu, a used AMD Radeon RX 7900 XTX at ~$800 is the best AMD card for local LLMs in 2026: 24 GB VRAM, 66–96 tok/s on Llama 3.1 8B Q4_K_M depending on llama.cpp version and batch size, and a ROCm 7.2.3 install that needs no environment variable hacks.

AMD’s local AI story on Linux has changed substantially. A year ago you’d fight missing kernel modules and half-broken pip wheels. Today, if you’re on a supported card and Ubuntu 24.04, setup is close to the CUDA experience: download a .deb, add yourself to two groups, reboot once, and ollama pull llama3.1:8b works.

The catches are smaller than they used to be, but they exist. RDNA 4 support (RX 9000 series) is still maturing in a specific way — one rocBLASLt lookup bug can SIGKILL your model load at the 2-minute mark every single time. Knowing where the landmine is before you start saves 90 minutes of frustrating debugging.

This guide covers the native Ubuntu install path. If you need AMD on Windows, see the AMD ROCm 7.2 on Windows guide — RDNA 3 is Linux-only for ROCm and the Windows path is a different story entirely.

Which cards are actually supported on Ubuntu 24.04

ROCm 7.2.3 divides AMD’s consumer lineup into three buckets:

Fully supported on Linux:

RDNA 4: RX 9070 XT, RX 9070, RX 9060 XT LP, Radeon AI PRO R9600D (gfx1201 / gfx1200)
RDNA 3: RX 7900 XTX, RX 7900 XT, RX 7900 GRE, RX 7800 XT, RX 7700 XT (gfx1100 / gfx1101 / gfx1102)

RDNA 3 consumer cards are Linux-only for ROCm. On Windows, the ROCm stack officially supports only RDNA 4 chips; the Windows guide linked above covers that path.

Supported via Vulkan only (no ROCm HIP):

RX 7600 and the RX 6000 series. These cards can run inference through llama.cpp’s Vulkan backend, but won’t get vLLM, PyTorch ROCm, or HIP acceleration.

Not supported at all:

RDNA 1 (RX 5000 series) and older. Vulkan may work, but inference speed makes these impractical for anything beyond tiny models.

VRAM is still the ceiling

For context on what each card can run:

RX 7900 XTX (24 GB): Qwen3-30B-A3B at Q4_K_M fits cleanly. Llama 3.3 70B Q4 in CPU-offload mode. Anything under 20B at Q4 is comfortable.
RX 9070 XT (16 GB): Llama 3.1 8B at full speed, Qwen3-14B at Q4, 27B MoE models technically fit but saturate the memory bus (6.3 tok/s on Qwen3.5-27B-A3B at Q4).
RX 9070 GRE (16 GB): Launched globally at $549 on June 2, 2026 — same VRAM and gfx1201 architecture as the 9070 XT, slightly less shader compute.

For a broader AMD vs NVIDIA VRAM comparison at the 16 GB tier, see AMD RX 9070 XT vs RTX 5060 Ti 16GB.

Step 1: Install ROCm 7.2.3 on Ubuntu 24.04

Start from Ubuntu 24.04.3 LTS. The amdgpu-install tool handles both the kernel driver and the ROCm userspace stack in a single package.

# Download the installer for Ubuntu 24.04 (noble)
wget https://repo.radeon.com/amdgpu-install/7.2.3/ubuntu/noble/amdgpu-install_7.2.3.70203-1_all.deb

# Install it
sudo apt install ./amdgpu-install_7.2.3.70203-1_all.deb
sudo apt update

# Install ROCm with the rocm usecase
sudo amdgpu-install --usecase=rocm --no-dkms

The --no-dkms flag skips DKMS kernel module compilation. On Ubuntu 24.04.3 with a 6.8.x kernel, RDNA 3 and RDNA 4 are already supported by the packaged kernel — invoking DKMS wastes 10 minutes and sometimes fails on systems with secure boot or custom kernels.
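The download URL embeds the Ubuntu codename (noble here; the FAQ notes jammy for 22.04). The sketch below derives it from /etc/os-release instead of typing it by hand. The function name installer_url is illustrative, and the version strings are copied from the wget line above, so they would need updating for any other release.

```shell
#!/bin/sh
# Hypothetical helper: build the amdgpu-install URL for a given Ubuntu codename.
# Version strings are copied from the wget command in this guide.
installer_url() {
    _cn="$1"
    printf 'https://repo.radeon.com/amdgpu-install/7.2.3/ubuntu/%s/amdgpu-install_7.2.3.70203-1_all.deb\n' "$_cn"
}

# Read the codename of the running system, if os-release is present.
if [ -r /etc/os-release ]; then
    codename="$(. /etc/os-release && printf '%s' "${VERSION_CODENAME:-}")"
    case "$codename" in
        noble|jammy) echo "Installer URL: $(installer_url "$codename")" ;;
        *) echo "Codename '${codename}' is not one this guide covers" ;;
    esac
fi
```

The output can be passed straight to wget; on any codename other than noble or jammy the sketch only prints a notice and downloads nothing.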

Step 2: Add user groups and reboot

This step trips up almost every first-time installer and the error messages when you skip it are not helpful. The ROCm compute stack requires your user to be in the render and video groups to access /dev/kfd (the GPU compute device node) without root.

sudo usermod -a -G render,video $USER

Reboot now. A newgrp session is not sufficient — the group membership must be part of your login session from the start. After reboot, verify:

groups
# Expected: ... render video ...

rocminfo | head -25

Expected output from rocminfo on an RX 9070 XT:

ROCk module is loaded
...
Agent 2
  Name:                    gfx1201
  Uuid:                    GPU-XXXXXXXXXXXXXX
  Marketing Name:          Radeon RX 9070 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE

If rocminfo hangs for more than 30 seconds or shows only Agent 1 (the CPU), you have a group issue or driver conflict. Check dmesg | grep amdgpu first — firmware errors here usually mean you need a linux-firmware update.

Step 3: Verify with rocm-smi

rocm-smi

This displays real-time GPU stats including temperature, power draw, and memory usage. At idle you’ll see 0% utilization — that’s normal. Run a model in the next step and check again to confirm the GPU is actually being used.

Step 4: Install Ollama with ROCm support

Ollama ships its own bundled ROCm libraries and auto-detects AMD GPUs on Linux:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M

While inference is running, open a second terminal and run watch -n 1 rocm-smi. You should see GPU memory jump to ~5.5 GB and compute utilization hit 90–95%.

If memory shows 0 MB allocated despite the model loading, Ollama may be using CPU. Run OLLAMA_DEBUG=1 ollama serve and check the startup logs — it will report which ROCm libraries it found and whether the GPU was initialized.
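As a cross-check that does not depend on rocm-smi's output format, the amdgpu kernel driver also exposes VRAM counters in sysfs. This is a minimal sketch, assuming the driver's mem_info_vram_used and mem_info_vram_total files (values in bytes) are present under /sys/class/drm/card*/device. The name vram_report is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print used/total VRAM in MiB for one amdgpu device directory.
vram_report() {
    _d="$1"
    if [ ! -r "$_d/mem_info_vram_used" ] || [ ! -r "$_d/mem_info_vram_total" ]; then
        return 1
    fi
    _used="$(cat "$_d/mem_info_vram_used")"
    _total="$(cat "$_d/mem_info_vram_total")"
    # sysfs reports bytes; integer division by 1048576 gives MiB.
    echo "$((_used / 1048576)) MiB used of $((_total / 1048576)) MiB"
}

# Report every DRM card that exposes the counters; other cards are skipped.
for dev in /sys/class/drm/card*/device; do
    vram_report "$dev" 2>/dev/null || true
done
```

If the reported usage stays near idle while a model is answering prompts, that points to the same CPU fallback described above.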

Step 5 (optional): Build llama.cpp with HIP

For direct llama.cpp inference — more control over layer offloading and context window than Ollama provides — the HIP backend delivers the best AMD throughput:

sudo apt install cmake git build-essential

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1201 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

Replace gfx1201 with your card’s architecture:

Card	Target arch
RX 9070 XT / 9070	`gfx1201`
RX 9060 XT	`gfx1200`
RX 7900 XTX / 7900 XT / 7900 GRE	`gfx1100`
RX 7800 XT	`gfx1101`
RX 7700 XT	`gfx1101`
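Rather than reading the target off a table, the architecture name can be taken from rocminfo, which prints it in each GPU agent's Name field (as in the Step 2 output). This is a rough sketch: gfx_targets is an illustrative name, and the parsing assumes the "Name: gfxNNNN" layout shown earlier in this guide.

```shell
#!/bin/sh
# Hypothetical helper: read rocminfo output on stdin and print the unique
# gfx targets as a semicolon-separated list, the form CMake list options take.
gfx_targets() {
    awk '$1 == "Name:" && $2 ~ /^gfx[0-9a-f]+$/ { print $2 }' | sort -u | paste -sd ';' -
}

# Only query the hardware when rocminfo is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
    echo "Suggested flag: -DAMDGPU_TARGETS=\"$(rocminfo | gfx_targets)\""
fi
```

On a system with an integrated AMD GPU the list may include that GPU's target as well; trimming it to the discrete card keeps the build shorter.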

Run the server:

./build/bin/llama-server \
  -m /path/to/model.gguf \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080

--n-gpu-layers 99 offloads all layers to GPU. If you hit VRAM limits, reduce this number to shift layers to CPU — useful when running 27B+ models on a 16 GB card.
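A starting value for --n-gpu-layers can be estimated with simple arithmetic before trial and error. The sketch below is a rule of thumb, not how llama.cpp itself allocates memory: it assumes every layer is about the same size (model file size divided by layer count) and that a fixed reserve is held back for the KV cache and scratch buffers. The function name and the example numbers are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many layers fit in a VRAM budget.
# Arguments: model file size (MiB), layer count, VRAM budget (MiB), reserve (MiB).
# Assumes all layers are roughly the same size, which is only approximately true.
layers_that_fit() {
    _model_mib="$1"; _n_layers="$2"; _budget_mib="$3"; _reserve_mib="$4"
    _per_layer=$(( (_model_mib + _n_layers - 1) / _n_layers ))   # round up
    _usable=$(( _budget_mib - _reserve_mib ))
    if [ "$_usable" -le 0 ]; then
        echo 0
        return 0
    fi
    _fit=$(( _usable / _per_layer ))
    if [ "$_fit" -gt "$_n_layers" ]; then
        _fit="$_n_layers"
    fi
    echo "$_fit"
}

# Illustrative numbers only: a 20000 MiB model with 64 layers on a 16 GB card,
# keeping 2048 MiB back for KV cache and scratch buffers.
layers_that_fit 20000 64 16384 2048
```

Treat the result as a first guess, then adjust up or down while watching rocm-smi during a real load.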

Real benchmarks

Results from community benchmarks on ROCm 7.x, Ubuntu 24.04, Ollama and llama.cpp HIP:

Card	VRAM	Model	Quant	tok/s
RX 7900 XTX	24 GB	Llama 3.1 8B	Q4_K_M	66–96
RX 9070 XT	16 GB	Llama 3.1 8B	Q4_K_M	~56
RX 9070 XT	16 GB	Qwen3:14B	Q4	52.2
RX 9070 XT	16 GB	GPT-OSS:20B	Q4	91.9
RX 9070 XT	16 GB	Qwen3.5:27B-A3B	Q4 (MoE)	6.3

The 66–96 tok/s variance on the RX 7900 XTX reflects different llama.cpp versions and batch size settings across community tests. Mid-range is roughly 80 tok/s for a clean Q4_K_M Llama 8B run.

For comparison: an RTX 4070 Super (12 GB, 504 GB/s) delivers roughly 62–70 tok/s on the same model. The RX 9070 XT, at ~56 tok/s, trails it despite higher memory bandwidth (640 GB/s), but it carries 4 GB more VRAM at a similar price ($669 vs ~$650 street, June 2026).

The MoE Qwen3.5:27B-A3B result (6.3 tok/s) deserves a note: the model technically fits in 16 GB at Q4, but the memory bus is saturated shuffling expert weights. Practical threshold for the RX 9070 XT is ~20B dense or ~30B MoE at Q4_K_M.

The gfx1201 bug: what happens and how to work around it

If you’re on an AMD Radeon RX 9070 XT and a model load reliably hangs then dies at the 2-minute mark, you’ve hit the rocBLASLt lookup bug. ROCm’s rocBLASLt library searches for gfx1200.dat instead of gfx1201.dat, fails to find it, and the model load process gets SIGKILL’d after the timeout.

Ollama log:

time=2026-06-04T12:34:56Z level=WARN msg="model load failed" error="signal: killed"

AMD engineers confirmed in ROCm/ROCm GitHub issue #5812 that HSA_OVERRIDE_GFX_VERSION=12.0.1 is not a proper solution — it’s a workaround that sometimes partially helps but doesn’t reliably fix the Tensile file lookup.
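To check whether repeated failures match the log line shown above, the service log can be scanned for that signature. This is a minimal sketch, assuming Ollama runs as a systemd unit named ollama (the default for the Linux install script). The function name is illustrative, and a "signal: killed" entry on its own does not prove this particular bug, since the kernel's out-of-memory killer produces the same signal.

```shell
#!/bin/sh
# Hypothetical helper: count load failures matching the log line shown above.
# Reads log text on stdin so it works with journalctl output or a saved file.
count_killed_loads() {
    grep -c 'msg="model load failed" error="signal: killed"' || true
}

# On a systemd install the Ollama service logs to the journal.
if command -v journalctl >/dev/null 2>&1; then
    n="$(journalctl -u ollama --no-pager 2>/dev/null | count_killed_loads)"
    echo "SIGKILL'd model loads in the journal: ${n:-0}"
fi
```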

What actually works: the Docker path

AMD’s official ROCm Docker images include pre-built rocBLASLt libraries with correct gfx1201 support:

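# OLLAMA_HOST=0.0.0.0 makes the server listen on all container interfaces;
# Ollama's default bind of 127.0.0.1 is not reachable through the published port.
docker run --device=/dev/kfd --device=/dev/dri --group-add=video \
  -e OLLAMA_HOST=0.0.0.0 \
  -v "$HOME/.ollama:/root/.ollama" \
  -p 11434:11434 \
  rocm/dev-ubuntu-24.04:7.2.3 \
  bash -c "curl -fsSL https://ollama.com/install.sh | sh && ollama serve"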

This adds ~8 GB of image overhead but eliminates the rocBLASLt issue. AMD has acknowledged the bug — a fix is expected in ROCm 7.2.4.

Alternative: Vulkan backend

llama.cpp’s Vulkan backend bypasses ROCm entirely and has no gfx1201 issue. For pure inference with no vLLM or PyTorch requirement, Vulkan on RDNA 4 is currently slightly faster than the ROCm HIP path for this chip anyway:

cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

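Vulkan needs no extra driver install on a standard Ubuntu 24.04 desktop: the open-source RADV Vulkan driver ships with Mesa, which sits on top of the in-kernel amdgpu module. Building the backend does require the Vulkan development headers and the glslc shader compiler.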

Common errors and fixes

/dev/kfd: Permission denied

You haven’t rebooted since adding yourself to render and video. A newgrp render session won’t work — log out completely or reboot.
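A quick way to tell a stale login session from a missing usermod step is to compare the session's active groups against the two that ROCm needs. This is a minimal sketch; missing_groups is an illustrative name, and id -nG is the standard command for listing the current session's groups.

```shell
#!/bin/sh
# Hypothetical helper: given a space-separated group list, print any of the
# required groups (render, video) that are missing; print nothing if all present.
missing_groups() {
    _have=" $1 "
    for _g in render video; do
        case "$_have" in
            *" $_g "*) ;;
            *) printf '%s\n' "$_g" ;;
        esac
    done
}

# id -nG reports the groups of the current login session, which is what matters here.
missing="$(missing_groups "$(id -nG)")"
if [ -n "$missing" ]; then
    echo "Current session lacks: $(echo $missing)"
else
    echo "render and video are both active in this session"
fi
```

If getent group render lists your username but this check still reports render as missing, the usermod step worked and only the session is stale.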

rocminfo shows only Agent 1 (CPU), no GPU agent

Check lspci | grep -i amd to confirm the GPU is visible to the OS. If it appears in lspci but not rocminfo, run dmesg | grep amdgpu; firmware errors there are usually fixed by sudo apt install linux-firmware && sudo update-initramfs -u && sudo reboot.

Ollama detects GPU but VRAM shows 0 MB

Ollama ships its own ROCm libraries under /usr/local/lib/ollama/rocm. A version mismatch between Ollama’s bundled libraries and your system ROCm can cause silent fallback to CPU. Fix: OLLAMA_DEBUG=1 ollama serve shows exactly which library path loads. If mismatched, remove system ROCm and let Ollama use its bundled version, or reinstall Ollama after ROCm is in place.

CUDA out of memory in llama.cpp HIP build

Despite the CUDA error string, this is a VRAM exhaustion error in the HIP backend — the error message is inherited from shared CUDA/HIP code paths. Reduce --n-gpu-layers to offload fewer layers to GPU, or lower --ctx-size to 2048 to cut KV cache VRAM usage.
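To see how much lowering --ctx-size actually frees, the KV cache size can be worked out directly: two tensors (K and V) per layer, each holding KV heads × head dimension values per token. The sketch below uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and assumes 16-bit cache entries. Other models and cache quantization settings will differ, and kv_cache_mib is an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: KV cache size in MiB for a dense transformer.
# Arguments: layers, KV heads, head dimension, context length, bytes per element.
kv_cache_mib() {
    _layers="$1"; _kv_heads="$2"; _head_dim="$3"; _n_ctx="$4"; _elem_bytes="$5"
    # Factor of 2 covers the separate K and V tensors.
    echo $(( 2 * _layers * _kv_heads * _head_dim * _n_ctx * _elem_bytes / 1048576 ))
}

# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128, 16-bit cache entries.
for ctx in 2048 8192 32768; do
    echo "ctx=$ctx -> $(kv_cache_mib 32 8 128 "$ctx" 2) MiB"
done
```

Under these assumptions the cache grows from 256 MiB at 2048 tokens to 4096 MiB at 32768, which is why context length is the first thing to cut when VRAM runs out.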

Build fails with AMDGPU_TARGETS not found

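Make sure ROCm’s HIP compiler (hipcc) is on your PATH: which hipcc. ROCm installs it under /opt/rocm/bin, which is not always added to PATH automatically. If the binary is missing altogether, the HIP development packages were not installed. Run sudo amdgpu-install --usecase=rocm --no-dkms again, then sudo apt install rocm-hip-sdk.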

PyTorch and vLLM setup

For training workflows or high-concurrency serving with vLLM, AMD’s PyTorch ROCm 6.3 wheels install cleanly on Ubuntu 24.04:

pip3 install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/rocm6.3

Verify GPU detection:

python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# Output: True / AMD Radeon RX 9070 XT

PyTorch uses HIP internally — torch.cuda still works because HIP mirrors the CUDA API. Pre-built vLLM wheels for ROCm 7.2.1 are maintained by AMD engineers and available via pip. For a full vLLM vs Ollama comparison on AMD, see vLLM vs Ollama: when each one wins.

Is AMD worth it on Ubuntu in 2026?

On Ubuntu specifically, yes — with caveats.

The RX 7900 XTX used market has settled around $800 on eBay (June 2026). At that price you get 24 GB VRAM, ~80 tok/s on Llama 3.1 8B Q4, and a ROCm install that just works without workarounds. The RTX 4090 on the used market runs $1,100–$1,200 for 24 GB. AMD wins on price per GB of VRAM, loses marginally on throughput per dollar.
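The price-per-GB comparison is easy to reproduce from the figures quoted in this section. A small sketch, with price_per_gb as an illustrative helper name:

```shell
#!/bin/sh
# Hypothetical helper: dollars per GB of VRAM, to one decimal place.
price_per_gb() {
    awk -v price="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", price / gb }'
}

# Figures are the used-market prices quoted in this section.
echo "RX 7900 XTX: \$$(price_per_gb 800 24)/GB"
echo "RTX 4090:    \$$(price_per_gb 1100 24) to \$$(price_per_gb 1200 24)/GB"
```

With those inputs the 7900 XTX comes out at about $33 per GB against roughly $46 to $50 per GB for the used RTX 4090.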

The RX 9070 XT at $669 new is the RDNA 4 case for AMD: 640 GB/s bandwidth, 16 GB VRAM, ROCm officially supported. The Docker workaround for the gfx1201 bug is a speed bump, not a blocker, and it should be fixed in 7.2.4.

What AMD can’t match: Windows ROCm for RDNA 3 (still Linux-only), the breadth of CUDA-specific extensions for custom training code, and a decade of community debugging guides. For single-machine Ubuntu inference, fine-tuning with QLoRA, and vLLM serving, AMD’s gap with NVIDIA has narrowed to a few percentage points of throughput and an occasional driver bug.

If you’re spending your own money on Ubuntu local AI and don’t have a specific CUDA dependency, the math increasingly favors AMD. Cloud AMD compute (MI300X on RunPod) is also worth knowing about for workloads that exceed consumer VRAM.

FAQ

Does this guide work on Ubuntu 22.04?

Yes. AMD officially supports Ubuntu 22.04.5 LTS. Replace noble with jammy in the wget URL. Ubuntu 24.04 is preferred for newer hardware — 22.04’s kernel is older and may need more firmware updates for RDNA 4.

Can I use an RX 7600 or RX 6700 XT?

Not with ROCm. Consumer cards below RX 7700 XT aren’t in AMD’s ROCm support matrix. Use llama.cpp with -DGGML_VULKAN=ON — Vulkan works on a much wider range of AMD cards and doesn’t require ROCm drivers at all.

What’s the difference between ROCm and HIP?

ROCm is the full platform stack (drivers, runtime libraries, tools). HIP is the programming API within ROCm that mirrors CUDA semantics — it’s how PyTorch, llama.cpp, and vLLM talk to the hardware. When you build llama.cpp with -DGGML_HIP=ON, you’re compiling against the HIP API inside the ROCm stack.

I have a Ryzen AI Max+ 395 (Strix Halo). Does this apply?

The ROCm install procedure is the same, but the Strix Halo’s integrated GPU (gfx1151) has different performance characteristics from discrete cards. See the Ryzen AI Max+ 395 LLM guide for unified memory inference specifics.

Ollama or direct llama.cpp — which is faster on AMD?

Speed difference is minor for single-user inference. Ollama is easier to get running. Direct llama.cpp gives you more control over --n-gpu-layers and context window, which matters when you’re right at the VRAM ceiling. For multi-user serving, vLLM wins — see vLLM vs Ollama.

Last updated June 4, 2026. Prices and specs change; verify current rates before purchasing.
