Jun 3, 2026

AMD Ryzen AI Max+ 395 (Strix Halo) for Local LLMs in 2026: 128GB Unified Memory, 100 t/s on 30B Models, and Whether It Beats a Discrete GPU

By RunAIHome Team · 13 min read

amdryzen-ai-maxstrix-halolocal-llmmini-pchardwarebenchmark

TL;DR: The AMD Ryzen AI Max+ 395 hits 100 t/s on Qwen3-30B and runs 120B models that physically don’t fit on any single consumer discrete GPU — in a $1,499–$1,999 mini PC. It’s bandwidth-constrained (256 GB/s vs 1,792 GB/s on an RTX 5090), so for models under 32B a discrete GPU is faster. The machine earns its price for one audience: people who need 70B+ fully in GPU memory, without a dedicated GPU tower.

	Strix Halo Mini PC	RTX 5060 Ti 16GB Build	Mac Mini M4 Pro 48GB
Best for	70B–120B models entirely in GPU memory	≤13B at 80–130 t/s, budget build	30–70B, silent, efficient
Price	$1,499–$1,999 (complete)	~$1,400 (complete build)	$1,399 (complete)
The catch	Bandwidth bottleneck; Linux preferred	Hard ceiling at 16GB VRAM	48GB max without $4,999 Ultra

Honest take: For Llama 3.3 70B or DeepSeek R1 70B fully in GPU memory without a dedicated GPU tower, the GMKtec EVO-X2 at $1,499 is hard to beat on x86 — but if you can live with 48GB, a Mac Mini M4 Pro is simpler and draws less than half the power.

What Strix Halo actually is

Strix Halo is AMD’s internal codename for the die inside the Ryzen AI Max+ 395. The unusual part is the memory architecture: 128 GB of LPDDR5X-8000 on a 256-bit bus, shared between the CPU and an integrated 40-compute-unit RDNA 3.5 GPU (the Radeon 8060S). There’s no PCIe bottleneck, no VRAM ceiling separate from system RAM — the GPU sees all 128 GB at full memory bandwidth.

In practice, the chip can allocate up to 96 GB to the GPU, leaving the remaining 32 GB for the OS and CPU-side workloads. That’s a larger GPU memory pool than any consumer discrete GPU, including the RTX 5090 (32 GB).

The rest of the spec sheet: 16 Zen 5 CPU cores clocked up to 5.1 GHz, a 50+ TOPS XDNA 2 NPU, and a configurable TDP range of 45W–120W with a 55W default. AMD fabbed it on TSMC 4nm. These chips ship in mini PCs — no PCIe card to install, no separate PSU math to worry about.

The actual benchmark numbers

All results below come from community testing on a Beelink GTR9 Pro running Ubuntu 24.04 with Mesa RADV (kisak PPA, version 26.0.6–26.1.1), llama.cpp builds b9049–b9467, Ollama 0.23.1, and AMD_VULKAN_ICD=RADV set for the Vulkan backend. Tested May–June 2026.

Model	Quantization	Backend	Generation (t/s)
Qwen3-30B-A3B (MoE)	IQ4_XS	RADV Vulkan	100.04
Qwen3-Coder 30B-A3B	Q4_K_S	RADV Vulkan	98.51
Qwen3-Coder 30B-A3B	UD-Q4_K_XL	RADV Vulkan	96.76
GPT-OSS 120B	MXFP4	RADV Vulkan	55.57
Qwen3.6	Q4_0 (speed-first)	RADV Vulkan	~81
Qwen3.6	balanced	RADV Vulkan	~63

100 t/s on a 30B model is comfortable real-time speed for single-user inference. The more striking number is GPT-OSS 120B at 55 t/s: a 120-billion-parameter model running entirely in unified memory at a speed that makes it useful for single-user chat.

Why MoE models run faster here: the 30B-A3B variants (Qwen3’s Mixture-of-Experts architecture) activate only ~3B parameters per forward pass despite having 30B total weights. On a bandwidth-constrained system, fewer weights loaded per token means directly higher tokens/sec. If you’re running Strix Halo hardware, prioritize MoE-architecture models — the performance advantage is significant.

The real-world bandwidth measurement confirms the constraint: the system delivers ~215 GB/s measured versus the theoretical 256 GB/s peak, a ~16% gap typical for LPDDR5X under mixed CPU+GPU load.

What fits in memory — and what doesn’t

The GPU can access up to 96 GB of the 128 GB pool. At Q4_K_M quantization:

Model	Approx. VRAM needed	Fits on Strix Halo?	Fits on RTX 4090 (24GB)?
Llama 3.3 70B	~42–48 GB	Yes — ~48 GB headroom left	No (CPU offload needed)
Qwen3-30B (dense)	~18 GB	Yes	Yes
DeepSeek R1 70B distill	~42 GB	Yes	No (CPU offload needed)
GPT-OSS 120B	~65–70 GB	Yes (tight)	No
DeepSeek R1 671B	~380 GB	No — needs multi-node	No
Llama 4 Maverick 402B	~230+ GB	No	No

An RTX 5060 Ti 16GB hits its ceiling around 13B Q4_K_M. An RTX 4090 at 24 GB tops out near 20B before requiring CPU offloading. On Strix Halo, Llama 3.3 70B loads entirely into the GPU memory pool — no CPU offloading, no PCIe bottlenecking. The VRAM math behind these numbers is covered in detail in How Much VRAM Do You Need for Llama Models.

Strix Halo vs a discrete GPU build

This is the decision that actually matters, and the answer is unambiguous in both directions.

Where discrete GPUs win: every model under 32 GB. An RTX 5090 generates ~186 t/s on Qwen3 8B Q4_K_M. The same model on Strix Halo runs around 80–90 t/s. Memory bandwidth is the reason: 1,792 GB/s on the RTX 5090 vs ~215 GB/s real-world on Strix Halo. For a daily 7B or 14B coding assistant — see the local AI coding stack with Continue.dev + Ollama — a mid-range discrete GPU outperforms Strix Halo and often costs less.

Where Strix Halo wins: any model above 24 GB that you need running fully in GPU memory. An RTX 4090 can’t load Llama 3.3 70B without splitting layers to CPU RAM, which drops generation speed to 2–5 t/s. Strix Halo loads it in ~40 seconds and generates at ~30–35 t/s. That’s a 6–15× speed difference on the same model.

The cloud comparison matters here too. Running Llama 3.3 70B on RunPod costs $0.29–$0.59/hour depending on GPU availability. At $1,499 for a GMKtec EVO-X2 running 6 hours/day, you break even at roughly 700–1,400 hours of use — around 4–8 months of daily active use. After that, every inference is free. We ran this calculation in detail in the RunPod vs local GPU: when to rent vs buy article.

The Mac comparison

The closest comparison is the Mac Mini M4 Pro, which starts at $1,399 with 24 GB unified memory and maxes out at 48 GB for $1,799. Its memory bandwidth is 273 GB/s — slightly above Strix Halo’s real-world 215 GB/s.

For models that fit in 48 GB, the Mac Mini M4 Pro holds three advantages: substantially better power efficiency (20–30W under LLM load vs 60–120W for Strix Halo mini PCs), meaningfully more mature software (Metal via MLX is better-tuned than AMD’s Vulkan/ROCm path on Linux), and quieter operation under sustained load.

Strix Halo’s advantage is the 128 GB tier. If you need the full 96 GB GPU pool for 70B+ models, the Mac route to 128 GB requires the Mac Studio M4 Ultra at $4,999. A $1,499–$1,999 Strix Halo mini PC delivers 96 GB GPU memory at roughly one-third the price — the software experience is rougher, but the hardware value is real.

What you can actually buy right now

Prices verified June 2026:

Machine	Memory	Storage	Price	Notes
GMKtec EVO-X2	128GB LPDDR5X	2TB NVMe	~$1,499	Best value, 2.5GbE
Beelink GTR9 Pro	128GB LPDDR5X	2TB NVMe	$1,899–$1,999	Dual 10GbE, better cooling
MINISFORUM MS-S1 Max	128GB LPDDR5X	2TB NVMe	~$2,299	Available on Newegg
GMKtec EVO-X2 (64GB)	64GB LPDDR5X	1TB NVMe	~$1,099	GPU pool ~48GB, still runs 70B

The GMKtec EVO-X2 at $1,499 is the price-performance sweet spot. It has the same CPU and GPU as the Beelink GTR9 Pro and omits the dual 10GbE NICs — which you don’t need for single-user home inference. The Beelink’s dual 10GbE matters if you’re running a shared home AI server. For that use case, the Open WebUI multi-user setup guide covers the server configuration.

On the 64GB variant: at 64 GB, the GPU’s accessible pool drops to roughly 48 GB, which still loads Llama 3.3 70B Q4_K_M comfortably with room for context. The 64GB EVO-X2 at ~$1,099 is a legitimate entry point if budget is tight and you don’t need 120B models.

Setting up for LLM inference

Linux (Ubuntu 24.04): the fast path

Ollama auto-detects Vulkan on AMD GPUs. Start by installing Mesa RADV from the kisak PPA for the best-performing drivers:

sudo add-apt-repository ppa:kisak/kisak-mesa
sudo apt update && sudo apt install mesa-vulkan-drivers vulkan-tools

Verify the GPU is visible to Vulkan:

vulkaninfo --summary | grep deviceName
# Expected output: AMD Radeon 8060S (or similar Radeon gfx1151 variant)

Then install Ollama and load a model:

curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:30b

Ollama offloads to the Vulkan GPU automatically. For maximum throughput with llama.cpp directly, set AMD_VULKAN_ICD=RADV before running benchmarks.

Common problem: model stuck on CPU

If ollama ps shows CPU next to a loaded model, Vulkan detection failed. Fix:

# Confirm Vulkan sees the AMD GPU:
vulkaninfo 2>/dev/null | grep -i amd
# If empty: reinstall mesa-vulkan-drivers from the kisak PPA

# Force RADV over AMDVLK (if both are installed):
export AMD_VULKAN_ICD=RADV

Critical BIOS settings for full VRAM access

In your mini PC’s BIOS, set UMA Frame Buffer Size to 512MB and disable IOMMU. Without this, the GPU may only see a fraction of the 128 GB pool. Also add these kernel parameters to /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=31457280"

Run sudo update-grub after editing, then reboot. Skip this step and you may see the GPU capped at 4–16 GB even on a 128 GB machine.

ROCm alternative

The gfx1151 architecture (Strix Halo) is listed as Preview support in AMD’s ROCm compute compatibility matrix. ROCm works for llama.cpp and Ollama inference, but requires setting HSA_OVERRIDE_GFX_VERSION=11.5.1 in your environment. Most users get better results with less setup via the Mesa RADV Vulkan path described above.

Windows

ROCm has no official Windows support for Strix Halo. LM Studio 0.3.x with its Vulkan backend works out of the box and is the practical choice for non-technical Windows users. Performance runs roughly 20–30% below the Linux/RADV path on the same hardware.

Power consumption math

At 120W sustained (full TDP, running a 120B model), the Beelink GTR9 Pro draws:

120W × 24h × $0.12/kWh = $0.35/day = $10.37/month

For a more realistic schedule — 6 hours/day active inference at ~100W average:

100W × 6h × $0.12/kWh = $0.072/day = $2.16/month

For comparison, a full desktop with an RTX 4090 draws 400–500W under GPU load. At the same 6h/day schedule that’s $8.64–$10.80/month in electricity — roughly 4–5× more for the same hours of use. The Strix Halo mini PC wins on power efficiency because there’s no separate discrete GPU, and the chip’s 55W default TDP is genuinely low for what it delivers.

The full running-cost calculation framework is in the power bill math for 24/7 AI servers article.

Honest take

The Ryzen AI Max+ 395 mini PCs have a clear use case: you want 70B+ local inference in GPU memory, without a GPU tower, on a $1,500–$2,000 budget, and you’re comfortable with Linux.

The GMKtec EVO-X2 at $1,499 is the entry point. The Beelink GTR9 Pro at $1,999 adds dual 10GbE for home server use. Both machines are complete systems — just add a monitor, configure Ubuntu, and you’re running 70B models in an afternoon.

If you need 70B in-memory but strongly prefer Apple’s ecosystem and software maturity, the Mac Mini M4 Pro at $1,399 (48 GB) is the simpler path — better software, quieter, draws half the power. The constraint: 48 GB tops, unless you step up to the $1,999 Mac Mini M4 Pro Max or the $4,999 Mac Studio M4 Ultra.

For anything under 32B — and especially for fast coding-assistant workflows — skip both Strix Halo and Apple Silicon. A used RTX 3090 in a desktop delivers twice the generation speed on 7–14B models for $500–$600, as detailed in the used RTX 3090 value analysis.

FAQ

Does the RTX 4090 beat Strix Halo on 70B models?
No. The RTX 4090 has 24 GB VRAM, so Llama 3.3 70B Q4_K_M (~42 GB) requires CPU offloading, which drops generation speed to 2–5 t/s. Strix Halo runs the same model fully in GPU memory at ~30–35 t/s — a 6–15× difference on that specific workload.

Can I use ROCm instead of Vulkan?
Yes, but the setup is more involved. The gfx1151 architecture is Preview status in AMD’s ROCm compute compatibility list. Set HSA_OVERRIDE_GFX_VERSION=11.5.1 in your environment. Most users get better results and simpler setup from the Mesa RADV Vulkan path.

What’s different between Strix Halo and a standard Ryzen APU like the 8845HS?
The memory bus. The Ryzen 7 8845HS uses a 128-bit bus with ~88 GB/s bandwidth and maxes out at 32 GB RAM. Strix Halo uses a 256-bit bus at 256 GB/s and supports 128 GB. That difference is what separates “runs 7B models” from “runs 70B models entirely in GPU memory.”

Is the 64GB model worth buying?
The 64 GB variant caps the GPU pool at ~48 GB, which comfortably fits Llama 3.3 70B Q4_K_M. If your primary target is 70B inference, the 64 GB EVO-X2 at ~$1,099 delivers that capability. Step up to 128 GB only if you want 120B models or large context windows on 70B.

Does Windows work well for LLM inference on Strix Halo?
LM Studio 0.3.x with Vulkan backend works without configuration. Ollama on Windows also works but may need manual GPU forcing. Raw throughput is 20–30% lower than the Linux/RADV path. If you’re primarily a Windows user and performance matters, consider a dual-boot Ubuntu setup for inference workloads.

Sources

Last updated June 3, 2026. Prices and benchmark results may change as driver and model releases evolve; verify current rates before purchasing.

Recommended Gear

Was this article helpful?