Jun 2, 2026

Ollama MLX on Apple Silicon in 2026: What 2× Faster Inference Means for M-Series Mac Users

By RunAIHome Team · 13 min read

Tags: ollama, apple-silicon, mlx, local-ai, mac, m4, inference

TL;DR: Ollama 0.19 (March 31, 2026) adds an opt-in preview backend built on Apple’s MLX framework. On qualifying Macs it replaces the llama.cpp Metal path and nearly doubles decode speed, from ~58 to ~112 tokens/second on an M4 Max. The hard requirement is 32GB or more of unified memory; 8GB, 16GB, and 24GB Macs still run the old Metal path with no change. If you clear that bar and run Qwen3.5-35B-A3B, the upgrade takes 30 seconds and the difference is real.

What you’ll be able to do after following this guide:

Enable the MLX backend in Ollama 0.19 and verify it’s actually active
Run Qwen3.5-35B-A3B at 70–80 tok/s on a Mac Mini M4 Pro 48GB
Understand why 16GB Macs are untouched and whether upgrading makes financial sense

Honest take: If your Mac has 32GB or more of unified memory, set OLLAMA_USE_MLX=1 this afternoon — you’ll feel the difference on the first response. If you’re on 16GB, nothing changed for you in this release, and that’s okay.

What Changed in Ollama 0.19

For its first three years, Ollama ran inference on Mac through llama.cpp’s Metal backend. That’s the same C++ engine that powers Ollama on Windows and Linux, ported to Apple’s GPU via Metal shaders. It worked, but it was designed for portability across very different hardware architectures — not to extract the last drop of performance from Apple Silicon’s unusual memory design.

Ollama 0.19, released March 31, 2026, introduces a new foundation for Mac inference, shipped as an opt-in preview. The new backend is built on Apple’s MLX framework, an open-source array computation library Apple released in December 2023 specifically for machine learning on Apple Silicon. Unlike the Metal-via-llama.cpp path, MLX treats unified memory as the architectural primitive from day one.

The result, benchmarked on M4 Max running Qwen3.5-35B-A3B at int4 quantization:

Metric	Ollama 0.18 (Metal)	Ollama 0.19 (MLX)	Change
Prefill (prompt processing)	1,154 tok/s	1,810 tok/s	+57%
Decode (generation speed)	57.8 tok/s	112 tok/s	+93%

Prefill determines time to first token: the wait between submitting a prompt and seeing output begin, which grows with context length. Decode is what you feel during streaming output. At 112 tok/s, a 500-token response is fully rendered in under 5 seconds. At 58 tok/s, the same response takes roughly 8.6 seconds, nearly twice as long. Both are usable; the gap is noticeable.
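The arithmetic behind those waits is easy to script. A minimal sketch, using only the throughput figures from the table above; the function name gen_seconds is ours, and it assumes prefill and decode run back to back with no other overhead:

```shell
# Estimate response time from prefill and decode throughput.
# Usage: gen_seconds PROMPT_TOKENS OUTPUT_TOKENS PREFILL_TPS DECODE_TPS
gen_seconds() {
  awk -v p="$1" -v n="$2" -v pre="$3" -v dec="$4" \
    'BEGIN { printf "%.1f\n", p / pre + n / dec }'
}

# Decode only: a 500-token reply
gen_seconds 0 500 1154 57.8   # Ollama 0.18 (Metal)
gen_seconds 0 500 1810 112    # Ollama 0.19 (MLX)

# Same reply after an 8,000-token prompt (illustrative prompt size)
gen_seconds 8000 500 1154 57.8
gen_seconds 8000 500 1810 112
```

With a long prompt the prefill term starts to dominate, which is why the +57% prefill gain matters for document-heavy workloads even though the decode number gets the headline.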

Why MLX Beats llama.cpp Metal

Apple Silicon’s defining advantage for local AI is unified memory architecture (UMA): every M-series chip has a single physical memory pool shared by the CPU, GPU, and Neural Engine. There’s no VRAM vs. system RAM split, and no PCIe bus between them. A model loaded into memory is immediately accessible to the GPU without any copy overhead.

llama.cpp’s Metal backend understands this at the kernel level, but the library was built for cross-platform portability. It treats the GPU the way it would on any other system — with explicit paths for “moving data to the GPU” even when that move is a no-op on Apple Silicon.

MLX was designed with UMA as the first assumption, not an edge case to handle. Arrays live in the shared address space by default. The GPU accesses model weights directly, through a lazy evaluation graph that batches GPU dispatches more efficiently than llama.cpp’s imperative kernel dispatch model.

The outcome, as measured across multiple benchmarks: MLX typically delivers 20–40% higher token throughput for autoregressive generation vs. the llama.cpp Metal backend on the same Apple Silicon hardware. The 93% headline gain on M4 Max also reflects more aggressive utilization of its 410–546 GB/s memory bandwidth.

Hardware Requirements: The 32GB Floor

The Ollama 0.19 MLX backend has one hard requirement: 32GB or more of unified memory. If you’re below that, Ollama 0.19 falls back to llama.cpp Metal automatically — no error message, no speed change, no indication anything is different.

The floor isn’t arbitrary. The only model currently MLX-accelerated in this preview release — Qwen3.5-35B-A3B — needs roughly 20GB at Q4 quantization. A 32GB Mac runs it with ~12GB left for KV cache, which is workable for single-user chat. For longer documents or multi-turn sessions, 48GB is more comfortable.
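If you are unsure which side of the floor your machine is on, you can check from the terminal. A minimal sketch: the function name mlx_eligible is ours, and it assumes the 32GB floor corresponds to 32 GiB as reported by macOS, which is our reading rather than a documented Ollama check:

```shell
# Return success if MEM_BYTES meets the 32GB floor described above.
# Assumption: the floor is 32 GiB (32 * 1024^3 bytes).
mlx_eligible() {
  local mem_bytes="$1"
  local floor_bytes=$((32 * 1024 * 1024 * 1024))
  [ "$mem_bytes" -ge "$floor_bytes" ]
}

# On macOS, `sysctl -n hw.memsize` prints installed memory in bytes:
#   if mlx_eligible "$(sysctl -n hw.memsize)"; then
#     echo "32GB or more: MLX eligible"
#   else
#     echo "Below 32GB: stays on llama.cpp Metal"
#   fi
```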

Here’s where each current Mac configuration stands:

Mac	Memory options	MLX eligible?	Notes
MacBook Air M3 / M4	8GB, 16GB, 24GB; 32GB (M4 only)	32GB M4 only	M3 and all sub-32GB configs stay on Metal
MacBook Pro M4	16GB, 24GB, 32GB, 48GB	32GB and 48GB	32GB is the practical entry point
Mac Mini M4	16GB, 32GB	32GB only	Affordable MLX entry at ~$799
Mac Mini M4 Pro	24GB, 48GB, 64GB	48GB and 64GB	Best value per tok/s for this use case
Mac Studio M4 Max	36GB, 64GB, 128GB	All configs	Full benefit; base at $1,999
Mac Studio M3 Ultra	96GB, 256GB, 512GB	All configs	High bandwidth (800 GB/s)

The Mac Mini M4 Pro with 48GB at ~$1,799 is the sweet spot for Ollama MLX work. You get the M4 Pro’s 273 GB/s memory bandwidth, 48GB headroom for Qwen3.5 with context room to spare, and a significantly lower cost than the Mac Studio. We covered its broader local AI capabilities in our Mac Mini M4 Pro deep dive.

How to Enable the MLX Backend

Ollama 0.19 ships MLX as a preview feature — off by default. One environment variable activates it:

# Confirm you're on 0.19 or later
ollama --version
# ollama version is 0.19.0

# Enable the MLX backend for this session
export OLLAMA_USE_MLX=1

# Start Ollama
ollama serve

Then run a model from a second terminal. The server log in the first terminal will confirm which backend is active:

time=2026-03-31T09:12:04 level=INFO source=server.go msg="using mlx backend"

If you don’t see that log line, you’re still on llama.cpp Metal — either because you’re under 32GB of unified memory, or because the model you’re running isn’t yet MLX-accelerated (more on that below).

To make the setting permanent across reboots, add it to your shell profile:

echo 'export OLLAMA_USE_MLX=1' >> ~/.zshrc
source ~/.zshrc

If you run Ollama through the macOS desktop app rather than ollama serve, the app does not read your shell profile. Set the variable with launchctl instead:

launchctl setenv OLLAMA_USE_MLX 1
# Then quit and relaunch Ollama.app
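Before starting the server, it is worth confirming that the flag is actually visible where you think it is. A small sketch; the helper name mlx_flag_status is ours, and it only inspects the current shell's environment:

```shell
# Report whether the current shell would pass the flag to `ollama serve`.
mlx_flag_status() {
  if [ "${OLLAMA_USE_MLX:-}" = "1" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

mlx_flag_status

# For the desktop app, query the launchd environment instead:
#   launchctl getenv OLLAMA_USE_MLX
```

An empty result from launchctl getenv means the desktop app will not see the variable, regardless of what your shell profile says.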

The Silent Fallback Problem

Here’s the issue people consistently run into: you set OLLAMA_USE_MLX=1, run a model, and it feels identical to before. No error. No warning. Just the same speed.

What’s happening: Ollama silently fell back to llama.cpp Metal because you’re running Llama 3.3, Mistral, or Phi — models not yet ported to the MLX backend. The variable is set, but MLX support doesn’t exist for that architecture yet, so Ollama does the sensible thing and keeps running.

The fix is checking the server output. If you see using mlx backend in the ollama serve log when you run a model, MLX is active. If you see nothing relevant, llama.cpp is handling it. This is a preview release — checking the log is the only reliable way to confirm what’s executing.
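Scanning a scrolling log by eye is error-prone, so here is a minimal sketch that searches a saved log for the message quoted above. The helper name backend_from_log and the log path are our choices. "fallback" here only means the marker was not found, which is also what you would see if no request has been logged yet:

```shell
# Print "mlx" if the log contains the MLX marker, otherwise "fallback".
# Usage: backend_from_log LOG_FILE
backend_from_log() {
  if grep -q 'using mlx backend' "$1"; then
    echo "mlx"
  else
    echo "fallback"
  fi
}

# Example: capture the server output, run a prompt, then check.
#   OLLAMA_USE_MLX=1 ollama serve > /tmp/ollama-serve.log 2>&1 &
#   (run a model from another terminal)
#   backend_from_log /tmp/ollama-serve.log
```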

Performance by Chip

The MLX improvement scales with memory bandwidth, since bandwidth is the primary bottleneck for autoregressive token generation at large model sizes. Based on confirmed benchmarks and community measurements for Qwen3.5-35B-A3B Q4:

Chip	Memory bandwidth	Decode with Metal	Decode with MLX	Improvement
M4 Pro	273 GB/s	~45 tok/s	~75 tok/s (est.)	~+65%
M4 Max (32-core GPU)	410 GB/s	57.8 tok/s	112 tok/s	+93%
M4 Max (40-core GPU)	546 GB/s	~65 tok/s	~130 tok/s (est.)	~+100%
M3 Ultra	800 GB/s	~80 tok/s	~140 tok/s (est.)	~+75%

The M4 Max 32-core GPU numbers are the confirmed benchmarks from Ollama’s March 29, 2026 internal testing. The M4 Pro ~75 tok/s figure comes from community reports on M4 Pro hardware, which is why it is marked as an estimate. The M4 Max 40-core GPU and M3 Ultra numbers are estimates scaled from the confirmed results and memory bandwidth ratios; treat them as directional.
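To see why bandwidth sets the pace, it helps to compute a rough ceiling: each generated token has to read the active weights from memory at least once. The sketch below is a back-of-envelope bound, not a prediction. It assumes 3B active parameters (from the model name) at exactly 4 bits per weight, and it ignores KV cache reads, quantization metadata, and expert routing overhead. The function name decode_ceiling is ours:

```shell
# Upper bound on decode speed when memory bandwidth is the only limit:
#   tok/s <= bandwidth / bytes read per token
# Usage: decode_ceiling BANDWIDTH_GBPS ACTIVE_PARAMS_BILLIONS BITS_PER_WEIGHT
decode_ceiling() {
  awk -v bw="$1" -v params="$2" -v bits="$3" \
    'BEGIN { printf "%.0f\n", bw / (params * bits / 8) }'
}

decode_ceiling 273 3 4   # M4 Pro
decode_ceiling 410 3 4   # M4 Max (32-core GPU)
```

Measured numbers sit well below this bound on both backends, as you would expect for a ceiling that ignores everything except weight reads. Read that way, the MLX gain is a move closer to what the hardware allows.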

M5 Is a Different Story

Owners of M5, M5 Pro, or M5 Max Macs get an extra layer of acceleration on top of the MLX efficiency gain. Apple embedded GPU Neural Accelerators into every GPU core on M5-family chips — dedicated matrix multiplication hardware comparable to NVIDIA’s Tensor Cores, but tightly coupled to the unified memory pool. Ollama leverages these for both prefill and decode.

The measured results on M5 hardware:

General workloads: 30–60% faster than the equivalent M4 chip with MLX
Prompt processing specifically: 3–4× faster thanks to the Neural Accelerators handling matrix math in silicon rather than GPU shader code

If you’re running 35B MoE models interactively and prompt processing latency matters (long system prompts, document QA), the M5 generation gap is meaningful.

Which Models Are MLX-Accelerated Right Now

This is the key constraint of the preview release: only Qwen3.5-35B-A3B is MLX-accelerated in Ollama 0.19.

That’s Alibaba’s 35B mixture-of-experts model — 35B total parameters but only 3B active per token due to the MoE routing. At Q4 quantization it fits in 20GB, runs fast relative to its reasoning capability, and is a strong choice for coding and analysis tasks. If you already use Qwen3.5, Ollama MLX is an immediate, free upgrade.
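The memory figure follows from the parameter count. A minimal sketch of the arithmetic, assuming exactly 4 bits per weight; the function name weights_gb is ours:

```shell
# Approximate weight footprint in GB: parameters x bits per weight / 8.
# Usage: weights_gb TOTAL_PARAMS_BILLIONS BITS_PER_WEIGHT
weights_gb() {
  awk -v params="$1" -v bits="$2" \
    'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

# All 35B parameters must be resident, even though only 3B are active
# for any given token.
weights_gb 35 4
```

That gives 17.5GB for the raw weights. The gap up to the roughly 20GB quoted above is plausibly quantization metadata and runtime buffers, though we have not measured that split.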

For everything else — Llama 4, Mistral Small 4, Phi-4, Gemma 3 — you’re still on llama.cpp Metal.

What’s coming next:

Ollama 0.20: Gemma 4 MLX support confirmed
Later releases: Llama 4, Mistral, and Phi are the obvious next architectures — no official timeline

To check which backend is handling a specific model, run ollama serve in one terminal and watch the logs as you start inference in another. The using mlx backend message appears per-request when MLX is active.

The Case for 8GB and 16GB Mac Owners

Nothing in Ollama 0.19 changes the situation for sub-32GB Macs. The Metal backend is unchanged and still the default when MLX prerequisites aren’t met.

A MacBook Air M4 with 16GB runs 7B–13B models at 25–40 tok/s via llama.cpp Metal — fast enough for interactive use and comfortable for most single-user workflows. That hasn’t gotten worse; it just hasn’t gotten the MLX boost.

The question is whether to upgrade for MLX. The math:

7B–14B models: No reason to upgrade for MLX specifically. The Metal backend handles these well and a 7B model runs fine in 8GB.
35B MoE models: This is the exact use case the 32GB floor was designed for. Going from 45 tok/s to 75+ tok/s on an M4 Pro is where the upgrade pays off.
70B models: You need 40–80GB depending on quantization. That’s Mac Studio M4 Max territory ($1,999+) or a serious MacBook Pro.

If you’re running 35B+ models regularly and aren’t ready to spend $1,800 on Apple Silicon hardware right now, RunPod rents A40 and A100 instances (48–80GB VRAM) by the hour. It’s a practical way to access larger models while you decide if local hardware makes economic sense for your workload volume.

What Ollama MLX Doesn’t Fix

A few limitations worth knowing before you enable it:

Model support is limited in preview. Only Qwen3.5-35B-A3B today. If your workflow runs Llama 4, Mistral, or Phi, the flag does nothing useful for you yet.

MLX is Apple-only. If you have a mixed workflow (Mac for development, Linux server for production inference), MLX doesn’t help on the server side. The Ollama API itself is unchanged, and clients that use Ollama’s OpenAI-compatible endpoint can usually be pointed at a remote vLLM server with little more than a base-URL change, but the MLX optimization itself doesn’t port.

Long context eats your headroom. At 32GB with a 20GB model, a 12GB KV cache fills up at roughly 90K tokens of context. For most chat use cases that’s fine; for document analysis with large inputs, watch your memory.

24GB is below the floor. A 24GB MacBook Air M4 does not qualify: Ollama’s preview requires 32GB or more of unified memory. The 32GB Mac Mini M4 is the cheapest entry point.
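The long-context limit above can be scaled to other memory sizes with simple arithmetic. This sketch takes the data point from this section (12GB of headroom fills at roughly 90K tokens) and assumes KV cache size grows linearly with context length. It ignores the memory macOS and other apps need, and any cap from the model's own context window. The function name max_context_k is ours:

```shell
# Rough context capacity in thousands of tokens, scaled from the
# data point "12GB of headroom ~ 90K tokens".
# Usage: max_context_k TOTAL_GB MODEL_GB
max_context_k() {
  awk -v total="$1" -v model="$2" \
    'BEGIN { printf "%.0f\n", (total - model) * 90 / 12 }'
}

max_context_k 32 20
max_context_k 48 20
max_context_k 64 20
```

Under those assumptions a 48GB machine has room for a bit more than twice the context of a 32GB one, which is the practical argument for the larger configuration if you do document analysis.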

See our earlier breakdown of Ollama vs. LM Studio vs. llama.cpp for the baseline comparison that still applies on non-qualifying hardware.

Frequently Asked Questions

Does Ollama MLX work on Intel Macs? No. MLX is Apple Silicon only — it requires the unified memory architecture present in every M-series chip. Intel Macs continue to use the CPU path.

Will enabling OLLAMA_USE_MLX=1 break my existing setup or Open WebUI? No. The Ollama API is identical regardless of backend. If you use Open WebUI, Continue.dev, or any other frontend that talks to the Ollama REST API, enable MLX on the server side and all clients benefit automatically.

Why does Ollama require 32GB when standalone mlx-lm runs on 16GB? The standalone mlx-lm Python package can run smaller models (7B–14B) via MLX on 16GB machines. Ollama 0.19’s preview targets Qwen3.5-35B-A3B specifically, which needs ~20GB at Q4. The 32GB threshold reflects the minimum practical headroom for that specific model, not a fundamental MLX hardware limit.

My server log shows “using mlx backend” but speed seems the same. Why? If the log line appears for the request you are timing, MLX is active, so the likely culprit is the baseline you are comparing against. Measure rather than eyeball it: run the same prompt with ollama run --verbose, which prints an eval rate in tokens per second, once with OLLAMA_USE_MLX=1 and once without. On an M4 Pro, 70+ tok/s on the 35B model is noticeably faster than 45 tok/s.

When will Llama 4 be MLX-accelerated in Ollama? No official timeline from Ollama. Based on the pace of 0.19 → 0.20 changes, expect 2–3 more releases before major architecture coverage expands. Track the Ollama GitHub releases page for updates.
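For the speed questions above, a number beats a feeling. Ollama's /api/generate response includes eval_count and eval_duration (in nanoseconds), and dividing one by the other gives decode speed. A minimal sketch: the helper name tok_per_sec is ours, the commented example needs curl and jq, and the model tag is a placeholder to replace with whatever ollama list shows on your machine:

```shell
# Convert Ollama's response counters into tokens per second.
# eval_duration is reported in nanoseconds.
# Usage: tok_per_sec EVAL_COUNT EVAL_DURATION_NS
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Example against a local server:
#   resp=$(curl -s http://localhost:11434/api/generate \
#     -d '{"model": "YOUR_MODEL_TAG", "prompt": "Explain unified memory.", "stream": false}')
#   tok_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```

Run it once with the flag set and once without, using the same prompt, and compare the two numbers.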

Last updated June 2, 2026. Hardware prices and Ollama release numbers change frequently; verify current specs before purchasing.

Recommended Gear

Mac Mini M4 Pro 48GB — best entry point for Ollama MLX with room to run Qwen3.5-35B-A3B comfortably
Mac Studio M4 Max — for 64GB+ unified memory and higher bandwidth when 48GB isn’t enough
