Jun 16, 2026

MiniMax M3 Local AI Hardware Guide 2026: The 428B Open-Weight Model You (Probably) Can't Run at Home

By RunAIHome Team · 12 min read

minimax-m3local-llmvrammoegpumac-studio

TL;DR: MiniMax M3 is a 428B-parameter Mixture-of-Experts model (~23B active) with frontier-tier coding scores and a 1M-token context. The honest problem: even a 4-bit GGUF is ~265GB, and the one machine that could have run it — a maxed Mac Studio — got gutted by the 2026 DRAM shortage. For almost every home lab, the API is the move; wait for community distills before buying hardware.

	MiniMax M3 (local)	MiniMax M3 (API)	A “runnable” local model (Qwen3.6 35B-A3B)
Best for	Labs with 256GB+ RAM already on hand	Everyone who wants M3’s quality today	Anyone with a single 24GB GPU
Price / Cost	$10k+ in hardware, if you can source the RAM	$0.30/$1.20 per M tokens (launch promo)	~$1,070 used RTX 3090
The catch	Q4 GGUF is 265GB — no single consumer box fits it	Not local; data leaves your machine	Not frontier-tier, but actually runs

Honest take: M3 is a genuinely impressive open-weight model, but in June 2026 it’s a data-center model wearing an “open weights” badge. Run it on the API, keep your local stack on Qwen3.6 or Gemma 4, and revisit when memory prices fall or a distilled M3 lands.

What MiniMax M3 actually is

MiniMax released M3 on June 1, 2026. It’s an open-weight Mixture-of-Experts model with roughly 428B total parameters and ~23B activated per token, spread across 256 fine-grained experts. The headline feature is a 1-million-token context window (1,048,576 tokens, with up to 512K output tokens) made practical by a new attention design MiniMax calls MSA — MiniMax Sparse Attention.

MSA keeps a Grouped-Query Attention backbone and layers block-level sparse selection on top of real, uncompressed key-values. MiniMax reports more than 9× faster prefill and more than 15× faster decoding at 1M context versus the previous M2 generation, with per-token compute cut to roughly 1/20th. That’s the part that matters for anyone thinking about long-context agentic work: the speedup isn’t from a smaller model, it’s from not attending to every token.

On benchmarks, M3 punches at the frontier. It scores 59.0% on SWE-Bench Pro, 66.0% on Terminal-Bench 2.1, 74.2% on MCP Atlas, and 83.5 on BrowseComp — the last figure beating Claude Opus 4.7’s 79.3 on autonomous browsing. It edges past GPT-5.5 and Gemini 3.1 Pro on the coding/agent metrics while sitting just below Claude Opus 4.8 overall.

One important correction, because a lot of the early write-ups got this wrong: M3 is not the 229B model. That was MiniMax M2.7. M3 is the bigger 428B MoE with native multimodal input (text, image, and video). If a guide tells you M3 is 229.9B/9.8B-active, it’s quoting the wrong generation.

The number that breaks the dream: 265GB

Here’s where the “open weight” story collides with physics. The full BF16 weights are about 855GB. Nobody runs that at home. So the question is what the quantized GGUFs look like — and Unsloth has published Dynamic 2.0 quants for exactly this model.

Quant (Unsloth Dynamic 2.0)	Disk / memory footprint	What can hold it
UD-IQ1_M (1-bit)	~128 GB	Quality falls off a cliff; not recommended
UD-Q2_K_XL (2-bit)	~143 GB	Needs 192GB+ RAM realistically
UD-Q4_K_XL (4-bit)	~265 GB	The “real” quant — needs 320GB+
Q8_0 / UD-Q8_K_XL	~453–464 GB	Multi-GPU server territory

For local LLM work, Q4_K_M-class quantization is the floor for keeping a model’s quality intact. For M3 that’s the 265GB UD-Q4_K_XL file — and that’s just weights, before KV cache and context allocation. Push toward long context and you’re adding tens of gigabytes on top.

To put 265GB in perspective: that’s eleven RTX 3090s’ worth of VRAM (24GB each), and you’d want a twelfth for headroom. At June 2026 used prices — a RTX 3090 averages around $1,070, with listings ranging $966–$1,189 — that’s roughly $12,000–$13,000 in GPUs alone, before the motherboard, PCIe risers, PSUs, and the power bill to feed twelve 350W cards. Even then, llama.cpp’s NVIDIA support for M3 is preliminary, and naive PCIe sharding of a model this size is slow.

If you’ve been following the GDDR7 shortage and NVIDIA’s consumer GPU freeze, you already know this is the worst possible moment to be buying twelve high-VRAM cards.

The Mac angle — and why it just collapsed

For a model this size, the usual home-lab answer is unified memory: one Apple Silicon box with a huge RAM pool that the GPU can address directly. A 256GB or 512GB Mac Studio used to be the cleanest way to run a 400B-class MoE without a GPU farm.

That option is gone as of June 2026. Here’s the timeline:

March 2026: Apple pulled the 512GB unified-memory option from the Mac Studio M3 Ultra and raised the 256GB upgrade price by $400, citing the same AI-driven DRAM squeeze we covered in the DDR5 and SSD price surge.
May 2026: Apple removed the 256GB option too.
June 2026: The M3 Ultra Mac Studio ships with 96GB as its only memory configuration.

So the device that was supposed to be the answer — a maxed Mac Studio — now tops out at 96GB. That doesn’t fit even the 143GB 2-bit quant, let alone the 265GB Q4. And to be clear about the spec confusion floating around: there is no “Mac Studio M4 Ultra.” Apple shipped the 2025 Mac Studio with an M4 Max base and an M3 Ultra at the top; there was never an M4 Ultra SKU. Any guide promising “M3 at Q4 on an M4 Ultra 192GB” is describing a machine that doesn’t exist running a quant that wouldn’t fit if it did.

The M4 Max Mac Studio starts at $1,999 but caps even lower on memory. For context on where Apple Silicon genuinely shines for local AI, our Mac Studio M4 Max vs Mac Mini M4 Pro guide covers the models you can actually buy and run today.

What speed should you even expect?

Throughput data for M3 specifically is thin this early, but the architecture tells you most of what you need. Decode speed on any quantized LLM is memory-bandwidth-bound, not compute-bound — the reason NPUs with big TOPS numbers still lose on tokens/second. With ~23B active parameters per token, M3 decodes like a ~23B model per step, but every token still has to stream the active expert weights out of memory.

For a concrete reference point: the previous-gen MiniMax M2.5 (229B) at Q4_K_M was clocked at roughly 12 tokens/sec on an RTX PRO 6000 Blackwell — a 96GB workstation card. M3 is nearly twice the total size, so even on hardware that can hold a usable quant, expect low-double-digit tokens/sec at best, dropping further as you fill that 1M context. That’s usable for batch/agentic work, painful for interactive chat. (More on where that 96GB ceiling sits in our RTX PRO 6000 Blackwell deep dive.)

If you genuinely want to run it locally

Say you already have a 256GB-plus RAM server or a multi-GPU rig and you want to try. Here’s the honest setup path as of mid-June 2026.

1. llama.cpp support is preliminary. M3 isn’t in a released llama.cpp build yet. You build from the open pull request:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24523/head:minimax-m3
git checkout minimax-m3
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server

2. Pull the Unsloth Dynamic GGUF. For a CPU-RAM-heavy box, UD-Q2_K_XL (143GB) is the realistic entry; UD-Q4_K_XL (265GB) if you have the memory. Expect to keep most weights in system RAM and offload only what fits in VRAM:

./build/bin/llama-cli \
  --model MiniMax-M3-UD-Q2_K_XL.gguf \
  --threads 32 \
  --ctx-size 32768 \
  --n-gpu-layers 8 \
  --temp 1.0 --top-p 0.95 --top-k 40

MiniMax’s recommended sampling is temperature=1.0, top_p=0.95, top_k=40. Start --ctx-size small (32K) and grow it — the KV cache for 1M context is enormous and will OOM a borderline rig instantly.

3. Watch the gotchas. Early Dynamic-quant releases of MiniMax MoE models have had reports of garbled “thinking” output under specific CUDA versions. If reasoning traces come out as noise, drop to a known-good quant or pin your CUDA toolkit version, and check the Unsloth model card’s discussions before burning hours on a download this large.

If you don’t have the hardware but want to experiment without committing $12k, rent it: a multi-GPU instance on RunPod lets you load a Q4 GGUF for a few dollars an hour and decide whether M3’s quality justifies a permanent build. For a 400B-class model, “try before you buy the GPU farm” is the only sane order of operations.

The math that actually wins: just use the API

This is the uncomfortable conclusion for a local-AI site to write, but the numbers are the numbers. MiniMax M3’s API costs $0.60 input / $2.40 output per million tokens at regular pricing — and $0.30 / $1.20 during the launch promo. Compare that to GPT-5.5 ($5/$30) and Claude Opus 4.8 ($6.25/$25): M3 delivers comparable coding/agent quality at roughly 8–20% of the cost.

Run the breakeven. Twelve used RTX 3090s plus a platform to host them is ~$13,000–$15,000, draws ~2–3kW under load, and gets you maybe 12 tokens/sec. At $1.20 per million output tokens, that $13,000 buys you over 10 billion output tokens through the API — years of heavy agentic use — with zero electricity, zero PCIe debugging, and full 1M context working out of the box. The local build only wins on one axis: data never leaving your machine. If that’s a hard compliance requirement, the math changes. For everyone else, it doesn’t.

This is the same calculus we ran for Kimi K2.6 and the EXO distributed-inference experiment: once a model crosses ~200GB at usable quant, “local” stops meaning “affordable” and starts meaning “data-center-at-home.”

What to actually run locally instead

If your goal is a private, local coding/agent stack — not specifically this model — the answer hasn’t changed. On a single 24GB card like a used RTX 3090 or a RTX 5090, Qwen3.6 35B-A3B runs at ~120 tok/s and Gemma 4 31B gives you dense-model depth. Both are genuinely local, genuinely fast, and genuinely free to run. Our open-source LLM shootout breaks down which family fits 8GB, 16GB, and 24GB.

M3 belongs in your toolbox — through the API — for the jobs where its 1M context and frontier agentic scores actually earn their keep. Keep it off your shopping list until either community distills shrink it to a single-GPU footprint, or DRAM prices fall far enough that a 256GB+ box stops costing as much as a used car.

FAQ

Can I run MiniMax M3 on a 24GB GPU like an RTX 3090 or 4090? No. Even the 2-bit GGUF is ~143GB and the usable 4-bit quant is ~265GB. A single 24GB card can’t hold a small fraction of it. You’d need roughly a dozen 24GB GPUs, or a unified-memory box with 256GB+ — which Apple no longer sells.

Why does everyone say M3 is 229B parameters? They’re confusing it with MiniMax M2.7, which was the 229B model. M3 is the newer ~428B MoE (~23B active) with native multimodal input and a 1M context window.

Is MiniMax M3 actually open source? It’s open-weight. MiniMax pledged to release downloadable weights and a technical report within 10 days of the June 1 launch, likely under a modified-MIT-style license. As of mid-June the exact license terms were still being finalized — “open weights” here means you can download and self-host, not necessarily unrestricted commercial use with training data included.

What’s the cheapest realistic way to use M3 today? The API. At $0.30/$1.20 per million tokens (launch promo) it’s roughly a tenth of GPT-5.5 or Claude Opus 4.8 pricing. For local experimentation without a permanent build, rent a multi-GPU instance on RunPod by the hour.

Will there be a smaller M3 that runs on consumer hardware? MiniMax hasn’t announced one, but the pattern across the open-weight world (distills, “mini” variants, community quants) makes it likely. If a 30–70B-class M3 distill appears, that’s the version home-lab builders should watch for.

Recommended Gear

RTX 3090 (used, 24GB) — still the value pick for a single-GPU local stack, ~$1,070 in June 2026.
RTX 5090 — current-gen 32GB option for faster single-card inference.
RTX PRO 6000 Blackwell — 96GB workstation card; the realistic single-GPU ceiling before multi-GPU.
Mac Studio M3 Ultra / Mac Studio M4 Max — unified-memory boxes, now memory-capped by the 2026 DRAM squeeze.

Sources

Last updated June 16, 2026. Prices, quant sizes, and licensing terms change; verify current rates and the official MiniMax weight release before purchasing hardware.

Was this article helpful?