When NOT to Use a NAS for Local LLMs (and the 1 Case Where It Works)
Here is the thing nobody writing about “NAS for AI” will say plainly: your NAS CPU generates tokens at 1–5 tokens per second on a 7B model. A human reads at roughly 4–5 words per second, which works out to about 5–6 tokens per second at average English token density. Your NAS cannot even keep pace with your reading speed, on a small model, with no other users on the box.
That is not “slow AI.” That is broken AI.
This article is not about whether your NAS can technically install Ollama. It can. This is about whether you should — and what you should do instead.
Why NAS vendors are suddenly selling “AI”
Synology shipped its AI Tools package. QNAP has “AI Core.” Both companies have marketing pages positioning their hardware as local AI inference nodes. This is not dishonest advertising — their hardware does run LLMs — but the gap between “runs” and “usable” is enormous.
The vendors want NAS in the AI conversation because it keeps existing customers from buying a separate AI server. The result is a flood of forum posts from people who spent $600 on a DS923+ expecting it to run a responsive local AI assistant and are now staring at a terminal waiting 45 seconds for a single reply.
Understanding why requires a brief look at what actually limits LLM inference.
LLM inference is bottlenecked by memory bandwidth, not compute
When a language model generates a token, the expensive part is not arithmetic. It streams the entire set of model weights out of memory, multiplies them against a single vector of activations, and emits one token. Then it repeats the whole pass for the next token.
The result: tokens per second scales almost linearly with memory bandwidth, not CPU clock speed or core count. A processor with twice the memory bandwidth generates tokens roughly twice as fast, even at a lower clock rate.
This is why GPU inference crushes CPU inference — an RTX 3090’s GDDR6X memory delivers 936 GB/s of bandwidth. A MacBook Pro with an M3 Pro manages around 150 GB/s of unified memory bandwidth. Your NAS CPU, with its laptop-grade DDR4 in a 10W TDP envelope, is working with a fraction of that.
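You can sanity-check the table below, or your own hardware, with a back-of-envelope ceiling: divide memory bandwidth by the size of the weight file that has to be re-read for every token. A minimal sketch, assuming a ~4.2 GB Q4_K_M 7B file; real hardware lands well below this bound because of threading overhead, cache effects, and the compute that does exist:

```bash
# Rough decode ceiling: tok/s can never exceed memory_bandwidth / model_size,
# because every generated token re-reads the full weight file from memory.
awk 'BEGIN {
  model_gb = 4.2                        # typical 7B Q4_K_M file size, in GB
  bw["Celeron J4125 (1ch DDR4)"] = 19   # GB/s
  bw["Ryzen R1600 (2ch DDR4)"]   = 38
  bw["RTX 3090 (GDDR6X)"]        = 936
  for (name in bw)
    printf "%-26s ceiling ~%6.1f tok/s\n", name, bw[name] / model_gb
}'
```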
NAS CPUs: what the specs actually look like
The table below covers the processors in the most common mid-range NAS units as of May 2026.
| NAS Model | CPU | Cores / Threads | Memory Config | Max Bandwidth | Expected tok/s (7B Q4) |
|---|---|---|---|---|---|
| Synology DS920+, DS720+ | Intel Celeron J4125 | 4C / 4T, 2.0 GHz | DDR4-2400, single-channel | ~19 GB/s | 1–2.5 tok/s |
| Synology DS923+, DS1522+ | AMD Ryzen Embedded R1600 | 2C / 4T, 2.6–3.1 GHz | DDR4-2400, dual-channel | ~38 GB/s | 2.5–4 tok/s |
| QNAP TS-473A | AMD Ryzen Embedded V1500B | 4C / 8T, 2.2 GHz | DDR4-2400, dual-channel | ~38 GB/s | 4–6 tok/s |
| RTX 3090 (reference, GPU) | — | — | GDDR6X | 936 GB/s | 50–70 tok/s |
| RTX 4060 Ti 16GB (reference) | — | — | GDDR6 | 288 GB/s | 30–50 tok/s |
Sources for benchmark ranges: NAScompares (Dec 2024 NAS AI roundup), NeedToKnowIT (QNAP Ollama testing, 2024–2025). The QNAP TS-473A’s Ryzen V1500B with AVX2 SIMD support does reach the high end of the 4–6 tok/s range on small quantized models.
The ceiling is 6 tok/s on the fastest consumer NAS hardware. For a coding assistant responding to a prompt, you need 30–50 tok/s before it feels instantaneous. Even 10–15 tok/s feels sluggish for interactive chat. The NAS ceiling is 6.
The four use cases where NAS inference fails
1. Interactive chat (the main thing people want)
Below 10 tok/s, conversation feels like talking to someone who pauses for several seconds between every sentence. Below 5 tok/s, it’s genuinely irritating after the first exchange. At 1–2 tok/s — the Celeron zone — a 200-token reply takes 100 to 200 seconds.
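The arithmetic is unforgiving. A quick sketch of wall-clock time for that same 200-token reply at different decode speeds:

```bash
# Wall-clock time for a 200-token reply at various decode speeds.
for tps in 1 2 4 10 30; do
  awk -v t="$tps" 'BEGIN { printf "%2d tok/s -> %6.1f s per 200-token reply\n", t, 200 / t }'
done
```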
If your goal is a home AI assistant that feels responsive, a NAS is not the answer.
2. AI coding assistant
Code completions need to arrive in under a second to not break your flow. Tools like Continue.dev, Cline, and Aider send small prompts but expect near-instant completions. At 4 tok/s, a 50-token completion — a single function signature — takes 12 seconds. You will reach for the NAS-powered assistant once and go back to API-based tools immediately.
3. Image generation
Stable Diffusion, Flux, and SDXL do not run on CPU inference in any practical sense. A single 512×512 image at 20 steps on a Celeron J4125 can take 30–60 minutes. These workflows require a GPU. Full stop.
4. Batch document processing (where you might think it works)
Summarizing or classifying documents overnight in a queue sounds like a use case where “slow but steady” is acceptable. In practice, running llama.cpp at full load on a NAS CPU competes directly with the NAS’s own storage I/O, network throughput, and other running services. Most users report that enabling AI inference on their NAS degrades the box’s primary job — serving files — noticeably. Running a 24/7 batch job at 100% CPU on a device with passive cooling and a 10–15W thermal budget is also a fast path to reduced SSD lifespan and higher-than-expected fan noise.
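If you decide to run an occasional overnight job anyway, at least keep it from starving the NAS’s real workload. A hedged sketch, assuming llama.cpp’s llama-cli binary is available on the NAS and using placeholder model and prompt paths; -t caps the thread count, while nice and ionice push the job behind file-serving traffic:

```bash
# Throttled overnight batch run on the NAS itself (not recommended, but if you must).
# Model path and prompt file are placeholders; adjust for your own setup.
nice -n 19 ionice -c3 ./llama-cli \
  -m /volume1/models/mistral-7b-instruct-q4_k_m.gguf \
  -t 2 \
  -f prompt.txt \
  > summary.txt
```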
The 1 case where a NAS genuinely helps: model storage
Here is what a NAS does well: store large files reliably, serve them over a local network at high speed, and hold a lot of them.
A 70B model at Q4_K_M quantization is roughly 40 GB. A 32B model is ~19 GB. A 14B model is ~9 GB. If you are running a home lab with five or six models loaded on rotation, you are looking at 80–150 GB of model files that you would otherwise scatter across SSDs on multiple machines.
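Those file sizes follow from a simple rule of thumb: parameter count times bits per weight, divided by eight. A rough sketch, assuming Q4_K_M averages about 4.8 bits per weight (real files carry a little extra overhead):

```bash
# Approximate GGUF file size at Q4_K_M (~4.8 bits per weight).
awk 'BEGIN {
  bpw = 4.8
  n = split("7 14 32 70", params, " ")
  for (i = 1; i <= n; i++)
    printf "%3sB parameters -> ~%5.1f GB\n", params[i], params[i] * bpw / 8
}'
```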
A NAS with four 4TB drives gives you 12–16 TB of storage. You can keep your entire model library — dozens of GGUFs, LoRA adapters, ComfyUI checkpoints — in one place, backed up, accessible to every machine on your network.
The architecture looks like this:
```
NAS (model storage only)
├─ /models/llms/*.gguf
├─ /models/comfyui/checkpoints/
└─ /models/comfyui/loras/

Inference server (separate machine with GPU)
├─ Mounts NAS share via SMB or NFS
├─ Ollama model directory → /mnt/nas/models/llms/
└─ ComfyUI model directory → /mnt/nas/models/comfyui/
```
The inference server copies a model from the NAS into GPU VRAM once, when the model is first loaded. After that, all token generation happens in VRAM — the NAS is not involved during inference at all. Loading a 40 GB model over a gigabit LAN takes about six minutes; over 10GbE, under a minute.
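Load time is simple division: file size over effective link throughput. A quick sketch, assuming roughly 112 MB/s usable on gigabit and about 1.1 GB/s on 10GbE:

```bash
# Network load time for a model file: size / effective throughput.
awk 'BEGIN {
  size_gb = 40                 # the 70B Q4_K_M example above
  printf "gigabit: %4.1f min   10GbE: %4.1f min\n",
         size_gb * 1024 / 112 / 60, size_gb * 1024 / 1100 / 60
}'
```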
This is a real, working architecture that home lab enthusiasts use to share model libraries across a desktop workstation, a gaming PC, and a local server without duplicating 100 GB of files on every machine’s SSD.
Setting it up with Ollama
On the inference machine (Linux/macOS):
```bash
# Mount the NAS share (SMB/CIFS)
sudo mount -t cifs //nas-ip/models /mnt/nas -o username=user,password=pass

# Point Ollama at the NAS and start it
export OLLAMA_MODELS=/mnt/nas/models/llms
ollama serve
```
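If your NAS exports NFS, that is often the better protocol for large sequential reads. A sketch with a hypothetical export path, plus an /etc/fstab entry so the mount survives reboots (soft so a NAS outage does not hang the inference box):

```bash
# NFS alternative to the SMB mount above (export path is hypothetical)
sudo mount -t nfs nas-ip:/volume1/models /mnt/nas -o soft,rsize=1048576

# Persist the mount across reboots
echo 'nas-ip:/volume1/models /mnt/nas nfs soft,rsize=1048576 0 0' | sudo tee -a /etc/fstab
```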
On Windows, map the NAS share as a network drive (e.g., Z:\) and set OLLAMA_MODELS=Z:\llms in Ollama’s environment config.
The one catch: if your NAS goes offline or the mount drops, Ollama will fail to load models until it’s restored. Keep a small local SSD with one or two frequently-used models as a fallback.
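One way to automate that fallback is to check the mount before starting Ollama. A minimal sketch with hypothetical local paths:

```bash
# Use the NAS model library when it is mounted, otherwise fall back to a small local set.
if mountpoint -q /mnt/nas; then
  export OLLAMA_MODELS=/mnt/nas/models/llms
else
  export OLLAMA_MODELS=/var/lib/ollama/local-models   # keep one or two small models here
fi
ollama serve
```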
What to actually buy if you want local AI inference
If you are budget-shopping for inference capability, the same $600–700 you would spend on a mid-range NAS goes much further on GPU hardware:
- Used RTX 3090 24GB (~$650–750): 936 GB/s bandwidth, 50–70 tok/s on a 7B model, runs 30B-class models at Q4 entirely in VRAM, and can handle 70B with aggressive quantization or partial CPU offload. The single best value card for serious local AI. Full evaluation in our used RTX 3090 guide.
- RTX 4060 Ti 16GB (~$450–480 new): 288 GB/s, 16GB VRAM handles 14B models cleanly and 30B with mild offload. Better power efficiency than the 3090 (165W vs 350W). See our GPU buying guide for the full tier breakdown.
- Mac mini M4 with 32GB unified memory (~$1,100): Not a NAS, but worth mentioning — 32GB of 120 GB/s unified memory handles 30B models comfortably, and Apple Silicon’s memory architecture is well-optimized for llama.cpp.
If you already own a NAS and want to add inference capability, a used RTX 3090 in a cheap desktop (or PCIe slot in an existing tower) is the move — not buying a “better NAS.” See the RTX 5060 Ti vs RTX 3090 total cost comparison for a head-to-head on today’s best inference value options.
For how much system RAM the inference machine needs alongside its GPU, see our system RAM guide for local LLMs.
Not ready to buy hardware yet? If you want to test a 70B model or a specific fine-tune before committing $600+ to a GPU, RunPod rents RTX 4090 and A100 instances by the hour. Run your target workflow for a few dollars, confirm it actually solves your problem, and then decide whether local hardware makes sense for your usage pattern. Cloud rental is the right first step if you have any doubt about the model tier you actually need day-to-day.
Honest take
NAS hardware is excellent at what it was designed to do: reliable, always-on storage with a good network share layer and a low-power footprint. None of those properties translate into useful LLM inference.
The vendors are not lying when they say their hardware “supports AI.” They mean it technically runs. They do not mean it runs at the speed where you will actually use it.
Use your NAS if you have one. The model-storage architecture above is real and useful — particularly if you are managing model libraries across multiple machines. But if someone is asking you whether to buy a NAS for local AI, the answer is no. Buy a used GPU for the same money, pair it with a tower you already have, and keep your model files on whatever storage you already own.
The one case where NAS and AI combine well is the same case where NAS always works well: storage.
Sources
- Local AI and NAS — The Good, the Bad and the Future — NAScompares
- Ollama on QNAP NAS: Which Models Run + Real Performance — NeedToKnowIT
- Best NAS for Local LLM and AI Inference — NeedToKnowIT
- Can a NAS Run AI? What’s Actually Possible in 2026 — NeedToKnowIT
- AMD Ryzen Embedded R1600 Specifications — cpu-monkey.com
- Synology DS923+ Product Specifications — Synology Official
- Best NAS and Storage for Local AI Models — PromptQuorum
- CPU Performance Discussion — llama.cpp GitHub
Last updated May 8, 2026. Hardware prices and NAS software features change regularly; verify current specs before purchasing.