How to find the best local LLM for your hardware: 5 benchmark tools compared (2026)


TL;DR: Picking a local LLM by parameter count is the wrong signal — a well-quantized 14B can outperform a crushed 27B, and a model that barely fits your VRAM will stall at under 10 tok/s. These five tools automate the math: whichllm ranks what to run in one command, LocalScore measures how fast your hardware actually is, and llama-bench gives you the raw throughput numbers to validate both.

| | whichllm | LocalScore | llama-bench |
|---|---|---|---|
| Best for | “What model should I run?” | “How fast is my actual chip?” | Raw tok/s baseline for any config |
| Input required | Your GPU (auto-detected) | GPU + GGUF model file | Any compiled GGUF + llama.cpp |
| Output | Ranked model list with quant | PP speed, TG speed, TTFT | tok/s table across batch/thread configs |
| The catch | Scores rely on merged leaderboards, not local runs | Single-GPU setups only | No quality signal, speed only |

Honest take: Run whichllm first, get a ranked list in under 10 seconds, then validate the top pick’s tok/s with llama-bench on your machine before committing to a multi-GB download.


The “fit the biggest model your VRAM holds” heuristic has two failure modes. First, a 14B Q3 can outperform a 7B Q8 on general reasoning and lose badly on code — parameter count is not a quality proxy once quantization enters the picture. Second, a model that barely squeezes into 8GB at Q4 will offload key-value cache to system RAM when context grows past a few thousand tokens, dropping you from 40 tok/s to under 10 tok/s mid-conversation.
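The second failure mode can be estimated with back-of-the-envelope arithmetic. The sketch below assumes the standard transformer layout (one key and one value vector per layer per token, stored in 16-bit precision); the layer count, KV-head count, and head dimension are hypothetical values chosen for illustration, not the shape of any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape, for illustration only
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache")
```

The cache grows linearly with context, so a model whose weights leave only a few hundred megabytes of free VRAM runs out of room quickly as the conversation gets longer.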

What you actually need is a three-part filter: quality score from real-world evals, verified VRAM fit at your preferred quantization, and measured tokens per second on your specific chip. The five tools below cover that stack.

1. whichllm — one command, ranked results

whichllm is a Python CLI that auto-detects your GPU, CPU, and RAM, then ranks HuggingFace models by a merged benchmark score rather than parameter count. It reached v0.5.7 on May 19, 2026, and has gained 2,000 GitHub stars since its March 2026 launch.

Install (pick any):

uvx whichllm@latest       # one-off run, no persistent install
uv tool install whichllm  # persistent
pip install whichllm
brew install andyyyy64/whichllm/whichllm

Hardware auto-detected: NVIDIA GPUs via nvidia-ml-py, AMD GPUs via ROCm/dbgpu, Apple Silicon via Metal, plus CPU core count, system RAM, and available disk space.

How the 0–100 score is built:

  • LiveBench, Artificial Analysis Index, and Aider scores (live-merged, highest weight)
  • Chatbot Arena ELO and Open LLM Leaderboard v2 (frozen, lower recency weight)
  • A log₂-scaled model-size bonus as a knowledge proxy
  • A quantization penalty — lower-bit variants take a multiplicative hit
  • A runtime-fit penalty: partial offload (layers spilling to system RAM) scores 0.72×, CPU-only runs score 0.50×
  • Speed adjustment: ±8 points based on estimated tok/s performance

Those last two factors matter. A model that gets 22 tok/s on your 8GB card scores lower than the same model would on a 24GB card running it fully on-GPU. The model has not changed, but partial offload degrades the experience in a way pure-quality benchmarks miss.
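To make the effect of those penalties concrete, here is a toy calculation. It is not whichllm's actual formula: only the 0.72× and 0.50× fit multipliers and the ±8 speed range come from the list above, while the function name, the quantization multiplier argument, and the order in which the terms are combined are assumptions made for illustration.

```python
# Fit multipliers quoted in the article; everything else here is hypothetical
FIT_MULTIPLIER = {"full_gpu": 1.00, "partial_offload": 0.72, "cpu_only": 0.50}


def adjusted_score(base_quality, quant_multiplier, fit, speed_bonus):
    """Toy 0-100 score: quality scaled by quant and fit penalties, plus a speed term."""
    speed_bonus = max(-8.0, min(8.0, speed_bonus))  # clamp to the +/-8 range
    score = base_quality * quant_multiplier * FIT_MULTIPLIER[fit] + speed_bonus
    return max(0.0, min(100.0, score))


# Same model and quant, different runtime fit
print(adjusted_score(90.0, 1.0, "full_gpu", 0.0))
print(adjusted_score(90.0, 1.0, "partial_offload", 0.0))
```

Even in this simplified form, a multiplicative fit penalty moves the result far more than the additive speed term can, which is why a model that spills into system RAM drops down the ranking.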

Real results by GPU (May 2026 snapshot):

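| GPU | Top pick | Quant | Tok/s | Score |
|---|---|---|---|---|
| RTX 5090 32 GB | Qwen3.6-27B | Q6_K | ~40 | 94.7 |
| RTX 4090 24 GB | Qwen3.6-27B | Q5_K_M | ~27 | 92.8 |
| RTX 4090 24 GB (alt) | Qwen3-32B | Q4_K_M | ~31 | 83.0 |
| RTX 4060 8 GB | Qwen3-14B | Q3_K_M | ~22 | 71.0 |
| Apple M3 Max 36 GB | Qwen3.6-27B | Q5_K_M | ~9 | 89.4 |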

The gap between the 8GB card (71.0) and the 4090 (92.8) reflects both the model quality ceiling and the Q3 quant penalty — not purely chip speed. An 8GB owner running a Q3 14B gets measurably worse reasoning quality than a 24GB owner running a Q5 27B, independent of tok/s. If you’re deciding whether the 16GB step-up is worth $50 on the RTX 5060 Ti, that quality difference is the actual argument — see the full breakdown at /blog/rtx-5060-ti-8gb-vs-16gb-local-ai-2026/.

The honest limitation: whichllm derives its scores from community leaderboard data, not from tests run on your machine. It gives you the best model for your hardware class; your specific chip, driver version, and cooling headroom may produce different throughput numbers. Use it to shortlist, then validate with llama-bench.

2. LocalScore — measure your specific chip

LocalScore is a Mozilla Builders project that runs a standardized test battery on a GGUF model you supply, then (optionally) uploads your result to the community database at localscore.ai. It’s built on Llamafile, itself a portable wrapper around llama.cpp, which gives it cross-platform coverage on Windows, Linux, and macOS without requiring a CUDA compile.

Three metrics it measures:

  1. Prompt Processing (PP) — tokens per second ingesting context. Matters for RAG pipelines with long document chunks and multi-turn conversations with large history.
  2. Token Generation (TG) — tokens per second producing output. This is the number users feel as “speed.”
  3. Time to First Token (TTFT) — milliseconds before the first character appears. Critical for interactive use; high TTFT makes a fast model feel slow.

These combine into a single LocalScore number you can compare against the community database. Before benchmarking anything yourself, search your GPU model on localscore.ai — there’s a good chance someone has already measured the model you’re evaluating on identical or similar hardware.

Limitation: LocalScore supports single-GPU setups only. Multi-GPU NVLink configs and CPU+GPU hybrid inference are outside its current scope.

For open-source tooling in this space more broadly, aifoss.dev tracks LocalScore alongside other self-hosted AI benchmarking projects.

3. llama-bench — the raw throughput baseline

llama-bench ships inside llama.cpp and is the closest thing to a ground-truth speed measurement for single-process inference. If you have llama.cpp compiled, you already have it at ./llama-bench.

# Minimal: tests both prompt processing and generation
./llama-bench -m model.gguf -ngl 99

# Test multiple batch sizes in one pass
./llama-bench -m model.gguf -ngl 99 -b 128,256,512

# Set the repetition count explicitly (llama-bench averages 5 runs by default)
./llama-bench -m model.gguf -ngl 99 --repetitions 3

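-ngl 99 asks llama.cpp to offload all layers to the GPU (any value at or above the model’s layer count does the same). A run with fewer layers on the GPU measures a hybrid CPU+GPU config, which may look fine but won’t tell you what a fully offloaded inference run actually does. Pass the flag explicitly on every run for apples-to-apples comparisons.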

Output reports pp512 (prompt processing of a 512-token prompt, in tok/s) and tg128 (generation of 128 output tokens, in tok/s), each with standard deviation across repetitions. These are the numbers most hardware reviewers publish, which makes them the most useful for cross-referencing community benchmarks.
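If you want to log results instead of reading them off the terminal, llama-bench can emit JSON with -o json. The sketch below assumes each result object carries n_prompt, n_gen, avg_ts, and stddev_ts fields; check those names against the output of your own build. The summarize helper is hypothetical and the sample values are made up for illustration, not measured.

```python
import json


def summarize(bench_json):
    """Return {test_name: (avg tok/s, stddev)} from llama-bench JSON output."""
    results = {}
    for run in json.loads(bench_json):
        if run["n_prompt"] and not run["n_gen"]:
            name = f"pp{run['n_prompt']}"  # prompt-processing test
        elif run["n_gen"] and not run["n_prompt"]:
            name = f"tg{run['n_gen']}"  # token-generation test
        else:
            name = f"pp{run['n_prompt']}+tg{run['n_gen']}"
        results[name] = (run["avg_ts"], run["stddev_ts"])
    return results


# Made-up sample, not a real measurement
SAMPLE = """[
  {"n_prompt": 512, "n_gen": 0, "avg_ts": 1000.0, "stddev_ts": 5.0},
  {"n_prompt": 0, "n_gen": 128, "avg_ts": 30.0, "stddev_ts": 0.5}
]"""

for test, (avg, std) in summarize(SAMPLE).items():
    print(f"{test}: {avg:.1f} tok/s (stddev {std:.1f})")
```

To feed it real data, redirect a run such as ./llama-bench -m model.gguf -ngl 99 -o json into a file and pass the file contents to summarize.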

What llama-bench won’t tell you: quality. A Q2 model might generate 80 tok/s while producing garbled answers. Always pair throughput numbers with a quality benchmark like whichllm’s score, or run a quick LiveCodeBench sample before declaring a setup usable.

4. llama-benchy — benchmark across inference backends

llama-benchy addresses a gap that matters for anyone comparing serving backends: llama-bench only works with llama.cpp directly. If you want equivalent throughput numbers for Ollama, vLLM, or SGLang on the same model and hardware, you need a different tool.

pip install llama-benchy
# or via Docker
docker run hellohal2064/llama-benchy

llama-benchy auto-detects which backend is running and sends standardized requests to each, measuring client-perceived latency rather than the internal C++ timing that llama-bench captures. On a fast GPU, Ollama’s API server layer typically adds 5–15% overhead compared to raw llama.cpp — a difference llama-bench’s direct measurement misses entirely.

If you’re deciding between vLLM and Ollama (the real choice at 70B+ models and multi-user concurrency — full numbers at /blog/vllm-vs-ollama-when-each-wins-2026/), llama-benchy is how you validate that choice on your own hardware configuration rather than trusting someone else’s benchmark run on a different driver version and OS.

5. ollama-benchmark — quick Ollama-native testing

Already running Ollama and don’t want to touch a GGUF file or compile anything? aidatatools/ollama-benchmark is a Streamlit web app that queries Ollama’s local API and reports output tok/s, prompt processing tok/s, and average request latency for any model already downloaded with ollama pull.

It’s the fastest path from “I want to compare two models” to actual numbers, with no file management required. The trade-off: you’re measuring Ollama’s full stack, including the API overhead. Numbers will run 5–15% lower than raw llama.cpp for the same model on the same hardware — acceptable for relative comparisons within Ollama, but don’t cite them directly against llama-bench results as equivalent.
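For a rough idea of where such numbers come from, Ollama's /api/generate response includes token counts and durations in nanoseconds (eval_count, eval_duration, prompt_eval_count, prompt_eval_duration). The helper below is a hypothetical sketch that turns those fields into tok/s; it is not how ollama-benchmark itself is implemented, and the sample response is invented for illustration.

```python
NS_PER_SECOND = 1e9


def ollama_rates(response):
    """Compute generation and prompt-processing tok/s from an Ollama response dict."""
    gen = response["eval_count"] / (response["eval_duration"] / NS_PER_SECOND)
    prompt = response["prompt_eval_count"] / (
        response["prompt_eval_duration"] / NS_PER_SECOND
    )
    return {"tg_tok_s": gen, "pp_tok_s": prompt}


# Invented sample response, not a real measurement
sample = {
    "eval_count": 128,
    "eval_duration": 4_000_000_000,
    "prompt_eval_count": 512,
    "prompt_eval_duration": 500_000_000,
}
print(ollama_rates(sample))
```

These durations are reported by the Ollama server itself. Timing the request from the client side, as a wall-clock measurement, would additionally include HTTP and queueing time.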

What each GPU tier realistically runs

The tools above give you personalized numbers; here’s the general map to validate against before downloading anything:

| GPU | VRAM | Comfortable ceiling | Typical tok/s range |
|---|---|---|---|
| RTX 4060 / 4060 Ti 8 GB | 8 GB | 7–8B Q4_K_M | 40–65 tok/s |
| RTX 4060 Ti 16 GB | 16 GB | 14B Q5_K_M | 35–50 tok/s |
| RTX 3090 24 GB | 24 GB | 27B Q5_K_M | 20–30 tok/s |
| RTX 4090 24 GB | 24 GB | 27B Q6_K or 32B Q4_K_M | 27–31 tok/s |
| RTX 5090 32 GB | 32 GB | 27B Q6_K (no offload) | ~40 tok/s |
| Mac Mini M4 Pro 24 GB | 24 GB unified | 14B Q6_K | 25–35 tok/s |

A note on 70B models and single 24GB cards: you can technically load Llama 3.1 70B at Q2 on a 24GB card, but Q2 degrades the model’s reasoning quality to roughly the level of a well-quantized 30B. The used RTX 3090 is still the best value at the 24GB tier, but if 70B is your actual target, two cards in NVLink or an upgrade to Mac Studio is the practical path, not a tighter quant.

No GPU at all? RunPod rents RTX 4090 instances from $0.34/hr (Community Cloud) to $0.69/hr (Secure Cloud) — a practical way to benchmark a 27B model before committing to a hardware purchase. Run llama-bench on the rented instance to confirm tok/s matches your expectations, then buy the hardware with confidence.

The quantization sweet spot

Before benchmarking anything, pick your quantization correctly. The numbers here are well-established:

  • Q8_0: near-lossless (approximately 0.01 perplexity increase versus F16), roughly half the file size of F16, the right choice when VRAM isn’t limiting you
  • Q5_K_M: strong quality, noticeably smaller than Q8, the practical step down when VRAM is tight
  • Q4_K_M: roughly 75% smaller than F16 with approximately 3.5% quality loss — the default for most GPU tiers
  • Q3_K_M: noticeable degradation, use only when VRAM forces the issue
  • Q2: significant quality hit, last resort for fitting oversized models on undersized cards

The consistent finding from 2026 community benchmarks: the quality cliff is between Q3 and Q4, not between Q4 and Q8. Going from Q4_K_M to Q8_0 costs real VRAM for minimal quality gain. Going from Q4_K_M to Q3_K_M costs real quality for moderate VRAM savings. Q4_K_M is the sensible default for a reason.
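A quick way to sanity-check whether a given quant fits is to multiply parameter count by bits per weight. The bits-per-weight figures below are approximate values assumed for illustration (K-quants mix several bit widths, so real GGUF files vary by model), and the 1 GB headroom for KV cache and runtime buffers is likewise a placeholder, not a measured number.

```python
# Approximate bits per weight; assumed values, real files vary by model
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}


def weights_gb(params_billion, quant):
    """Estimated size of the model weights in GB (10^9 bytes)."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


def fits(params_billion, quant, vram_gb, headroom_gb=1.0):
    """True if the weights plus a fixed headroom fit in the given VRAM."""
    return weights_gb(params_billion, quant) + headroom_gb <= vram_gb


for quant in ("Q5_K_M", "Q4_K_M", "Q3_K_M"):
    size = weights_gb(14, quant)
    print(f"14B {quant}: ~{size:.1f} GB, fits in 8 GB: {fits(14, quant, 8)}")
```

Treat the result as a first filter only. Long contexts need more headroom than the placeholder used here, so confirm the fit with an actual run.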

For coding specifically, the step from Q4 to Q5 is more noticeable than in general reasoning — if code generation is your primary use case, the extra VRAM cost of Q5_K_M is usually justified. Full perplexity and quality-loss numbers across quantization levels at /blog/quantization-q4-q5-q6-q8-quality-loss-2026/.

The practical workflow

  1. Run uvx whichllm@latest — note the top-ranked model and quant for your GPU tier. Takes under 10 seconds.
  2. Search localscore.ai for your GPU + that model combination. If a community result exists, you have a validated tok/s estimate before downloading anything.
  3. Pull or download the model in the recommended quant. For Ollama: ollama pull <model>:<quant-tag>. For llama.cpp: grab the GGUF from HuggingFace.
  4. Run llama-bench or ollama-benchmark to confirm actual tok/s on your machine matches the estimate.
  5. If tok/s is lower than expected, check for partial CPU offloading. In Ollama: ollama ps shows GPU layer count. In llama.cpp: look for “offloaded X/Y layers” in startup output. Fix options: reduce context size, drop one quant level, or add VRAM.
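Step 5 can be automated if you capture llama.cpp's startup output. The sketch below searches for the “offloaded X/Y layers” phrase mentioned above; the surrounding log prefix differs between llama.cpp versions, so the sample line is an assumption, and offload_status is a hypothetical helper name.

```python
import re

# Matches the "offloaded X/Y layers" phrase in llama.cpp startup output
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers")


def offload_status(log_text):
    """Return (layers_on_gpu, total_layers, fully_offloaded), or None if not found."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    on_gpu, total = int(match.group(1)), int(match.group(2))
    return on_gpu, total, on_gpu == total


# Assumed example line; the exact prefix depends on your llama.cpp build
sample_log = "load_tensors: offloaded 28/41 layers to GPU"
print(offload_status(sample_log))
```

A result where the third value is False means some layers are running on the CPU, which is the usual cause of lower-than-expected tok/s.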

Frequently Asked Questions

What’s the difference between whichllm’s score and llama-bench’s tok/s? whichllm’s 0–100 score is mostly a quality signal: how well the model performs on reasoning, coding, and conversation benchmarks, adjusted for quantization, runtime fit, and estimated speed. llama-bench tok/s is a measured speed signal: how fast your hardware actually processes that model. Both matter independently. A model scoring 92 at 3 tok/s is unusable for interactive work; a model hitting 80 tok/s with a score of 40 may be too degraded for complex tasks. You need both to make an informed choice.

Why does my RTX 4090 run some 27B models slower than others of the same parameter count? Memory bandwidth is the bottleneck for token generation, not compute. The RTX 4090’s 1,008 GB/s memory bandwidth sets a ceiling on how fast it can load model weights per token. Models with different attention architectures, head counts, or MoE layouts (like a 35B MoE vs. a dense 27B) have different memory access patterns at similar parameter counts, producing different throughput numbers even when both fit fully in VRAM.
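The bandwidth ceiling can be approximated by dividing memory bandwidth by the bytes that must be read for each generated token. For a dense model that is roughly the size of the weights; for an MoE model only the active experts are read. The sketch below uses the 1,008 GB/s figure quoted above, while the 19.2 GB weight size is an assumed estimate for a 27B model at Q5_K_M, not a measured file size.

```python
def tg_ceiling_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on generation speed when memory bandwidth is the limit."""
    return bandwidth_gb_s / gb_read_per_token


RTX_4090_BANDWIDTH_GB_S = 1008.0  # figure quoted in the article
DENSE_27B_Q5_GB = 19.2  # assumed weight size, for illustration only

ceiling = tg_ceiling_tok_s(RTX_4090_BANDWIDTH_GB_S, DENSE_27B_Q5_GB)
print(f"Theoretical ceiling: {ceiling:.1f} tok/s")
```

Measured throughput lands below this bound because of KV cache reads, kernel launch overhead, and imperfect bandwidth utilization, but the ratio explains why two models of similar parameter count can differ when one reads fewer bytes per token.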

Can these tools run on Apple Silicon? Yes. whichllm auto-detects Apple Silicon via Metal and selects appropriate quants. LocalScore runs on macOS via the Llamafile binary. llama-bench compiles with Metal backend support. Apple Silicon’s unified memory means tok/s scales differently than discrete GPU: an M3 Max (36 GB) runs Qwen3.6-27B Q5_K_M at roughly 9 tok/s — slower than an RTX 4090 on the same model, but that M3 Max fits models that a 24GB discrete card would offload to system RAM.

What happened to HellaSwag and HumanEval? Both are effectively dead for ranking purposes in 2026. Models have been trained on or near their test sets, making scores unreliable as quality proxies. whichllm deliberately demotes these in its weighting. The active replacements: LiveCodeBench and BigCodeBench for coding, RULER for long-context evaluation, and LiveBench for general reasoning.

Is there a database I can browse without running anything? localscore.ai maintains a community-submitted database searchable by GPU model. The Open LLM Leaderboard v2 on HuggingFace covers quality scores but not hardware-specific throughput. For Apple Silicon tok/s by chip and model, llmcheck.net maintains a searchable database with standardized methodology.

Sources

Last updated May 28, 2026. Model rankings and hardware prices change frequently; verify current data before purchasing.

  • RTX 5090 — 32 GB VRAM, top whichllm scores on 27B models at Q6_K
  • RTX 4090 — 24 GB VRAM, best performance-per-dollar for 27B inference
  • RTX 4060 Ti 16 GB — 16 GB budget option for 14B models at Q5_K_M
  • RTX 3090 — used 24 GB option, strong value for 27B inference at Q4_K_M
  • RTX 4060 — 8 GB entry tier, comfortably runs 7–8B models at 40+ tok/s
