LLM VRAM Calculator — Will It Fit Your GPU? (2026)

Pick a model, a quantization level, and a context length. This tool estimates how much GPU memory the model actually needs to run inference, and tells you which consumer cards can fit it. The math is based on each model's published architecture and the real bits-per-weight of GGUF quantization formats — the method is spelled out below.

[Interactive calculator. Inputs: Model, Quantization, Context length (default 8K). Output table columns: GPU / device, VRAM, Fits?]

How this calculator works

Total VRAM for local LLM inference comes from three parts:

Model weights = parameters × bytes-per-weight. A 7B model in Q4_K_M (4.83 bits ≈ 0.60 bytes per weight) is roughly 7,000,000,000 × 0.60 bytes ≈ 4.2 GB, or about 3.9 GiB. FP16 is 2 bytes/weight; the GGUF quants use the bits-per-weight values published by the llama.cpp project (Q8_0 ≈ 8.5 bpw, Q6_K ≈ 6.56, Q5_K_M ≈ 5.67, Q4_K_M ≈ 4.83, Q3_K_M ≈ 3.91, Q2_K ≈ 3.35).
KV cache = 2 × layers × kv-heads × head-dim × context × 2 bytes (FP16 cache). The leading 2 counts the separate key and value tensors stored for every layer. The cache grows linearly with context length, which is why a 128K context can need more memory than the weights themselves. Models using grouped-query attention (GQA, most modern models) keep fewer KV heads than attention heads, so their caches are far smaller than those of older multi-head models.
Overhead — CUDA/ROCm context, the compute graph, and activation buffers. We add a flat ~1 GB, which matches what Ollama and llama.cpp typically reserve in practice.
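
The three parts above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic, not the calculator's actual source. The function names are made up for this example, the 32-layer / 8-KV-head / 128-dim architecture is an assumed stand-in for a typical 7B GQA model rather than any specific model's config, and sizes are reported in decimal GB (10^9 bytes).

```python
# Bits per weight, copied from the values quoted in the text above.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.67,
    "Q4_K_M": 4.83,
    "Q3_K_M": 3.91,
    "Q2_K": 3.35,
}

OVERHEAD_BYTES = 1e9  # flat ~1 GB allowance for runtime context and buffers


def weight_bytes(n_params, quant):
    """Model weights = parameters x bytes-per-weight."""
    return n_params * BITS_PER_WEIGHT[quant] / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache = 2 (keys and values) x layers x kv-heads x head-dim x context x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def total_vram_gb(n_params, quant, n_layers, n_kv_heads, head_dim, context_len):
    """Sum of weights, FP16 KV cache, and flat overhead, in decimal GB."""
    total = (
        weight_bytes(n_params, quant)
        + kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len)
        + OVERHEAD_BYTES
    )
    return total / 1e9


# Assumed example architecture (illustrative, not taken from a real config file).
total = total_vram_gb(
    7e9, "Q4_K_M", n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192
)
print(f"Estimated total: {total:.2f} GB")
```

With these assumed inputs, the weights come to about 4.2 GB and the 8K FP16 cache to about 1.1 GB, so the overhead allowance is a noticeable share of the total for small models.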

The fit verdict leaves ~10% headroom, because a card that is 100% full will stutter or fail to load. If a model is close to your limit, you can shrink the KV cache by lowering context, or use a smaller quant. For the full picture on choosing models by VRAM, see our best local AI models by VRAM guide and how much VRAM Llama models need.
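
As a rough sketch of the fit verdict, the check below treats "~10% headroom" as meaning the estimate must fit within 90% of the card's memory. That reading is an assumption, as are the function names and the list of card capacities, which are generic sizes rather than the calculator's device table.

```python
def fits(required_gb, card_gb, headroom=0.10):
    """True if the estimate fits while leaving `headroom` of the card free."""
    return required_gb <= card_gb * (1.0 - headroom)


# Hypothetical capacities in GB, for illustration only.
CARD_SIZES_GB = [8, 12, 16, 24]


def cards_that_fit(required_gb, card_sizes_gb=CARD_SIZES_GB):
    """Return the capacities that can hold the estimate with headroom."""
    return [size for size in card_sizes_gb if fits(required_gb, size)]


print(cards_that_fit(6.3))
print(cards_that_fit(11.5))
```

Under this rule an 11.5 GB estimate does not count as fitting a 12 GB card, because only 10.8 GB of that card is treated as usable.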

FAQ

Q: Why does my actual VRAM usage differ from this estimate?

A: Real usage depends on the inference engine, KV-cache quantization, batch size, and how much context you actually fill. This tool estimates the typical peak for single-stream inference at the full context you select. Ollama and llama.cpp can also offload some layers to system RAM, which lowers VRAM use at the cost of speed.

Q: Can I run a model that's slightly bigger than my VRAM?

A: Yes — with partial offload (some layers on CPU/RAM) or a smaller quant. A Q4_K_M model that barely overflows 12 GB will usually fit comfortably at Q3_K_M with a minor quality hit.
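
One way to picture the "smaller quant" option is to walk down the quantization levels until the estimate fits. The sketch below assumes a hypothetical 14B-parameter model, 2.5 GB of KV cache plus overhead, and the same 10% headroom rule; the function name and all inputs are illustrative.

```python
# Bits per weight, as quoted in the section on how the calculator works.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.67,
    "Q4_K_M": 4.83,
    "Q3_K_M": 3.91,
    "Q2_K": 3.35,
}


def largest_quant_that_fits(n_params, fixed_bytes, card_bytes, headroom=0.10):
    """Return the highest-precision quant whose total fits, or None.

    fixed_bytes is everything that does not depend on the quant level
    (KV cache plus overhead).
    """
    budget = card_bytes * (1.0 - headroom)
    by_precision = sorted(BITS_PER_WEIGHT.items(), key=lambda kv: kv[1], reverse=True)
    for quant, bpw in by_precision:
        if n_params * bpw / 8 + fixed_bytes <= budget:
            return quant
    return None


# Hypothetical 14B model on a 12 GB card with 2.5 GB of cache and overhead.
print(largest_quant_that_fits(14e9, fixed_bytes=2.5e9, card_bytes=12e9))
```

In this example Q4_K_M lands just over the 10.8 GB budget, while Q3_K_M fits with more than 1 GB to spare.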

Q: Does quantizing the KV cache help?

A: At long context, yes — an 8-bit KV cache roughly halves the cache size shown here. Most engines support it, with a small quality cost. This calculator assumes the default FP16 cache.
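
To see the effect at long context, the sketch below evaluates the KV-cache formula at 128K tokens with 2-byte and 1-byte elements. The architecture (32 layers, 8 KV heads, head dimension 128) is an assumed example, and the 1-byte figure is idealized: real 8-bit cache formats typically store per-block scale factors as well, so the saving is slightly under half.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """KV cache = 2 (keys and values) x layers x kv-heads x head-dim x context x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed example architecture at a 128K-token context.
arch = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=128 * 1024)

fp16_cache = kv_cache_bytes(**arch, bytes_per_elem=2.0)
int8_cache = kv_cache_bytes(**arch, bytes_per_elem=1.0)  # idealized 8-bit cache

print(f"FP16 cache:  {fp16_cache / 1e9:.1f} GB")
print(f"8-bit cache: {int8_cache / 1e9:.1f} GB")
```

For this assumed model the FP16 cache at 128K is several times larger than a Q4_K_M copy of 7B weights, which is where cache quantization pays off most.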

Q: What about Apple Silicon?

A: Macs use unified memory, so the "VRAM" is shared with the system. Leave ~6–8 GB for macOS and apps; the table accounts for typical usable memory on common configs.
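
As a back-of-the-envelope check for unified memory, subtract the system reserve from the total before comparing against an estimate. The helper name and the sample configurations below are illustrative, and the 6 to 8 GB reserve is the rule of thumb from the answer above rather than a measured value.

```python
def usable_unified_memory_gb(total_gb, reserve_gb=8.0):
    """Memory left for the model after reserving some for macOS and apps."""
    return max(total_gb - reserve_gb, 0.0)


# Sample unified-memory sizes in GB (illustrative).
for total_gb in (16, 32, 64):
    low = usable_unified_memory_gb(total_gb, reserve_gb=8.0)
    high = usable_unified_memory_gb(total_gb, reserve_gb=6.0)
    print(f"{total_gb} GB unified: roughly {low:.0f} to {high:.0f} GB for the model")
```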

Estimates are for planning only. Architecture figures come from each model's public config; bits-per-weight from the llama.cpp quantization tables. Verify against your own setup before buying hardware.