KV-Cache Size Calculator (Any Model)

Generic per-token and total KV-cache memory from architecture fields — the long-context budgeting workhorse.

Per token (KB)
Total cache (GB)
Sequences per 24 GB (after 16 GB weights)

This one formula governs the KV-cache share of long-context serving memory. The third output answers the practical question: how many 32K conversations fit beside a quantized 14B model on a 24 GB card? (With the defaults, not many.)

Formula

bytes/token = 2 (K,V) × layers × kv_heads × head_dim × precision_bytes
total bytes = bytes/token × context × batch
References: Kwon et al. (2023), PagedAttention / vLLM; Hooper et al. (2024), KVQuant
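
A minimal Python sketch of the formula above may make the arithmetic easier to check by hand. The function names and the example architecture (48 layers, 8 KV heads, head dimension 128, 2-byte precision) are illustrative assumptions, not the calculator's actual defaults, and the KiB/GiB conversions assume binary units.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, precision_bytes):
    """Bytes of K and V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * precision_bytes


def kv_total_bytes(layers, kv_heads, head_dim, precision_bytes, context, batch=1):
    """Total cache bytes for `batch` sequences of `context` tokens each."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, precision_bytes)
    return per_token * context * batch


# Hypothetical architecture, chosen only for illustration.
per_token = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128, precision_bytes=2)
total = kv_total_bytes(48, 8, 128, 2, context=32768, batch=1)

print(per_token / 1024, "KiB per token")   # 192.0
print(total / 2**30, "GiB total")          # 6.0
```

Under these assumed inputs, a single 32K-token sequence needs about 6 GiB of cache at 16-bit precision.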

About KV-Cache Size Calculator (Any Model)

Four architecture numbers and a context length determine whether a long-context workload fits in VRAM. This calculator exposes the bare formula, 2 × layers × KV heads × head dimension × bytes, with batch and precision dials, complementing our per-model pages for well-known checkpoints. The third output turns the abstraction into the question people actually have: given a model already occupying most of a 24 GB card, how many long conversations can be served concurrently? The answer is usually the point where KV quantization stops being optional.
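
The third output can be approximated as follows. This is a rough sketch, assuming binary gigabytes and ignoring activations, allocator fragmentation, and runtime overhead; `sequences_that_fit` is a hypothetical helper name and the per-token sizes are illustrative values, not the calculator's defaults.

```python
def sequences_that_fit(card_gb, weights_gb, per_token_bytes, context, gib=2**30):
    """Whole sequences whose KV cache fits in the memory left after weights."""
    free_bytes = (card_gb - weights_gb) * gib
    if free_bytes <= 0:
        return 0
    per_sequence_bytes = per_token_bytes * context
    return int(free_bytes // per_sequence_bytes)


# Illustrative: 192 KiB/token at 16-bit, 96 KiB/token at 8-bit, 32K context.
print(sequences_that_fit(24, 16, 196608, 32768))  # 1
print(sequences_that_fit(24, 16, 98304, 32768))   # 2
```

With these assumed numbers, halving the cache precision doubles the number of 32K conversations that fit, from one to two.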

How to use KV-Cache Size Calculator (Any Model)

  1. Enter your values into KV-Cache Size Calculator (Any Model). Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use KV-Cache Size Calculator (Any Model)?

  • Computes KV-Cache Size instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: bytes/token = 2 (K,V) × layers × kv_heads × head_dim × precision_bytes.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Which models can I use this for?

Any standard transformer with GQA/MHA/MQA — read num_hidden_layers, num_key_value_heads and head_dim from config.json. Exceptions: MLA models (DeepSeek — use our dedicated tool), sliding-window layers (cap ctx at the window for those layers), and hybrid SSM architectures like Jamba.
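
As a sketch of how those fields might be pulled from a Hugging Face style config.json: the two fallbacks below are assumptions that hold for many checkpoints but not all. Some configs omit `head_dim`, in which case `hidden_size // num_attention_heads` is a common (not universal) convention, and a missing `num_key_value_heads` is treated here as plain MHA. The example JSON is hypothetical.

```python
import json


def kv_fields_from_config(config):
    """Return (layers, kv_heads, head_dim) from a parsed config.json dict."""
    layers = config["num_hidden_layers"]
    attn_heads = config["num_attention_heads"]
    # Assumption: no num_key_value_heads means MHA (one KV head per query head).
    kv_heads = config.get("num_key_value_heads") or attn_heads
    # Assumption: no explicit head_dim means hidden_size / num_attention_heads.
    head_dim = config.get("head_dim") or config["hidden_size"] // attn_heads
    return layers, kv_heads, head_dim


# Hypothetical config contents, for illustration only.
example = json.loads(
    '{"num_hidden_layers": 48, "num_attention_heads": 40,'
    ' "num_key_value_heads": 8, "hidden_size": 5120}'
)
print(kv_fields_from_config(example))  # (48, 8, 128)
```

If a model's config sets `head_dim` explicitly, that value should take precedence, which is what the sketch does.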

How much does KV quantization save and cost?

FP8/INT8 halves the cache relative to FP16/BF16 with negligible quality loss (production-standard in vLLM/TRT-LLM). INT4 (KVQuant-style, often K in 4-bit + V in 8) reaches 3–4× with small but measurable long-context degradation. At 128K contexts, this is the difference between 1 and 4 concurrent users.
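
The savings can be estimated with idealized arithmetic. The sketch below assumes a 16-bit baseline and ignores per-group scales and zero-points, which reduce the real ratio somewhat; `kv_compression_ratio` is a hypothetical helper, not part of any library.

```python
def kv_compression_ratio(k_bits, v_bits, baseline_bits=16):
    """Idealized cache shrink factor versus storing K and V at baseline_bits."""
    return (2 * baseline_bits) / (k_bits + v_bits)


print(kv_compression_ratio(8, 8))  # 2.0  (FP8/INT8 for both K and V)
print(kv_compression_ratio(4, 4))  # 4.0  (4-bit for both, before metadata overhead)
print(round(kv_compression_ratio(4, 8), 2))  # 2.67 (mixed K4/V8)
```

Under this simplified accounting, a mixed K4/V8 layout averages 6 bits per element and lands near 2.7×, so the 3–4× range corresponds to layouts where both tensors sit at or close to 4 bits.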

Does the cache really grow linearly forever?

For full attention, yes — every past token stays resident. Sliding windows cap it; StreamingLLM keeps sinks + recent; cross-layer sharing (CLA) and MLA shrink the constant. But for the standard GQA stack this tool models, 1M tokens of context is 1M tokens of cache.
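
To illustrate how a sliding window caps growth, here is a sketch for a hypothetical model that mixes full-attention and sliding-window layers. The layer split, window size, and function names are assumptions for illustration; real engines may allocate differently.

```python
def cached_tokens(context, window=None):
    """Tokens kept resident per layer: all of them, or at most `window`."""
    return context if window is None else min(context, window)


def kv_bytes_mixed(full_layers, sliding_layers, window,
                   kv_heads, head_dim, precision_bytes, context):
    """Cache bytes for one sequence with a mix of full and sliding layers."""
    per_layer_token = 2 * kv_heads * head_dim * precision_bytes
    full = full_layers * cached_tokens(context) * per_layer_token
    sliding = sliding_layers * cached_tokens(context, window) * per_layer_token
    return full + sliding


# Hypothetical: 24 full layers + 24 sliding layers with a 4096-token window.
mixed = kv_bytes_mixed(24, 24, 4096, 8, 128, 2, context=32768)
all_full = kv_bytes_mixed(48, 0, None, 8, 128, 2, context=32768)
print(mixed / 2**30, "GiB vs", all_full / 2**30, "GiB")  # 3.375 GiB vs 6.0 GiB
```

The full-attention layers still grow linearly with context; only the sliding layers stop growing once the context exceeds the window.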

Why does prefill spike memory above this estimate?

Prefill materializes activations for the whole prompt at once (and some engines allocate cache eagerly), so peak VRAM during a 100K-token prompt exceeds steady-state decode. Chunked prefill — processing the prompt in slices — flattens the spike at slight latency cost.
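
The slicing itself can be sketched in a few lines. This only shows how a prompt might be divided into chunks, under the simplifying assumption that per-step activation memory scales with the largest slice; it is not how any particular engine schedules prefill, and `prefill_chunks` is a hypothetical helper.

```python
def prefill_chunks(prompt_tokens, chunk_size):
    """Return (start, end) token ranges covering the prompt in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_tokens))
        for start in range(0, prompt_tokens, chunk_size)
    ]


chunks = prefill_chunks(100_000, 8192)
print(len(chunks))   # 13
print(chunks[-1])    # (98304, 100000)
```

In this simplified view, each step materializes activations for at most 8,192 tokens rather than 100,000, at the cost of 13 sequential passes instead of one.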
