
Llama 3.1 405B KV-Cache Calculator

Per-token and total key-value cache memory for Llama 3.1 405B across context length, batch size and cache precision.

Outputs: Per token (KB), Per sequence (GB), Total cache (GB)

Llama 3.1 405B: ~504.0 KB per token at FP16. Llama 3.1 405B is among the largest open-weights dense models; even at FP8 the weights need ~406 GB, so it is a multi-GPU deployment (e.g. 8× H100 80 GB with FP8).

Formula

cache/token = 2 (K and V) × layers × kv_heads × head_dim × bytes = 2 × 126 × 8 × 128 × bytes
References: Llama 3.1 405B config.json (Hugging Face); Kwon et al. (2023), PagedAttention / vLLM
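
The formula above can be checked with a few lines of Python. This is a minimal sketch, not the calculator's own code: the function name `kv_bytes_per_token` and the precision table are illustrative, and INT4 is assumed to cost exactly half a byte per element with no extra space for quantization scales.

```python
# Llama 3.1 405B attention geometry, as quoted in the formula above.
LAYERS = 126
KV_HEADS = 8
HEAD_DIM = 128

# Bytes per cached element. Assumption: INT4 is a flat 0.5 bytes,
# ignoring any per-group scales or zero points a real kernel stores.
BYTES_PER_ELEMENT = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def kv_bytes_per_token(precision: str = "fp16") -> float:
    """Bytes of KV cache stored for one token across all layers."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT[precision]


for name in BYTES_PER_ELEMENT:
    kib = kv_bytes_per_token(name) / 1024
    print(f"{name}: {kib:.1f} KiB per token")
```

At FP16 this gives 516,096 bytes, which is the 504.0 KB figure quoted above when a kilobyte is read as 1,024 bytes; FP8 and INT4 halve and quarter it.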


About Llama 3.1 405B KV-Cache Calculator

The KV cache is the hidden memory cost of serving Llama 3.1 405B: every prompt or generated token stores its attention keys and values for reuse, and at long contexts this cache can rival the model weights themselves. This calculator uses Llama 3.1 405B's exact attention geometry (8 KV heads × 128-dim heads × 126 layers) to give per-token, per-sequence and whole-batch cache sizes at FP16, FP8 and INT4 precision. Use it to size batch limits for your GPU or to see what a 128K-context request really costs.

How to use Llama 3.1 405B KV-Cache Calculator

  1. Enter your values into the Llama 3.1 405B KV-Cache Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Llama 3.1 405B KV-Cache Calculator?

  • Computes the Llama 3.1 405B KV-cache size instantly in your browser: no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: cache/token = 2 (K and V) × layers × kv_heads × head_dim × bytes = 2 × 126 × 8 × 128 × bytes.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

What hardware does Llama 3.1 405B actually need?

Meta's reference serving setup is a full 8× H100 80 GB node (640 GB) running FP8 weights (~406 GB) with the remainder for KV cache. At BF16 you need two such nodes with pipeline or tensor parallelism across 16 GPUs.
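
The node counts in this answer follow from simple arithmetic on the weight footprint. The sketch below assumes a nominal 405 × 10^9 parameters, decimal gigabytes, and 640 GB per node; the names `weights_gb` and `nodes_for_weights` are illustrative. It gives a lower bound only, since it leaves no room for KV cache or activations.

```python
import math

PARAMS = 405e9          # assumption: nominal parameter count
NODE_VRAM_GB = 8 * 80   # one node of 8x H100 80 GB


def weights_gb(bytes_per_param: float) -> float:
    """Weight footprint in decimal GB at the given precision."""
    return PARAMS * bytes_per_param / 1e9


def nodes_for_weights(bytes_per_param: float) -> int:
    """Minimum nodes needed just to hold the weights (no KV cache)."""
    return math.ceil(weights_gb(bytes_per_param) / NODE_VRAM_GB)


for label, width in (("BF16", 2.0), ("FP8", 1.0)):
    print(f"{label}: {weights_gb(width):.0f} GB weights, "
          f"{nodes_for_weights(width)} node(s) minimum")
```

Under these assumptions BF16 weights come to about 810 GB and need two nodes, while FP8 weights fit in one node with room left for the cache.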

How big is the 405B KV cache at the full 128K context?

With GQA (8 KV heads, head_dim 128) one 128K-token sequence costs about 2 × 126 × 8 × 128 × 131072 × 2 bytes ≈ 63 GB at FP16 (binary gigabytes; about 67.6 × 10^9 bytes), nearly a full H100 per request, which is why long-context serving uses FP8 KV and paged attention.
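
The same arithmetic extends to any context length, batch size and cache precision. The following is a small sketch under the geometry stated above; the function name `kv_cache_bytes` is illustrative, and sizes are reported in binary gigabytes (2^30 bytes) to match the 63 GB figure.

```python
GIB = 1024 ** 3


def kv_cache_bytes(context_tokens: int, batch_size: int = 1,
                   bytes_per_element: float = 2.0) -> float:
    """Total KV cache bytes for a batch of equal-length sequences."""
    # 2 (K and V) x 126 layers x 8 KV heads x 128 head_dim
    per_token = 2 * 126 * 8 * 128 * bytes_per_element
    return per_token * context_tokens * batch_size


# One full 128K-token (131072) sequence at each cache precision.
for label, width in (("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
    gib = kv_cache_bytes(131072, 1, width) / GIB
    print(f"{label}: {gib:.2f} GiB per 128K sequence")
```

A single 128K sequence works out to 63 GiB at FP16, 31.5 GiB at FP8 and 15.75 GiB at INT4; multiplying by the batch size gives the total cache.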

Why does the KV cache matter more than weights for serving throughput?

Weights are paid once per GPU; cache is paid per concurrent request and per token of context. Batch size, and therefore throughput, is capped by how many sequence caches fit in the VRAM left after weights, which is exactly what this tool computes.
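
The batch cap described here can be estimated directly. This is a rough sketch, assuming decimal gigabytes for VRAM and weights and a hypothetical `reserve_gb` allowance for activations and framework overhead (zero by default, which is optimistic); `max_concurrent_sequences` is an illustrative name, not part of the tool.

```python
def max_concurrent_sequences(total_vram_gb: float, weights_gb: float,
                             context_tokens: int,
                             bytes_per_element: float = 1.0,
                             reserve_gb: float = 0.0) -> int:
    """How many full-length sequence caches fit after the weights."""
    free_bytes = (total_vram_gb - weights_gb - reserve_gb) * 1e9
    # Per-sequence cache: 2 (K and V) x layers x kv_heads x head_dim
    per_sequence = 2 * 126 * 8 * 128 * bytes_per_element * context_tokens
    return max(0, int(free_bytes // per_sequence))


# One 8x H100 80 GB node (640 GB), FP8 weights (~406 GB), FP8 KV cache.
print(max_concurrent_sequences(640, 406, 131072))  # full 128K context
print(max_concurrent_sequences(640, 406, 8192))    # 8K context
```

With these inputs the node holds about 6 concurrent 128K-token sequences, or about 110 at an 8K context, before any overhead is reserved. Real deployments will fit fewer.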

What does paged attention change?

PagedAttention (vLLM) allocates the cache in fixed-size blocks on demand instead of reserving the full context up front, which largely eliminates fragmentation and lets the server overcommit memory. The per-token cost shown here is unchanged; you just stop paying for unused reservation.
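
The difference between up-front reservation and block-wise allocation can be illustrated with simple accounting. This sketch does not call vLLM; the block size of 16 tokens is an assumption chosen for illustration, and both function names are hypothetical.

```python
import math

FP16_BYTES_PER_TOKEN = 2 * 126 * 8 * 128 * 2  # 516,096 bytes


def reserved_kv_bytes(max_context: int,
                      bytes_per_token: int = FP16_BYTES_PER_TOKEN) -> int:
    """Cache reserved when the full context is allocated up front."""
    return max_context * bytes_per_token


def paged_kv_bytes(tokens_in_use: int, block_size: int = 16,
                   bytes_per_token: int = FP16_BYTES_PER_TOKEN) -> int:
    """Cache allocated when blocks are handed out on demand."""
    # Only the last block can be partly empty, so waste is bounded
    # by block_size - 1 tokens per sequence.
    blocks = math.ceil(tokens_in_use / block_size)
    return blocks * block_size * bytes_per_token


used = 1000  # tokens actually held by a request so far
print(f"reserved: {reserved_kv_bytes(131072) / 1024**3:.2f} GiB")
print(f"paged:    {paged_kv_bytes(used) / 1024**3:.2f} GiB")
```

For a request holding 1,000 tokens, the paged allocation covers 63 blocks (1,008 tokens, roughly 0.5 GiB at FP16) instead of the 63 GiB a full 128K reservation would claim.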
