
DeepSeek-V3 671B (MoE) KV-Cache Calculator

Per-token and total key-value cache memory for DeepSeek-V3 671B (MoE) across context length, batch size and cache precision.

Outputs: per token (KB), per sequence (GB), total cache (GB).

DeepSeek-V3 671B (MoE): ~68.6 KB per token at FP16 (70,272 bytes, with 1 KB = 1,024 bytes). DeepSeek-V3 is a 671B-parameter MoE with 37B parameters active per token and Multi-head Latent Attention (MLA), which compresses the KV cache to a 512-dim latent plus a 64-dim RoPE key per token per layer, far below what classic GQA would need at this model scale.

Formula

MLA cache/token = (512 latent + 64 rope) × layers × bytes = 576 × 61 × bytes (DeepSeek MLA compresses K,V into one latent)
References: DeepSeek-V3 671B (MoE) config.json (Hugging Face); Kwon et al. (2023), PagedAttention / vLLM; DeepSeek-AI (2024), DeepSeek-V2/V3 Technical Reports (MLA)
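
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the names `mla_cache_size` and `CacheSize` are hypothetical, and binary units (1 KiB = 1,024 bytes) are assumed for display.

```python
from dataclasses import dataclass

# Geometry from the formula: 512-dim latent + 64-dim RoPE key, 61 layers.
LATENT_DIM = 512
ROPE_DIM = 64
NUM_LAYERS = 61


@dataclass
class CacheSize:
    per_token_bytes: float
    per_sequence_bytes: float
    total_bytes: float


def mla_cache_size(context_len: int, batch_size: int, bytes_per_elem: float) -> CacheSize:
    """Apply cache/token = (512 + 64) x layers x bytes, then scale up."""
    per_token = (LATENT_DIM + ROPE_DIM) * NUM_LAYERS * bytes_per_elem
    per_sequence = per_token * context_len
    return CacheSize(per_token, per_sequence, per_sequence * batch_size)


# Example inputs (assumed): 131,072-token context, batch of 8, FP16 (2 bytes).
size = mla_cache_size(context_len=131_072, batch_size=8, bytes_per_elem=2)
print(f"per token:    {size.per_token_bytes / 1024:.1f} KiB")
print(f"per sequence: {size.per_sequence_bytes / 1024**3:.2f} GiB")
print(f"total cache:  {size.total_bytes / 1024**3:.2f} GiB")
```

Passing `bytes_per_elem=1` or `0.5` gives the FP8 and INT4 figures under the same formula.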


About DeepSeek-V3 671B (MoE) KV-Cache Calculator

The KV cache is the hidden memory cost of serving DeepSeek-V3 671B (MoE): every generated or prompted token stores its attention keys and values for reuse, and at long contexts this cache can rival the model weights themselves. This calculator uses DeepSeek-V3 671B (MoE)'s exact attention geometry (DeepSeek's Multi-head Latent Attention, which caches a single compressed 576-dim latent per token per layer) to give per-token, per-sequence and whole-batch cache sizes at FP16, FP8 and INT4 precision. Use it to size batch limits for your GPU or to see what a 128K-context request really costs.
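
As a rough illustration of what a 128K-context request costs at each precision, the sketch below applies the 576-dim, 61-layer geometry to a single sequence. It assumes "128K" means 131,072 tokens and that INT4 costs exactly half a byte per element, ignoring any quantization scale or zero-point overhead. The function name is hypothetical.

```python
# Bytes per cached element for each precision the calculator offers.
# Assumption: INT4 is a flat 0.5 bytes, with no scale/zero-point overhead.
BYTES_PER_ELEM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

ELEMS_PER_TOKEN = 576 * 61   # 576-dim latent per layer, 61 layers
CONTEXT_LEN = 128 * 1024     # assumption: "128K" = 131,072 tokens


def sequence_cache_gib(precision: str, context_len: int = CONTEXT_LEN) -> float:
    """Cache size of one sequence in GiB at the given precision."""
    total_bytes = ELEMS_PER_TOKEN * BYTES_PER_ELEM[precision] * context_len
    return total_bytes / 1024**3


for name in BYTES_PER_ELEM:
    print(f"{name}: {sequence_cache_gib(name):.2f} GiB per 128K-token sequence")
```

Each halving of precision halves the per-sequence cache, so the same VRAM budget holds twice as many concurrent sequences.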

How to use DeepSeek-V3 671B (MoE) KV-Cache Calculator

  1. Enter your values into the DeepSeek-V3 671B (MoE) KV-Cache Calculator. Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use DeepSeek-V3 671B (MoE) KV-Cache Calculator?

  • Computes the DeepSeek-V3 671B (MoE) KV-cache size instantly in your browser: no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: MLA cache/token = (512 latent + 64 rope) × layers × bytes = 576 × 61 × bytes.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

How does MLA change the KV-cache math for DeepSeek-V3?

Instead of caching full K and V per head, MLA caches one 512-dim compressed latent plus a 64-dim rotary key per token per layer: (512 + 64) × 61 layers × 2 bytes = 70,272 bytes ≈ 70 KB per token at FP16, comparable to a 7B GQA model despite 671B parameters.
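
To make the comparison concrete, the sketch below puts the MLA formula next to the usual per-head K/V formula. The GQA geometry used here (32 layers, 8 KV heads, head dimension 128) is a hypothetical 7B-class configuration chosen for illustration and is not taken from this page.

```python
def mla_bytes_per_token(latent_dim: int, rope_dim: int, layers: int,
                        bytes_per_elem: float) -> float:
    """MLA caches one latent plus one RoPE key per token per layer."""
    return (latent_dim + rope_dim) * layers * bytes_per_elem


def gqa_bytes_per_token(kv_heads: int, head_dim: int, layers: int,
                        bytes_per_elem: float) -> float:
    """GQA caches separate K and V (factor 2) for every KV head per layer."""
    return 2 * kv_heads * head_dim * layers * bytes_per_elem


# DeepSeek-V3 geometry from the formula on this page, at FP16.
mla = mla_bytes_per_token(latent_dim=512, rope_dim=64, layers=61, bytes_per_elem=2)

# Hypothetical 7B-class GQA geometry (an assumption, for illustration only).
gqa = gqa_bytes_per_token(kv_heads=8, head_dim=128, layers=32, bytes_per_elem=2)

print(f"MLA (DeepSeek-V3):      {mla:,.0f} bytes/token")
print(f"GQA (hypothetical 7B):  {gqa:,.0f} bytes/token")
print(f"MLA / GQA ratio:        {mla / gqa:.2f}")
```

Under these assumed numbers the two land within a factor of two of each other, which is the sense in which the caches are comparable.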

What does it take to host DeepSeek-V3 weights?

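All 671B parameters must be resident: ~671 GB at FP8 (the native release format) or ~1.34 TB at BF16. The small active-parameter count does not reduce this, since the router may pick any expert. In practice that means at least a full node of large-memory GPUs at FP8 (8× H200 at 141 GB each provides about 1.13 TB) and a multi-node cluster (e.g. 2× 8×H200) at BF16.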
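
The weight figures follow from parameters × bytes per parameter. The sketch below reproduces them and derives a weights-only lower bound on GPU count. The 141 GB per-GPU capacity is an assumption (the H200's memory size), the function names are hypothetical, and the bound ignores KV cache, activations and framework overhead, so real deployments need more headroom.

```python
import math

TOTAL_PARAMS = 671e9  # all experts must be resident, not just the 37B active


def weight_gb(bytes_per_param: float) -> float:
    """Weight footprint in decimal GB."""
    return TOTAL_PARAMS * bytes_per_param / 1e9


def min_gpus_for_weights(bytes_per_param: float, gpu_mem_gb: float) -> int:
    """Weights-only lower bound; leaves no room for cache or activations."""
    return math.ceil(weight_gb(bytes_per_param) / gpu_mem_gb)


GPU_MEM_GB = 141  # assumption: H200-class accelerator

for name, nbytes in (("FP8", 1), ("BF16", 2)):
    print(f"{name}: {weight_gb(nbytes):,.0f} GB of weights, "
          f"at least {min_gpus_for_weights(nbytes, GPU_MEM_GB)} GPUs")
```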

Why does the KV cache matter more than weights for serving throughput?

Weights are paid once per GPU; cache is paid per concurrent request and per token of context. Batch size, and therefore throughput, is capped by how many sequence caches fit in the VRAM left after weights, which is exactly what this tool computes.
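
The batch cap described above is a single division. The following sketch assumes the whole cluster's VRAM is one pool and that each sequence uses its full context. The function name, the `reserve_gb` parameter and the example hardware numbers are all hypothetical.

```python
BYTES_PER_TOKEN_ELEMS = 576 * 61  # elements cached per token across 61 layers


def max_concurrent_sequences(total_vram_gb: float, weights_gb: float,
                             context_len: int, bytes_per_elem: float,
                             reserve_gb: float = 0.0) -> int:
    """How many full-length sequence caches fit in VRAM left after weights."""
    per_sequence_bytes = BYTES_PER_TOKEN_ELEMS * bytes_per_elem * context_len
    free_bytes = (total_vram_gb - weights_gb - reserve_gb) * 1e9
    if free_bytes <= 0:
        return 0  # weights alone do not fit
    return int(free_bytes // per_sequence_bytes)


# Assumed example: 8 GPUs x 141 GB, FP8 weights (~671 GB), 32K context, FP16 cache.
batch = max_concurrent_sequences(total_vram_gb=8 * 141, weights_gb=671,
                                 context_len=32_768, bytes_per_elem=2)
print(f"max concurrent 32K-token sequences: {batch}")
```

Setting `reserve_gb` to cover activations and runtime buffers gives a more conservative cap.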

What does paged attention change?

PagedAttention (vLLM) allocates the cache in fixed-size blocks on demand instead of reserving the full context up front, which nearly eliminates fragmentation (waste is limited to the unfilled tail of each sequence's last block) and lets you overcommit. The per-token cost shown here is unchanged; you just stop paying for unused reservation.
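
The difference between up-front reservation and block-wise allocation can be sketched as follows. This is a simplified model rather than vLLM's actual allocator: the block size of 16 tokens and the function names are assumptions, and the per-token cost is the FP16 figure from this page.

```python
import math

BYTES_PER_TOKEN_FP16 = 576 * 61 * 2  # 70,272 bytes, from the formula on this page
BLOCK_TOKENS = 16                    # assumption: tokens per cache block


def reserved_bytes(max_context: int) -> int:
    """Up-front reservation: pay for the full context regardless of use."""
    return max_context * BYTES_PER_TOKEN_FP16


def paged_bytes(actual_tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Block-wise allocation: pay only for blocks that hold tokens."""
    blocks = math.ceil(actual_tokens / block_tokens)
    return blocks * block_tokens * BYTES_PER_TOKEN_FP16


# Assumed example: a 3,000-token request on a server configured for 128K context.
up_front = reserved_bytes(131_072)
paged = paged_bytes(3_000)
print(f"reserved up front: {up_front / 1024**3:.2f} GiB")
print(f"paged allocation:  {paged / 1024**2:.1f} MiB")
```

The only waste left in the paged case is the partly filled final block, here 8 token slots.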
