
DeepSeek-V3 671B (MoE) VRAM Calculator

Estimate GPU memory to run DeepSeek-V3 671B (MoE) — weights, KV cache and overhead at FP16/INT8/INT4, with a fits-on-which-GPU verdict.

Weights (GB)
KV cache (GB)
Total VRAM needed (GB)

DeepSeek-V3 is a 671B-parameter MoE with 37B active per token and Multi-head Latent Attention (MLA), which compresses the KV cache to a 512-dim latent per token — far below classic GQA. All 671B MoE parameters stay resident; only ~37B are active per token.

Formula

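VRAM ≈ 1.1 × (P × bytes(precision) + 2 × layers × kv_heads × head_dim × ctx × batch × bytes(kv))
DeepSeek-V3 671B (MoE): P = 671B, layers = 61, kv_heads = 128, head_dim = 56
Note: the cache term is the generic full-K/V formula. With these values it gives about 1.75 MB per token at FP16, whereas the MLA cache described in the FAQ is (512+64) × 61 × 2 bytes ≈ 70 KB per token, so the term as written overstates this model's cache by roughly 25×.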
References: DeepSeek-V3 671B (MoE) model card / config.json (Hugging Face); Kwon et al. (2023), Efficient Memory Management for LLM Serving with PagedAttention
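
As a rough illustration, the formula above can be evaluated directly. This is a minimal sketch, not the calculator's actual source: the function and class names are hypothetical, the context length (4096) and batch size (1) are assumed values chosen for the example, and GB is taken as 10^9 bytes to match the page's "~671 GB at FP8" convention.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes (assumption: matches the page's "671 GB at FP8")


@dataclass
class VramEstimate:
    weights_gb: float
    kv_cache_gb: float
    total_gb: float


def estimate_vram(params, weight_bytes, layers, kv_heads, head_dim,
                  ctx, batch, kv_bytes, overhead=1.1):
    """Evaluate the page's generic formula; sizes are in bytes until the end."""
    weights = params * weight_bytes
    # Factor 2 = one K and one V tensor per layer.
    kv_cache = 2 * layers * kv_heads * head_dim * ctx * batch * kv_bytes
    total = overhead * (weights + kv_cache)
    return VramEstimate(weights / GB, kv_cache / GB, total / GB)


# Model parameters as stated on the page; ctx and batch are assumed values.
est = estimate_vram(params=671e9, weight_bytes=2, layers=61, kv_heads=128,
                    head_dim=56, ctx=4096, batch=1, kv_bytes=2)
print(f"weights {est.weights_gb:.1f} GB, KV {est.kv_cache_gb:.2f} GB, "
      f"total {est.total_gb:.1f} GB")
```

With FP16 weights and an FP16 cache, the weight term dominates: about 1,342 GB of weights against roughly 7 GB of cache at this context length, before the 10% overhead factor is applied.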


About DeepSeek-V3 671B (MoE) VRAM Calculator

This calculator estimates how much GPU memory (VRAM) you need to run DeepSeek-V3 671B (MoE) locally or in production. It sums the three real costs of inference: the model weights at your chosen precision (FP16, INT8 or INT4), the key-value attention cache that grows with context length and concurrent sequences, and ~10% runtime overhead for CUDA buffers and fragmentation. Use the precision and context sliders to find the cheapest GPU that actually fits your workload instead of guessing from the parameter count alone.

How to use DeepSeek-V3 671B (MoE) VRAM Calculator

  1. Enter your values into DeepSeek-V3 671B (MoE) VRAM Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use DeepSeek-V3 671B (MoE) VRAM Calculator?

  • Computes DeepSeek-V3 671B (MoE) VRAM instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown in the Formula section above.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

How does MLA change the KV-cache math for DeepSeek-V3?

Instead of caching full K and V per head, MLA caches one 512-dim compressed latent plus a 64-dim rotary key per token per layer: (512+64) × 61 layers × 2 bytes ≈ 70 KB per token at FP16 — comparable to a 7B GQA model despite 671B parameters.
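
The per-token arithmetic above can be checked with a short sketch. The function names are hypothetical. The second function applies the generic full-K/V term with the kv_heads=128 and head_dim=56 values listed in the Formula section, purely for comparison.

```python
def mla_bytes_per_token(latent_dim=512, rope_dim=64, layers=61, elem_bytes=2):
    """Per-token cache under MLA: one latent plus one rotary key per layer."""
    return (latent_dim + rope_dim) * layers * elem_bytes


def full_kv_bytes_per_token(kv_heads=128, head_dim=56, layers=61, elem_bytes=2):
    """Per-token cache if full K and V were stored for every head."""
    return 2 * layers * kv_heads * head_dim * elem_bytes


mla = mla_bytes_per_token()        # 70,272 bytes, i.e. about 70 KB
full = full_kv_bytes_per_token()   # 1,748,992 bytes, i.e. about 1.75 MB
print(mla, full, round(full / mla, 1))
```

Under these assumptions the MLA cache is roughly 25× smaller per token than the uncompressed per-head figure.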

What does it take to host DeepSeek-V3 weights?

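All 671B parameters must be resident, since the router may pick any expert: ~671 GB at FP8 (the native release format) or ~1.34 TB at BF16, regardless of the small active-parameter count. FP8 weights exceed a single 8×H100 80 GB node (640 GB) but fit on one 8×H200 node (1,128 GB), leaving the remainder for KV cache and overhead; BF16 weights need a multi-node cluster (e.g. 2× 8×H200).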
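
A small sketch of the weight arithmetic, assuming 80 GB per H100 and 141 GB per H200 (vendor-quoted capacities, not taken from this page). The helper names are hypothetical, and the GPU count is only a lower bound because it ignores KV cache, activations and runtime overhead.

```python
import math

PARAMS = 671e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}
# Assumed per-GPU memory in GB (vendor-quoted capacities, not from the page).
GPU_GB = {"H100-80GB": 80, "H200-141GB": 141}


def weight_gb(params, bytes_per_param):
    """Weight memory in decimal GB."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(weights_gb, gpu_gb):
    """Lower bound: ignores KV cache, activations and runtime overhead."""
    return math.ceil(weights_gb / gpu_gb)


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_gb(PARAMS, nbytes)
    counts = {gpu: min_gpus_for_weights(gb, cap) for gpu, cap in GPU_GB.items()}
    print(f"{precision}: {gb:.1f} GB -> {counts}")
```

In practice GPUs are usually deployed in nodes of eight, so a count above 8 implies more than one node.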

How accurate is this DeepSeek-V3 671B (MoE) VRAM estimate?

It uses the exact architecture from the model's config.json (layers, heads, KV heads, head dimension) and standard serving math, so weight and cache figures are typically within a few percent. Real usage varies with your inference engine's allocator, paged-attention block size and activation buffers.

Does quantizing the KV cache hurt quality?

INT8/FP8 KV cache is widely used in production (vLLM, TensorRT-LLM) with negligible quality loss on most tasks, and it halves cache memory relative to FP16. It matters most for long contexts, where the cache rivals or exceeds the weight memory.
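
To make the halving concrete, here is a minimal sketch that scales the MLA per-token figure from the earlier FAQ answer across a workload. The function name is hypothetical, and the workload (32 concurrent sequences of 32,768 tokens) is an assumed example, not a recommendation.

```python
def mla_cache_gb(ctx, batch, elem_bytes, latent_dim=512, rope_dim=64, layers=61):
    """Total MLA cache in decimal GB for `batch` sequences of `ctx` tokens."""
    per_token = (latent_dim + rope_dim) * layers * elem_bytes
    return per_token * ctx * batch / 1e9


# Assumed workload: 32 concurrent sequences of 32,768 tokens each.
fp16 = mla_cache_gb(ctx=32768, batch=32, elem_bytes=2)
int8 = mla_cache_gb(ctx=32768, batch=32, elem_bytes=1)
print(f"FP16 cache {fp16:.2f} GB, 8-bit cache {int8:.2f} GB")
```

For this assumed workload the cache drops from about 74 GB to about 37 GB when stored in 8 bits.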
