Custom LLM VRAM Calculator (Any Architecture)
Full inference-memory budget for ANY transformer from raw config.json fields — weights, KV cache, overhead.
Our per-model calculators cover the famous checkpoints; this one covers everything else — paste num_hidden_layers, num_key_value_heads and head_dim from any config.json on the Hub and budget precisely.
Formula
VRAM ≈ 1.1 × [P × bytes + 2 × layers × kv_heads × head_dim × ctx × batch × kv_bytes]
About Custom LLM VRAM Calculator (Any Architecture)
New model dropped an hour ago and no calculator covers it yet? Its config.json already contains the architecture fields you need. This tool takes the layer count, KV heads, and head dimension, and computes the full serving budget: weights at your chosen precision, the KV cache that scales with context length and concurrency, and a ~10% allowance for runtime overhead. It is the generic engine behind our per-model VRAM pages, exposed for the long tail of fine-tunes, merges, and architectures released faster than tooling can follow.
How to use Custom LLM VRAM Calculator (Any Architecture)
- 1. Enter your values into Custom LLM VRAM Calculator (Any Architecture). Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
- 2. The result recomputes live using the formula shown on the page; there is no button to press.
- 3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use Custom LLM VRAM Calculator (Any Architecture)?
- ✓ Computes the VRAM budget for a custom LLM instantly in your browser, with no sign-up, no upload, and no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown: VRAM ≈ 1.1 × [P × bytes + 2 × layers × kv_heads × head_dim × ctx × batch × kv_bytes]. The architecture variables (layers, kv_heads, head_dim) come straight from config.json; parameter count, precision, context length, and batch size are yours to set.
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
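The formula quoted above can be sketched in a few lines of Python. This is an illustrative transcription, not the page's own implementation: the function name `estimate_vram_bytes` and its parameter names are assumptions chosen to mirror the symbols in the formula, and the example model (8 billion parameters, 32 layers, 8 KV heads, head dimension 128) is hypothetical.

```python
def estimate_vram_bytes(
    params: float,            # P: total parameter count
    weight_bytes: float,      # bytes per weight, e.g. 2 for FP16/BF16, 1 for INT8
    layers: int,              # num_hidden_layers
    kv_heads: int,            # num_key_value_heads
    head_dim: int,            # head_dim
    ctx: int,                 # context length in tokens
    batch: int = 1,           # concurrent sequences
    kv_bytes: float = 2.0,    # bytes per cached KV element
    overhead: float = 1.1,    # the ~10% runtime allowance from the formula
) -> dict:
    """Return the weights, KV cache, and total budget in bytes."""
    weights = params * weight_bytes
    # Factor 2: keys and values are cached separately.
    kv_cache = 2 * layers * kv_heads * head_dim * ctx * batch * kv_bytes
    total = overhead * (weights + kv_cache)
    return {"weights": weights, "kv_cache": kv_cache, "total": total}


if __name__ == "__main__":
    # Hypothetical 8B-parameter model served in FP16 at 8192 tokens, batch 1.
    budget = estimate_vram_bytes(
        params=8e9, weight_bytes=2, layers=32, kv_heads=8, head_dim=128, ctx=8192
    )
    for name, value in budget.items():
        print(f"{name}: {value / 1e9:.2f} GB")
```

In this hypothetical case the KV cache is exactly 1 GiB (1,073,741,824 bytes) next to 16 GB of weights, so the weights dominate. Raising `ctx` or `batch` scales only the KV term.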
Frequently asked questions
Where do I find these numbers for a Hub model?
Open the repo's config.json: num_hidden_layers, num_key_value_heads (falls back to num_attention_heads if absent — that means MHA), and head_dim (or hidden_size ÷ num_attention_heads). Parameter count is on the model card or safetensors index.
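The fallback rules above can be sketched as a small helper that takes an already-parsed config.json. The function name `kv_fields_from_config` is hypothetical, and the sketch assumes the Llama-style key names listed in this answer; some architectures use different keys (GPT-2, for example, uses `n_layer`, `n_head`, and `n_embd`), so treat it as a starting point rather than a universal parser.

```python
import json


def kv_fields_from_config(config: dict) -> dict:
    """Extract layers, kv_heads, and head_dim, applying the documented fallbacks."""
    layers = config["num_hidden_layers"]
    attn_heads = config["num_attention_heads"]
    # Absent (or null) num_key_value_heads means plain multi-head attention.
    kv_heads = config.get("num_key_value_heads") or attn_heads
    # Absent (or null) head_dim is derived from hidden_size / num_attention_heads.
    head_dim = config.get("head_dim") or config["hidden_size"] // attn_heads
    return {"layers": layers, "kv_heads": kv_heads, "head_dim": head_dim}


if __name__ == "__main__":
    # Hypothetical config fragment; a real one would be read from the repo's config.json.
    raw = """
    {
      "hidden_size": 4096,
      "num_attention_heads": 32,
      "num_hidden_layers": 32,
      "num_key_value_heads": 8
    }
    """
    print(kv_fields_from_config(json.loads(raw)))
```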
Why multiply KV by 2?
Keys AND values are cached separately per layer per head — two tensors of [ctx × head_dim] each. Architectures with MLA (DeepSeek) break this rule by caching one compressed latent; for those, our dedicated DeepSeek calculator applies.
Does the 10% overhead cover activation buffers?
It covers framework workspace, CUDA context and fragmentation under paged allocators (vLLM-style). Naive HF-transformers generation without paged attention can need substantially more at long contexts — prefer a serving engine when the budget is tight.
How do I extend this for sliding-window models?
If the runtime implements a rolling-buffer cache, replace ctx with the window size for the windowed layers (Mistral 7B v0.1: a 4096-token window; Gemma 2: windowed on alternate layers only). Without that support, the full-ctx figure here is what you'll actually pay: the implementation determines the cost.
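The substitution described above can be sketched as a variant of the KV term that treats windowed and full-attention layers separately. The function name `kv_cache_bytes_windowed` and the `windowed_layers` parameter are assumptions for illustration, and the sketch only applies when the runtime really does use a rolling-buffer cache.

```python
def kv_cache_bytes_windowed(
    layers: int,             # total layer count
    windowed_layers: int,    # how many of those layers use sliding-window attention
    kv_heads: int,
    head_dim: int,
    ctx: int,                # requested context length in tokens
    window: int,             # sliding-window size in tokens
    batch: int = 1,
    kv_bytes: float = 2.0,
) -> float:
    """KV cache in bytes when windowed layers keep only the last `window` tokens."""
    if not 0 <= windowed_layers <= layers:
        raise ValueError("windowed_layers must be between 0 and layers")
    full_layers = layers - windowed_layers
    # Keys plus values for one token in one layer.
    per_token_per_layer = 2 * kv_heads * head_dim * kv_bytes
    # A windowed layer never stores more than `window` tokens.
    windowed_tokens = min(ctx, window)
    cached_tokens = full_layers * ctx + windowed_layers * windowed_tokens
    return batch * per_token_per_layer * cached_tokens


if __name__ == "__main__":
    # Hypothetical 32-layer model, 8 KV heads, head_dim 128, 32768-token context.
    common = dict(layers=32, kv_heads=8, head_dim=128, ctx=32768, window=4096)
    for n in (0, 16, 32):
        size = kv_cache_bytes_windowed(windowed_layers=n, **common)
        print(f"{n} windowed layers: {size / 2**30:.2f} GiB")
```

Setting `windowed_layers=0` reproduces the full-context KV term of the main formula, which is the figure to budget for when the runtime lacks rolling-buffer support.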
Related tools
Related ML & AI tools
ROC-AUC Calculator (from TPR/FPR points)
Trapezoidal area under the ROC curve from your (FPR, TPR) operating points — the threshold-independent ranking score.
Classification Threshold Cost Calculator
Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.
Silhouette Score Calculator
Cluster cohesion vs separation for one point — the building block of the silhouette metric for choosing K.