Custom LLM VRAM Calculator (Any Architecture)

Full inference-memory budget for ANY transformer from raw config.json fields — weights, KV cache, overhead.

Inputs: Parameters (B), num_hidden_layers, num_key_value_heads, head_dim, Context length (tokens), Concurrent sequences, Weight precision, KV precision

Outputs: Weights (GB), KV cache (GB), Total VRAM (GB)

Our per-model calculators cover the famous checkpoints; this one covers everything else — paste num_hidden_layers, num_key_value_heads and head_dim from any config.json on the Hub and budget precisely.

Formula

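VRAM ≈ 1.1 × [P × bytes + 2 × layers × kv_heads × head_dim × ctx × batch × kv_bytes], where P is the parameter count, bytes and kv_bytes are the bytes per element at the chosen weight and KV precision, ctx is the context length in tokens, and batch is the number of concurrent sequences. layers, kv_heads and head_dim come straight from config.json; the result is in bytes.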
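The formula can be sketched in a few lines of Python. This is an illustrative sketch, not the page's actual implementation: the function name `estimate_vram_gb`, the choice of 1 GB = 10^9 bytes, and the example model (8B parameters, 32 layers, 8 KV heads, head_dim 128, 8192-token context, one sequence, 2-byte weights and KV) are all assumptions made for the worked example.

```python
def estimate_vram_gb(params_b, layers, kv_heads, head_dim, ctx, batch,
                     weight_bytes=2.0, kv_bytes=2.0, overhead=1.1):
    """Return (weights_gb, kv_gb, total_gb).

    params_b is the parameter count in billions. 1 GB = 1e9 bytes here,
    which is an assumption; the page does not say which unit it uses.
    """
    weights = params_b * 1e9 * weight_bytes
    # Factor 2: one key tensor and one value tensor per layer and KV head.
    kv = 2 * layers * kv_heads * head_dim * ctx * batch * kv_bytes
    total = overhead * (weights + kv)
    return weights / 1e9, kv / 1e9, total / 1e9


# Hypothetical model: 8B params, 32 layers, 8 KV heads, head_dim 128,
# 8192-token context, 1 sequence, 16-bit weights and KV cache.
weights_gb, kv_gb, total_gb = estimate_vram_gb(8, 32, 8, 128, 8192, 1)
print(f"weights={weights_gb:.2f} GB  kv={kv_gb:.2f} GB  total={total_gb:.2f} GB")
# prints: weights=16.00 GB  kv=1.07 GB  total=18.78 GB
```

Substituting by hand gives the same figures: 8 × 10^9 × 2 = 16 GB of weights, 2 × 32 × 8 × 128 × 8192 × 1 × 2 = 1,073,741,824 bytes of KV cache, and 1.1 × (16 + 1.07) ≈ 18.78 GB in total.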

References: Hugging Face transformers config documentation; Kwon et al. (2023), PagedAttention

About Custom LLM VRAM Calculator (Any Architecture)

New model dropped an hour ago and no calculator covers it yet? Its config.json already contains the architecture fields needed. This tool takes the layer count, KV heads and head dimension, plus the parameter count, and computes the full serving budget: weights at your precision, the KV cache that scales with context and concurrency, and a ~10% allowance for runtime overhead. It is the generic engine behind our per-model VRAM pages, exposed for the long tail of fine-tunes, merges, and architectures released faster than tooling follows.

How to use Custom LLM VRAM Calculator (Any Architecture)

1. Enter your values into Custom LLM VRAM Calculator (Any Architecture); sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Custom LLM VRAM Calculator (Any Architecture)?

✓ Computes the VRAM budget instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: VRAM ≈ 1.1 × [P×bytes + 2×layers×kv_heads×head_dim×ctx×batch×kv_bytes], with the architecture fields taken straight from config.json.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Where do I find these numbers for a Hub model?

Open the repo's config.json and read num_hidden_layers, num_key_value_heads (if absent, fall back to num_attention_heads, which means plain multi-head attention, MHA), and head_dim (if absent, use hidden_size ÷ num_attention_heads). The parameter count is on the model card or in the safetensors index.
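The fallback rules above can be sketched as a small helper. This is a minimal sketch: the function name `kv_fields` and both inline configs are hypothetical, and it assumes the config uses the standard Hugging Face field names mentioned in the answer.

```python
import json


def kv_fields(config):
    """Return (layers, kv_heads, head_dim) from a parsed config.json dict."""
    heads = config.get("num_attention_heads")
    layers = config["num_hidden_layers"]
    # No num_key_value_heads (or null) means plain MHA: one KV head per query head.
    kv_heads = config.get("num_key_value_heads") or heads
    # No head_dim (or null): derive it from the hidden size.
    head_dim = config.get("head_dim") or config["hidden_size"] // heads
    return layers, kv_heads, head_dim


# Hypothetical GQA config: 32 query heads share 8 KV heads.
gqa_cfg = json.loads(
    '{"num_hidden_layers": 32, "num_attention_heads": 32,'
    ' "num_key_value_heads": 8, "hidden_size": 4096}'
)
# Hypothetical MHA config: num_key_value_heads is absent.
mha_cfg = json.loads(
    '{"num_hidden_layers": 32, "num_attention_heads": 32, "hidden_size": 4096}'
)

print(kv_fields(gqa_cfg))  # (32, 8, 128)
print(kv_fields(mha_cfg))  # (32, 32, 128)
```

In practice the dict would come from `json.load` on a downloaded config.json; inline strings are used here only to keep the example self-contained.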

Why multiply KV by 2?

Keys and values are cached separately for every layer and every KV head: two tensors of shape [ctx × head_dim] each, per sequence. Architectures with MLA (DeepSeek) break this rule by caching one compressed latent; for those, our dedicated DeepSeek calculator applies.
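To see what the factor of 2 costs per token, the KV term can be evaluated with ctx = 1 and batch = 1. The sketch below uses a hypothetical helper, `kv_bytes_per_token`, and illustrative shapes (32 layers, head_dim 128, 2-byte KV entries) to compare a grouped-query layout with 8 KV heads against plain MHA with 32.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, kv_bytes=2):
    """Bytes of KV cache added per token, per sequence."""
    # 2 = one key vector plus one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * kv_bytes


gqa = kv_bytes_per_token(32, 8, 128)   # 8 KV heads (grouped-query attention)
mha = kv_bytes_per_token(32, 32, 128)  # 32 KV heads (multi-head attention)

print(gqa, mha, mha // gqa)  # 131072 524288 4
```

With these illustrative shapes every cached token costs 128 KiB under GQA and 512 KiB under MHA, which is why the KV-head count matters so much at long contexts.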

Does the 10% overhead cover activation buffers?

It covers framework workspace, CUDA context and fragmentation under paged allocators (vLLM-style). Naive HF-transformers generation without paged attention can need substantially more at long contexts — prefer a serving engine when the budget is tight.

How do I extend this for sliding-window models?

If the runtime implements a rolling-buffer cache, replace ctx with min(ctx, window) for the windowed layers (Mistral 7B v0.1 uses a 4096-token window; Gemma 2 applies its window on alternating layers only). Without that support, the full-ctx figure here is what you'll actually pay: the implementation determines the cost.
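One way to sketch that adjustment is to sum the KV term layer by layer, capping the cached length at the window where one applies. The helper `kv_cache_bytes`, its `layer_windows` argument, and the 32-layer shapes below are hypothetical and chosen only for illustration; the sketch also assumes the runtime really does keep a rolling buffer for windowed layers.

```python
def kv_cache_bytes(layer_windows, kv_heads, head_dim, ctx, batch, kv_bytes=2):
    """Sum KV cache bytes over layers.

    layer_windows holds one entry per layer: a window size in tokens,
    or None for a layer that attends over the full context.
    """
    total = 0
    for window in layer_windows:
        cached = ctx if window is None else min(ctx, window)
        total += 2 * kv_heads * head_dim * cached * batch * kv_bytes
    return total


ctx = 32768
full = kv_cache_bytes([None] * 32, 8, 128, ctx, 1)         # no windowing
windowed = kv_cache_bytes([4096] * 32, 8, 128, ctx, 1)     # every layer windowed
alternating = kv_cache_bytes([4096, None] * 16, 8, 128, ctx, 1)

print(full, windowed, alternating)
# prints: 4294967296 536870912 2415919104
```

In this illustration, windowing every layer cuts the cache eightfold at a 32768-token context, while alternating windowed and full layers saves a little under half.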
