Transformer Parameter Count Calculator

Count parameters of a decoder-only transformer from layers, hidden size, FFN width, vocab and GQA heads.

Inputs: Layers, Hidden size, FFN intermediate size (SwiGLU models: ~3.5×H; Llama 3 8B: 14336), Vocab size, Query heads, KV heads (GQA), Embedding tying

Outputs, in billions (B): Attention params, FFN params, Embedding params, Total

Defaults reproduce Llama 3 8B: the formula lands within ~1% of the official 8.03B. Swap in any config.json values to audit a model's advertised size.

Formula

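P ≈ L·(2H² + 2·H·kv·head_dim) [attention, GQA] + L·3·H·F [SwiGLU FFN] + (1 or 2)·V·H [embeddings] + 2LH [norms]

where L = layers, H = hidden size, F = FFN intermediate size, V = vocab size, kv = KV heads, and head_dim = H / query heads. The embedding factor is 1 with tying and 2 without.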
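The formula can be sketched in a few lines of Python. This is an illustrative reimplementation, not the page's own script; the function name `count_params` and its argument names are assumptions, and the example values are the Llama 3 8B settings from its public config.json.

```python
def count_params(layers, hidden, ffn, vocab, q_heads, kv_heads, tied=False):
    """Approximate parameter count of a decoder-only transformer (no biases)."""
    head_dim = hidden // q_heads  # assumes hidden is divisible by q_heads
    # Q and O projections are H x H; K and V shrink to H x (kv_heads * head_dim).
    attention = layers * (2 * hidden * hidden + 2 * hidden * kv_heads * head_dim)
    # SwiGLU uses three H x F matrices: gate, up and down.
    ffn_params = layers * 3 * hidden * ffn
    # One V x H table when input and output embeddings are tied, otherwise two.
    embeddings = (1 if tied else 2) * vocab * hidden
    # Two norm weight vectors of length H per layer.
    norms = 2 * layers * hidden
    total = attention + ffn_params + embeddings + norms
    return {"attention": attention, "ffn": ffn_params,
            "embeddings": embeddings, "norms": norms, "total": total}


# Llama 3 8B settings (assumed to match the calculator defaults).
llama3_8b = count_params(layers=32, hidden=4096, ffn=14336,
                         vocab=128256, q_heads=32, kv_heads=8)
for name, value in llama3_8b.items():
    print(f"{name:>10}: {value / 1e9:.3f} B")
```

With these inputs the total comes to 8,030,257,152, which rounds to the advertised 8.03B.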

References: Vaswani et al. (2017), Attention Is All You Need; Llama 3 config.json (Hugging Face)

About Transformer Parameter Count Calculator

Model cards advertise parameter counts, but where do those billions actually live? This calculator rebuilds a decoder-only transformer's size from its architecture: attention projections (with grouped-query attention correctly shrinking K/V), the SwiGLU feed-forward block that typically holds two-thirds of all weights, embedding tables, and RMSNorm vectors. The defaults reproduce Llama 3 8B within one percent — change any field to see exactly which design choice costs parameters. It is the fastest way to sanity-check a paper's claimed size or to design your own training run's budget.

How to use Transformer Parameter Count Calculator

1. Enter your values into Transformer Parameter Count Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Transformer Parameter Count Calculator?

✓ Computes the transformer parameter count instantly in your browser, with no sign-up, upload, or server round-trip.
✓ 100% free and unlimited, with the exact formula shown on the page.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Where do most of a transformer's parameters live?

In the FFN. With SwiGLU at F≈3.5H, the feed-forward block is ~72% of non-embedding parameters under full multi-head attention and ~81% with Llama 3 8B's GQA setup; attention makes up the remaining ~19–28%. That is why MoE models replicate experts on the FFN: it is where parameter scaling pays most.
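The split depends on two ratios only: F/H and KV heads over query heads. A minimal sketch, working per layer in units of H² and ignoring norms and embeddings (the helper name `ffn_share` is an assumption for illustration):

```python
def ffn_share(ffn_over_h, kv_over_q):
    """FFN fraction of per-layer weights, in units of H^2 (norms ignored)."""
    # Q and O cost 1 each; K and V cost kv_over_q each.
    attention = 2 + 2 * kv_over_q
    # Three H x F matrices for SwiGLU.
    ffn = 3 * ffn_over_h
    return ffn / (attention + ffn)


print(f"F = 3.5H, full MHA : {ffn_share(3.5, 1.0):.1%}")
print(f"F = 3.5H, GQA 8/32 : {ffn_share(3.5, 8 / 32):.1%}")
```

Under these assumptions the FFN share is about 72% with full multi-head attention and about 81% with 8 KV heads for 32 query heads.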

Why does GQA barely change the total parameter count?

GQA shrinks only the K and V projections (from H×H to H×kv·head_dim each). For Llama 3 8B that saves ~0.4B parameters per projection, ~0.8B in total (about 9% of an equivalent full multi-head model). Its real benefit is the 4× smaller KV cache at inference, not the weight saving.
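The arithmetic behind this answer can be reproduced as follows. This is a sketch under the Llama 3 8B settings (32 layers, H = 4096, 32 query heads, 8 KV heads); the function name `gqa_saving` is hypothetical.

```python
def gqa_saving(layers, hidden, q_heads, kv_heads):
    """Weights removed from the K and V projections by grouped-query attention."""
    head_dim = hidden // q_heads
    mha_kv = 2 * layers * hidden * hidden               # K and V at full H x H
    gqa_kv = 2 * layers * hidden * kv_heads * head_dim  # K and V at H x (kv * head_dim)
    return mha_kv - gqa_kv


saved = gqa_saving(layers=32, hidden=4096, q_heads=32, kv_heads=8)
cache_ratio = 32 / 8  # KV cache size scales with the number of KV heads

print(f"weights saved: {saved / 1e9:.2f} B")
print(f"KV cache reduction: {cache_ratio:.0f}x")
```

The saving works out to roughly 0.4B for each of the K and V projections, about 0.8B combined, while the KV cache shrinks by the ratio of query heads to KV heads.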

What is embedding tying and who uses it?

Tying reuses the input embedding matrix as the output head, saving V×H parameters. Gemma and many small models tie (their huge vocabs would otherwise dominate); Llama 3 does not. The toggle shows the difference instantly.
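As a quick illustration of the V×H saving, the sketch below compares the Llama 3 8B sizes with a hypothetical small model that has a large vocabulary. The second configuration is invented for illustration and does not describe any specific released model.

```python
def tying_saving(vocab, hidden):
    """Parameters removed when the output head reuses the input embedding."""
    return vocab * hidden


llama_like = tying_saving(vocab=128256, hidden=4096)        # Llama 3 8B sizes
small_big_vocab = tying_saving(vocab=256000, hidden=2048)   # hypothetical small model

print(f"Llama 3 8B sizes     : {llama_like / 1e9:.3f} B per embedding table")
print(f"hypothetical small LM: {small_big_vocab / 1e9:.3f} B per embedding table")
```

Both tables hold roughly 0.52B weights, but in a model with far fewer layers and a narrower hidden size that table is a much larger fraction of the total, which is why tying matters most for small models.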

Why does my count differ slightly from the official number?

Small terms: biases (most modern LLMs have none), the final norm, rotary embeddings (no parameters), or rounding in reported figures. Within ~1–2% is expected; larger gaps usually mean a different FFN width or vocab than you assumed.
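One way to audit a model is to read the fields straight from its config.json and add the small terms back in. The sketch below assumes Hugging Face Llama-style key names and uses an inline JSON string with the Llama 3 8B values in place of a real file; `params_from_config` is a hypothetical helper.

```python
import json

# Inline stand-in for a config.json file (Llama-style key names assumed).
config_text = """{
  "num_hidden_layers": 32,
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "vocab_size": 128256,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "tie_word_embeddings": false
}"""


def params_from_config(cfg, include_final_norm=False):
    """Apply the page's formula to a parsed config dict (no biases)."""
    L, H, F, V = (cfg["num_hidden_layers"], cfg["hidden_size"],
                  cfg["intermediate_size"], cfg["vocab_size"])
    q = cfg["num_attention_heads"]
    kv = cfg.get("num_key_value_heads", q)  # fall back to full MHA if absent
    head_dim = H // q
    tables = 1 if cfg.get("tie_word_embeddings", False) else 2
    total = (L * (2 * H * H + 2 * H * kv * head_dim)  # attention
             + L * 3 * H * F                          # SwiGLU FFN
             + tables * V * H                         # embeddings
             + 2 * L * H)                             # per-layer norms
    if include_final_norm:
        total += H  # one extra norm vector after the last layer
    return total


cfg = json.loads(config_text)
base = params_from_config(cfg)
with_final_norm = params_from_config(cfg, include_final_norm=True)
print(f"formula only    : {base:,}")
print(f"plus final norm : {with_final_norm:,}")
```

The final norm adds only H parameters (4,096 here), so a gap larger than a rounding error points to a mismatched input rather than to the omitted terms.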
