Transformer Parameter Count Calculator
Count parameters of a decoder-only transformer from layers, hidden size, FFN width, vocab and GQA heads.
Defaults reproduce Llama 3 8B: the formula lands within ~1% of the official 8.03B. Swap in any config.json values to audit a model's advertised size.
Formula
About Transformer Parameter Count Calculator
Model cards advertise parameter counts, but where do those billions actually live? This calculator rebuilds a decoder-only transformer's size from its architecture: attention projections (with grouped-query attention correctly shrinking K/V), the SwiGLU feed-forward block that typically holds about two-thirds of all weights, embedding tables, and RMSNorm vectors. The defaults reproduce Llama 3 8B within one percent; change any field to see exactly which design choice costs parameters. It is the fastest way to sanity-check a paper's claimed size or to budget your own training run.
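The breakdown described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the page's own code: the function name `count_params` and its argument names are hypothetical, and it assumes no biases, SwiGLU with three FFN matrices, two RMSNorm vectors per layer plus one final norm, and untied embeddings by default. The default values are the Llama 3 8B settings.

```python
def count_params(layers=32, hidden=4096, ffn=14336, vocab=128256,
                 n_heads=32, n_kv_heads=8, tie_embeddings=False):
    """Break down decoder-only transformer weights (assumes no biases)."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim            # GQA: K/V are narrower than Q/O

    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q, O plus K, V
    feed_forward = 3 * hidden * ffn           # SwiGLU: gate, up, down
    layer_norms = 2 * hidden                  # two RMSNorm vectors per layer

    # Untied models store the input table and the output head separately.
    embeddings = vocab * hidden * (1 if tie_embeddings else 2)

    breakdown = {
        "attention": layers * attention,
        "ffn": layers * feed_forward,
        "norms": layers * layer_norms + hidden,   # + final RMSNorm
        "embeddings": embeddings,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    parts = count_params()
    for name, value in parts.items():
        print(f"{name:>10}: {value:>13,}  ({value / parts['total']:.1%})")
```

Under these assumptions the defaults sum to 8,030,261,248 parameters, with the FFN holding roughly 70% of the total.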
How to use Transformer Parameter Count Calculator
- 1. Enter your values into Transformer Parameter Count Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
- 2. The result recomputes live using the formula shown on the page; there is no button to press.
- 3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use Transformer Parameter Count Calculator?
- Computes the transformer parameter count instantly in your browser: no sign-up, no upload, no server round-trip.
- 100% free and unlimited, with the exact formula shown on the page.
- Runs entirely client-side, so every value you enter stays private on your device.
- Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Where do most of a transformer's parameters live?+
In the FFN. With SwiGLU at F≈3.5H, the feed-forward block is roughly 70–80% of non-embedding parameters (about 72% with full multi-head attention, about 81% with Llama 3 8B's GQA); attention is the remaining 20–30%. That is why MoE models replicate experts on the FFN: it is where parameter scaling pays most.
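The split between FFN and attention can be checked per layer with a short sketch. The helper name `layer_shares` is hypothetical; it assumes no biases and ignores the small RMSNorm vectors, so the two shares sum to one.

```python
def layer_shares(hidden=4096, ffn=14336, n_heads=32, n_kv_heads=8):
    """Return (ffn_share, attention_share) of one layer's matrix weights."""
    head_dim = hidden // n_heads
    attention = 2 * hidden * hidden + 2 * hidden * n_kv_heads * head_dim
    feed_forward = 3 * hidden * ffn           # SwiGLU: gate, up, down
    total = attention + feed_forward
    return feed_forward / total, attention / total


gqa_ffn, gqa_attn = layer_shares()                 # Llama 3 8B style GQA
mha_ffn, mha_attn = layer_shares(n_kv_heads=32)    # full multi-head attention

print(f"GQA: FFN {gqa_ffn:.1%}, attention {gqa_attn:.1%}")
print(f"MHA: FFN {mha_ffn:.1%}, attention {mha_attn:.1%}")
```

With F = 3.5H the FFN share is 10.5/14.5 of the layer under full multi-head attention and 10.5/13 when K and V use a quarter of the heads, so GQA pushes the FFN share up rather than down.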
Why does GQA barely change the total parameter count?+
GQA shrinks only the K and V projections (from H×H to H×kv·head_dim each). For Llama 3 8B that saves about 0.4B parameters per projection, roughly 0.8B for K and V together (about 9% of the equivalent full multi-head model); its real benefit is the 4× smaller KV cache at inference, not the weight saving.
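A minimal sketch of that comparison, assuming Llama 3 8B's shape (32 layers, hidden size 4096, 32 query heads, 8 KV heads) and no biases. The helper name `gqa_savings` is hypothetical.

```python
def gqa_savings(layers=32, hidden=4096, n_heads=32, n_kv_heads=8):
    """Weights saved by GQA versus full multi-head attention, plus KV-cache ratio."""
    head_dim = hidden // n_heads
    mha_kv = 2 * hidden * hidden                   # K and V at full width
    gqa_kv = 2 * hidden * n_kv_heads * head_dim    # K and V at reduced width
    saved = layers * (mha_kv - gqa_kv)
    # The KV cache stores one K and one V vector per KV head per token,
    # so its size scales with the number of KV heads.
    cache_ratio = n_heads / n_kv_heads
    return saved, cache_ratio


saved, cache_ratio = gqa_savings()
print(f"weights saved: {saved:,}")          # K and V together
print(f"KV cache is {cache_ratio:.0f}x smaller")
```

The weight saving is a one-time reduction, while the cache ratio applies to every token of every sequence held in memory during inference.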
What is embedding tying and who uses it?+
Tying reuses the input embedding matrix as the output head, saving V×H parameters. Gemma and many small models tie (their huge vocabs would otherwise dominate); Llama 3 does not. The toggle shows the difference instantly.
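The size of the saving is a single product, sketched below with a hypothetical helper `tying_savings`. The example values are Llama 3 8B's vocabulary and hidden size, used only to show the scale of one V×H table; as noted above, that model does not actually tie.

```python
def tying_savings(vocab, hidden):
    """Parameters saved when the output head reuses the input embedding table."""
    return vocab * hidden


saved = tying_savings(vocab=128256, hidden=4096)
print(f"one V x H table: {saved:,} parameters")
```

Because the saving depends only on V and H, it weighs far more heavily on a small model with a large vocabulary than on a deep one.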
Why does my count differ slightly from the official number?+
Small terms: biases (most modern LLMs have none), the final norm, rotary embeddings (no parameters), or rounding in reported figures. Within ~1–2% is expected; larger gaps usually mean a different FFN width or vocab than you assumed.
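One way to run that check is to compute the count from a model's config.json and compare it with the advertised figure. The sketch below assumes the Hugging Face Llama-style key names (`hidden_size`, `intermediate_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `vocab_size`, `tie_word_embeddings`), no biases, and a SwiGLU FFN. The helper names and the inline JSON are illustrative, with values matching Llama 3 8B.

```python
import json

# Inline stand-in for a config.json file (Llama 3 8B values).
CONFIG_JSON = """
{
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "vocab_size": 128256,
  "tie_word_embeddings": false
}
"""


def params_from_config(cfg):
    """Parameter count from a Llama-style config dict (assumes no biases)."""
    h = cfg["hidden_size"]
    n_heads = cfg["num_attention_heads"]
    kv_dim = cfg.get("num_key_value_heads", n_heads) * (h // n_heads)
    per_layer = (2 * h * h + 2 * h * kv_dim      # attention: Q, O, K, V
                 + 3 * h * cfg["intermediate_size"]  # SwiGLU FFN
                 + 2 * h)                        # two RMSNorm vectors
    tables = 1 if cfg.get("tie_word_embeddings", False) else 2
    return cfg["num_hidden_layers"] * per_layer + tables * cfg["vocab_size"] * h + h


def relative_gap(cfg, advertised):
    """Signed relative difference between computed and advertised size."""
    return (params_from_config(cfg) - advertised) / advertised


cfg = json.loads(CONFIG_JSON)
print(f"computed: {params_from_config(cfg):,}")
print(f"gap vs 8.03B: {relative_gap(cfg, 8.03e9):+.4%}")
```

A gap well inside 1–2% points to rounding in the reported figure; a larger one is a prompt to recheck the FFN width, vocabulary size, or tying flag.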
Related tools
- Feed-Forward (FFN/MLP) Parameter Calculator
- Embedding Parameter & Memory Calculator
- LLM FLOPs-per-Token Calculator
- Chinchilla Training FLOPs Calculator
- LLM Training GPU-Hours & Cost Calculator
- MFU (Model FLOPs Utilization) Calculator
- Manufacturing Defect Detection - Confusion Matrix & Metrics Calculator
- F-beta Score Calculator