Transformer Parameter Count Calculator

Count parameters of a decoder-only transformer from layers, hidden size, FFN width, vocab and GQA heads.

Outputs (in billions): attention params, FFN params, embedding params, total.

Defaults reproduce Llama 3 8B: the formula lands within ~1% of the official 8.03B. Swap in any config.json values to audit a model's advertised size.

Formula

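P ≈ L·(2H² + 2·H·kv·head_dim) [attention, GQA] + L·3·H·F [SwiGLU FFN] + (1 or 2)·V·H [embeddings] + 2LH [norms]
where L = layers, H = hidden size, F = FFN width, V = vocab size, kv = number of K/V heads, head_dim = width of one attention head.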
References: Vaswani et al. (2017), Attention Is All You Need; Llama 3 config.json (Hugging Face)
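The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name `estimate_params` and its argument names are assumptions chosen for readability, and the example values are the Llama 3 8B shape (32 layers, H = 4096, F = 14336, V = 128256, 8 K/V heads of width 128).

```python
def estimate_params(layers, hidden, ffn, vocab, kv_heads, head_dim, tied=False):
    """Approximate parameter count of a decoder-only transformer (hypothetical helper)."""
    kv_width = kv_heads * head_dim
    # Q and O projections are H x H; K and V shrink to H x (kv_heads * head_dim) under GQA.
    attention = layers * (2 * hidden * hidden + 2 * hidden * kv_width)
    # SwiGLU uses three H x F matrices: gate, up and down.
    ffn_params = layers * 3 * hidden * ffn
    # One V x H table when input and output embeddings are tied, otherwise two.
    embeddings = (1 if tied else 2) * vocab * hidden
    # Two RMSNorm weight vectors of length H per layer.
    norms = 2 * layers * hidden
    return {
        "attention": attention,
        "ffn": ffn_params,
        "embeddings": embeddings,
        "norms": norms,
        "total": attention + ffn_params + embeddings + norms,
    }


llama3_8b = estimate_params(
    layers=32, hidden=4096, ffn=14336, vocab=128256, kv_heads=8, head_dim=128
)
for name, count in llama3_8b.items():
    print(f"{name:>10}: {count / 1e9:.3f} B")
```

Returning the four terms separately, rather than only the sum, mirrors the calculator's four outputs and makes it easy to see which term moves when an input changes.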

About Transformer Parameter Count Calculator

Model cards advertise parameter counts, but where do those billions actually live? This calculator rebuilds a decoder-only transformer's size from its architecture: attention projections (with grouped-query attention correctly shrinking K/V), the SwiGLU feed-forward block that typically holds about two-thirds of all weights, embedding tables, and RMSNorm vectors. The defaults reproduce Llama 3 8B within one percent; change any field to see exactly which design choice costs parameters. It is a quick way to sanity-check a paper's claimed size or to budget parameters for your own training run.
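For auditing a published model, the same arithmetic can be driven straight from a config.json. The sketch below assumes the Hugging Face Llama-style key names (`hidden_size`, `intermediate_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `vocab_size`, `tie_word_embeddings`); other model families may name these fields differently. The embedded JSON is an abridged, illustrative config carrying the Llama 3 8B values, and `params_from_config` is a hypothetical helper name.

```python
import json

# Abridged, illustrative config in the Llama-style layout (Llama 3 8B values).
CONFIG_JSON = """
{
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "vocab_size": 128256,
  "tie_word_embeddings": false
}
"""


def params_from_config(cfg):
    """Map config.json fields onto the page's formula (hypothetical helper)."""
    hidden = cfg["hidden_size"]
    layers = cfg["num_hidden_layers"]
    heads = cfg["num_attention_heads"]
    # Fall back to full multi-head attention when no GQA field is present.
    kv_heads = cfg.get("num_key_value_heads") or heads
    # Assumption: heads * head_dim == hidden, which the 2H^2 term for Q and O relies on.
    head_dim = hidden // heads
    copies = 1 if cfg.get("tie_word_embeddings", False) else 2

    attention = layers * (2 * hidden * hidden + 2 * hidden * kv_heads * head_dim)
    ffn = layers * 3 * hidden * cfg["intermediate_size"]
    embeddings = copies * cfg["vocab_size"] * hidden
    norms = 2 * layers * hidden
    return attention + ffn + embeddings + norms


total = params_from_config(json.loads(CONFIG_JSON))
print(f"{total / 1e9:.2f} B")  # 8.03 B
```

Models whose head width is set independently of the hidden size would need the Q and O terms adjusted; this sketch does not cover that case.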

How to use Transformer Parameter Count Calculator

  1. Enter your values into Transformer Parameter Count Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Transformer Parameter Count Calculator?

  • Computes the transformer parameter count instantly in your browser: no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown on the page.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Recomputes live as you type, with a worked example and references to check the result against.

Frequently asked questions

Where do most of a transformer's parameters live?

In the FFN. With SwiGLU at F≈3.5H, the feed-forward block is about 72% of non-embedding parameters under full multi-head attention and about 81% with Llama 3 8B's 4:1 GQA; attention makes up the remaining 19–28%. That is why MoE models replicate experts on the FFN: it is where parameter scaling pays most.
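The FFN share depends on two ratios only: the FFN width relative to the hidden size, and how far GQA narrows K and V. A small sketch working in units of H² makes that explicit; `ffn_share` and its parameter names are illustrative assumptions, and embeddings and norm vectors are deliberately left out.

```python
def ffn_share(ffn_ratio, kv_fraction=1.0):
    """FFN share of per-layer weights, ignoring embeddings and norms.

    ffn_ratio   = F / H
    kv_fraction = (kv_heads * head_dim) / H, i.e. 1.0 for full multi-head attention
    """
    attention = 2 + 2 * kv_fraction  # Q and O, plus K and V, in units of H^2
    ffn = 3 * ffn_ratio              # gate, up and down matrices, in units of H^2
    return ffn / (attention + ffn)


scenarios = [
    ("F = 8H/3, MHA", 8 / 3, 1.0),
    ("F = 3.5H, MHA", 3.5, 1.0),
    ("F = 3.5H, 4:1 GQA", 3.5, 0.25),
]
for label, ratio, kv in scenarios:
    print(f"{label:>18}: {ffn_share(ratio, kv):.1%}")
```

Because the hidden size cancels out, the same percentages hold at any model scale with the same two ratios.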

Why does GQA barely change the total parameter count?

GQA shrinks only the K and V projections (from H×H to H×kv·head_dim each). For Llama 3 8B that saves ~0.8B parameters (about 10% of the total); its real benefit is the 4× smaller KV cache at inference, not the weight saving.
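Both effects can be checked with a short sketch. The helper names are hypothetical, the shape is Llama 3 8B (32 layers, 32 query heads, 8 K/V heads, head width 128), and the cache figure assumes 2 bytes per stored value (fp16 or bf16), which is an assumption rather than something the page specifies.

```python
def kv_projection_params(layers, hidden, kv_heads, head_dim):
    """Weights in the K and V projections across all layers."""
    return layers * 2 * hidden * kv_heads * head_dim


def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """One K and one V vector per layer per token; 2 bytes assumes fp16/bf16."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


LAYERS, HIDDEN, HEADS, KV_HEADS, HEAD_DIM = 32, 4096, 32, 8, 128

# Weight saving: compare full multi-head K/V (32 heads) with GQA (8 heads).
saved = (kv_projection_params(LAYERS, HIDDEN, HEADS, HEAD_DIM)
         - kv_projection_params(LAYERS, HIDDEN, KV_HEADS, HEAD_DIM))

# Cache saving: the cache scales with the number of K/V heads, not query heads.
cache_mha = kv_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM)
cache_gqa = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)

print(f"weights saved by GQA: {saved / 1e9:.2f} B")
print(f"KV cache per token: {cache_mha // 1024} KiB -> {cache_gqa // 1024} KiB")
```

The weight saving is paid once, while the cache saving applies to every token in every sequence being served.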

What is embedding tying and who uses it?

Tying reuses the input embedding matrix as the output head, saving V×H parameters. Gemma and many small models tie (their huge vocabs would otherwise dominate); Llama 3 does not. The toggle shows the difference instantly.
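How much tying matters depends on how large V×H is next to the rest of the model. The sketch below compares the Llama 3 8B shape with a purely hypothetical small model (256,000-token vocabulary, H = 2048, 2.0B non-embedding parameters); the second set of numbers is invented for illustration and does not describe any specific released model. Function names are assumptions as well.

```python
def embedding_params(vocab, hidden, tied):
    """Input table plus output head; tying shares a single V x H matrix."""
    return (1 if tied else 2) * vocab * hidden


def tying_saving_share(vocab, hidden, other_params):
    """Fraction of the untied total that tying would remove."""
    untied_total = other_params + embedding_params(vocab, hidden, tied=False)
    return vocab * hidden / untied_total


# Llama 3 8B shape: attention + FFN + per-layer norms from the page's formula.
llama_other = 1_342_177_280 + 5_637_144_576 + 262_144
llama_share = tying_saving_share(128256, 4096, llama_other)

# Hypothetical small model with a large vocabulary (illustrative numbers only).
small_share = tying_saving_share(256_000, 2048, 2_000_000_000)

print(f"large model, tying would save: {llama_share:.1%}")
print(f"small model, tying would save: {small_share:.1%}")
```

The smaller the model and the larger the vocabulary, the bigger the share of the budget that tying recovers.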

Why does my count differ slightly from the official number?

Small terms: biases (most modern LLMs have none), the final norm, rotary embeddings (no parameters), or rounding in reported figures. Within ~1–2% is expected; larger gaps usually mean a different FFN width or vocab than you assumed.
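To see how small these leftover terms are, the sketch below evaluates the page's formula for the Llama 3 8B shape, adds the one final RMSNorm vector of length H, and measures the gap against the rounded 8.03B figure quoted above. The helper name `relative_gap` is an assumption, and the comparison target is the rounded number from this page, not an exact published count.

```python
def relative_gap(estimate, reported):
    """Absolute difference as a fraction of the reported figure."""
    return abs(estimate - reported) / reported


LAYERS, HIDDEN, FFN, VOCAB = 32, 4096, 14336, 128256
KV_WIDTH = 8 * 128  # kv heads * head_dim

formula_total = (
    LAYERS * (2 * HIDDEN**2 + 2 * HIDDEN * KV_WIDTH)  # attention with GQA
    + LAYERS * 3 * HIDDEN * FFN                       # SwiGLU FFN
    + 2 * VOCAB * HIDDEN                              # untied embeddings
    + 2 * LAYERS * HIDDEN                             # per-layer norms
)
# The formula omits the single norm applied after the last layer.
with_final_norm = formula_total + HIDDEN

reported = 8.03e9  # rounded figure quoted on this page
print(f"formula total:   {formula_total:,}")
print(f"with final norm: {with_final_norm:,}")
print(f"gap vs 8.03B:    {relative_gap(with_final_norm, reported):.4%}")
```

If the gap for another model comes out at several percent, the inputs (FFN width, vocab size, tying) are a more likely culprit than these small terms.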
