Attention Layer Parameter Calculator

Parameters of one multi-head / GQA / MQA attention block — Wq, Wk, Wv, Wo broken out.

Wq + Wo (M)
Wk + Wv (M)
Total per layer (M)
Saved vs MHA (%)

Defaults match one Llama 3 70B layer (H=8192, 64 heads, 8 KV heads). Qwen2 models set biases on Q/K/V — flip the toggle to see the (tiny) difference.

Formula

Wq = H×H, Wo = H×H, Wk = Wv = H × (kv_heads × head_dim); head_dim = H / query_heads
References: Ainslie et al. (2023), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints; Shazeer (2019), Fast Transformer Decoding: One Write-Head is All You Need (MQA)
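
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name `attention_params`, its argument names, and the returned keys are all hypothetical, and the optional bias term assumes one bias value per output feature on Q, K and V only.

```python
def attention_params(hidden, query_heads, kv_heads, qkv_bias=False):
    """Count the weights in one attention block (hypothetical helper)."""
    if hidden % query_heads != 0:
        raise ValueError("hidden must be divisible by query_heads")
    if query_heads % kv_heads != 0:
        raise ValueError("query_heads must be divisible by kv_heads")

    head_dim = hidden // query_heads      # head_dim = H / query_heads
    kv_dim = kv_heads * head_dim          # output width of Wk and Wv

    wq = hidden * hidden                  # Wq = H x H
    wo = hidden * hidden                  # Wo = H x H
    wk = hidden * kv_dim                  # Wk = H x (kv_heads x head_dim)
    wv = hidden * kv_dim                  # Wv = H x (kv_heads x head_dim)

    # Assumption: biases (if enabled) sit on Q, K, V only, one per output feature.
    bias = (hidden + 2 * kv_dim) if qkv_bias else 0

    total = wq + wk + wv + wo + bias
    # Full MHA baseline: kv_heads == query_heads, so all four matrices are H x H.
    mha_total = 4 * hidden * hidden + (3 * hidden if qkv_bias else 0)

    return {
        "wq_wo": wq + wo,
        "wk_wv": wk + wv,
        "bias": bias,
        "total": total,
        "saved_vs_mha_pct": 100.0 * (1 - total / mha_total),
    }


# Page defaults: H=8192, 64 query heads, 8 KV heads.
result = attention_params(8192, 64, 8)
for key, value in result.items():
    print(key, value)
```

Setting kv_heads equal to query_heads reproduces plain MHA (0% saved), and kv_heads=1 gives MQA.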

About Attention Layer Parameter Calculator

The choice between multi-head, grouped-query and multi-query attention shows up in three places: parameter count, KV-cache size and quality. This calculator isolates the first: it breaks one attention block into its four projection matrices and shows exactly how many weights GQA or MQA shave off versus full MHA at your hidden size. Defaults model a Llama 3 70B layer. Use it together with our KV-cache calculators to see the full picture: the cache saving (4–64×) is far more dramatic than the weight saving, which stays below 50% of the attention block (about 44% at the defaults).
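
To put the cache and weight savings side by side, here is a rough sketch. It assumes the KV cache scales linearly with the number of KV heads (so the reduction factor is query_heads / kv_heads) and uses the page's formula for weights, ignoring biases. The function name `gqa_tradeoff` is made up for illustration.

```python
def gqa_tradeoff(query_heads, kv_heads):
    """Return (kv_cache_reduction_factor, attention_weight_saving_pct) vs full MHA."""
    g = kv_heads / query_heads            # fraction of K/V width that is kept
    cache_factor = query_heads / kv_heads # cache shrinks in proportion to KV heads
    # Block weights are 2H^2 + 2gH^2 against 4H^2 for MHA, so the saving is (1 - g) / 2.
    weight_saving_pct = 100.0 * (1 - g) / 2
    return cache_factor, weight_saving_pct


for kv in (64, 16, 8, 1):
    factor, saving = gqa_tradeoff(64, kv)
    print(f"kv_heads={kv:>2}: cache {factor:.0f}x smaller, "
          f"attention weights {saving:.1f}% smaller")
```

Because Wq and Wo never shrink, the weight saving approaches but never reaches 50%, while the cache reduction keeps growing as KV heads are removed.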

How to use Attention Layer Parameter Calculator

  1. Enter your values into the Attention Layer Parameter Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Attention Layer Parameter Calculator?

  • Computes attention layer parameter counts instantly in your browser, with no sign-up, no upload and no server round-trip.
  • 100% free and unlimited, with the exact formula shown: Wq = H×H, Wo = H×H, Wk = Wv = H × (kv_heads × head_dim); head_dim = H / query_heads.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

How much does GQA save in parameters versus MHA?

With 8 KV heads on 64 query heads, Wk and Wv each shrink 8×, cutting the block from 4H² to 2.25H². That is an 87.5% reduction of the K/V weights but only about 44% of the whole attention block, since Wq and Wo stay full size.
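
Substituting 8 KV heads and 64 query heads into the page's formula gives the following worked arithmetic (biases ignored):

```latex
\begin{aligned}
W_q + W_o &= H^2 + H^2 = 2H^2 \\
W_k + W_v &= 2H \cdot \left(8 \cdot \tfrac{H}{64}\right) = 0.25\,H^2 \\
\text{total (GQA)} &= 2H^2 + 0.25\,H^2 = 2.25\,H^2 \\
\text{K/V reduction} &= \frac{2H^2 - 0.25\,H^2}{2H^2} = 87.5\% \\
\text{saved vs MHA} &= \frac{4H^2 - 2.25\,H^2}{4H^2} = 43.75\%
\end{aligned}
```

At H = 8192 this is about 151M weights per block instead of about 268M.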

Why keep Wo full-sized in MQA/GQA?

The output projection mixes the concatenated head outputs back into the residual stream; its dimensions depend on query heads, not KV heads. Shrinking it would reduce the model's mixing capacity rather than the cache, so all schemes leave it at H×H.

Does MQA hurt model quality?

Pure MQA (1 KV head) measurably degrades quality on larger models, which is why GQA with 4–8 KV heads became the standard compromise: near-MQA cache economics with near-MHA quality, as the GQA paper (Ainslie et al., 2023) demonstrated on T5 models.

Which models use attention biases?

Most modern LLMs (Llama, Mistral, Gemma) drop all biases. Qwen 1.5/2 keep biases on the Q, K and V projections, claiming better length extrapolation. The toggle shows that biases add only on the order of 0.01% of the attention block's parameters (about 0.007% at the defaults), so their effect is empirical, not budgetary.
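
The bias overhead can be estimated with a short sketch. It assumes one bias value per output feature on the Q, K and V projections and none on Wo. The function name `qkv_bias_share` is hypothetical, and the second configuration (H=3584, 28 heads, 4 KV heads) is an assumed Qwen2-7B-like shape used only for illustration.

```python
def qkv_bias_share(hidden, query_heads, kv_heads):
    """Return (bias_count, bias_percent_of_block) for Q/K/V biases (hypothetical helper)."""
    head_dim = hidden // query_heads
    kv_dim = kv_heads * head_dim
    # Weights per the page formula: Wq + Wo = 2H^2, Wk + Wv = 2 * H * kv_dim.
    weights = 2 * hidden * hidden + 2 * hidden * kv_dim
    # Assumption: a bias vector is as long as its projection's output width.
    biases = hidden + 2 * kv_dim
    return biases, 100.0 * biases / (weights + biases)


for name, cfg in {
    "page defaults": (8192, 64, 8),
    "assumed Qwen2-7B-like shape": (3584, 28, 4),
}.items():
    count, pct = qkv_bias_share(*cfg)
    print(f"{name}: {count} bias terms, {pct:.4f}% of the block")
```

Bias terms grow linearly with H while the weights grow quadratically, so the share shrinks further as the hidden size increases.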
