ToolJoltTools

Feed-Forward (FFN/MLP) Parameter Calculator

Parameters and memory of a transformer FFN block — classic GELU vs gated SwiGLU, any width multiple.

Per layer (M)
All layers (B)
Effective width ratio F/H

Defaults are Llama 3 8B's FFN (H=4096, F=14336, SwiGLU): 176M per layer — 5.6B of the model's 8B parameters live here.

Formula

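classic: 2·H·F (W_up, W_down)
SwiGLU: 3·H·F (W_gate, W_up, W_down)
H is the model (hidden) width and F is the FFN intermediate width; bias terms, if any, are not counted. Llama-style models choose F ≈ ⅔·4·H so the three gated matrices cost about the same as the classic two at F = 4H.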
References: Shazeer (2020), GLU Variants Improve Transformer; Touvron et al. (2023), LLaMA: Open and Efficient Foundation Models
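
As a rough sketch, the formula above can be written in a few lines of Python. The function names and the BF16 assumption (2 bytes per parameter) are illustrative and not part of this page; the layer count of 32 is inferred from the defaults (5.6B total at 176M per layer).

```python
def ffn_params(hidden: int, intermediate: int, gated: bool = False) -> int:
    """Weight count of one FFN block, ignoring biases.

    classic: W_up (H x F) + W_down (F x H)                 -> 2*H*F
    SwiGLU:  W_gate (H x F) + W_up (H x F) + W_down (F x H) -> 3*H*F
    """
    matrices = 3 if gated else 2
    return matrices * hidden * intermediate


def ffn_summary(hidden, intermediate, layers, gated=False, bytes_per_param=2):
    """Mirror the calculator's three outputs, plus weight memory.

    bytes_per_param=2 assumes BF16/FP16 weights (an assumption of this sketch).
    """
    per_layer = ffn_params(hidden, intermediate, gated)
    total = per_layer * layers
    return {
        "per_layer_M": per_layer / 1e6,
        "all_layers_B": total / 1e9,
        "width_ratio": intermediate / hidden,
        "weight_GB": total * bytes_per_param / 1e9,
    }


# Page defaults: H=4096, F=14336, SwiGLU; 32 layers is inferred, see above.
llama3_8b = ffn_summary(4096, 14336, 32, gated=True)
print(llama3_8b)  # per layer about 176.2M, all layers about 5.64B, ratio 3.5
```

At 2 bytes per parameter the same numbers imply roughly 11.3 GB of FFN weights for these defaults.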

About Feed-Forward (FFN/MLP) Parameter Calculator

The feed-forward block is the quiet giant of every transformer: in Llama-class models roughly two-thirds to four-fifths of all non-embedding parameters sit in these three matrices. This calculator compares the classic two-matrix GELU design against the gated SwiGLU used by Llama, Mistral and Qwen (Gemma uses the closely related GeGLU, which has the same three-matrix parameter count), and totals the cost across layers. The famous '⅔ rule' appears naturally: SwiGLU papers set F to two-thirds of the classic 4H so the three matrices cost the same as the old two. To verify, compare SwiGLU at H=4096, F=10923 (about 134.2M per layer) with classic at H=4096, F=16384 (also about 134.2M).
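
The ⅔ rule can also be checked by hand. The snippet below is a minimal sketch (variable names are illustrative) comparing a classic FFN at F = 4H against a SwiGLU FFN at F = ⅔·4H for H = 4096.

```python
H = 4096

classic_F = 4 * H                    # 16384, the classic convention
swiglu_F = round(2 / 3 * 4 * H)      # 10923, two-thirds of the classic width

classic_params = 2 * H * classic_F   # W_up, W_down
swiglu_params = 3 * H * swiglu_F     # W_gate, W_up, W_down

print(classic_params)                  # 134217728
print(swiglu_params)                   # 134221824
print(swiglu_params / classic_params)  # just above 1.0 (rounding F up adds 4096 weights)
```

The two counts differ only because 10922.67 has to be rounded to a whole number of columns.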

How to use Feed-Forward (FFN/MLP) Parameter Calculator

  1. Enter your values into Feed-Forward (FFN/MLP) Parameter Calculator — sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Feed-Forward (FFN/MLP) Parameter Calculator?

  • Computes feed-forward (FFN/MLP) parameter counts instantly in your browser, with no sign-up, no upload and no server round-trip.
  • 100% free and unlimited, with the exact formula shown: classic 2·H·F, SwiGLU 3·H·F.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why does SwiGLU have three matrices instead of two?

It splits the up-projection into a 'gate' and an 'up' path, multiplying them elementwise (SiLU(gate)·up) before projecting down. The gating gives a data-dependent filter that consistently improves quality at equal parameter cost — Shazeer's 'divine benevolence' result.
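
The gating described above can be sketched in plain Python for a single token. This is an illustrative toy (list-based matrices, hypothetical names `matvec` and `swiglu_ffn`, no biases), not how any particular library implements it.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( SiLU(gate(x)) * up(x) ) for one token vector x of size H."""
    gate = matvec(w_gate, x)                          # size F
    up = matvec(w_up, x)                              # size F
    hidden = [silu(g) * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)                     # back to size H


# Toy shapes: H=2, F=3, so w_gate and w_up are 3x2 and w_down is 2x3.
x = [1.0, 2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

y = swiglu_ffn(x, w_gate, w_up, w_down)
n_weights = sum(len(row) for m in (w_gate, w_up, w_down) for row in m)
print(len(y), n_weights)  # 2 18, i.e. output size H and 3*H*F weights
```

Setting `w_gate` to all zeros makes every output zero, which shows the gate acting as a data-dependent filter on the up path.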

What intermediate size should I choose when designing a model?

Convention: classic FFN uses F=4H; SwiGLU uses F≈(2/3)·4H≈2.67H, rounded to hardware-friendly multiples. Llama 3 8B goes wider, at 14336 = 3.5H. FFN parameter count grows linearly with F; most labs scale H and L (the layer count) first and keep F/H fixed.
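
One way to turn this convention into a number is sketched below. The helper name `swiglu_width` and the choice of rounding up to a multiple of 256 are assumptions of this example; other multiples are equally plausible depending on hardware.

```python
def swiglu_width(hidden: int, multiple_of: int = 256) -> int:
    """Pick F near (2/3)*4H, rounded up to a multiple of `multiple_of`."""
    target = (2 * 4 * hidden) // 3  # floor of (2/3) * 4H
    # Round up to the next multiple (no change if already a multiple).
    return multiple_of * ((target + multiple_of - 1) // multiple_of)


F = swiglu_width(4096)
print(F, F / 4096)  # 11008 2.6875
```

For H = 4096 this gives 11008, the intermediate size used by the 7B LLaMA models, which is slightly above the unrounded 10922.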

How much inference memory does the FFN need beyond weights?

Activations: each token materializes gate and up vectors of size F (28 KB each at F=14336 in BF16, so 56 KB for the pair). With batching this is small next to the KV cache, but during training, where activations are kept for the backward pass in every layer, they dominate; see our activation-memory calculator.
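
The arithmetic behind these figures is sketched below. The function name and the 4096-token example are illustrative assumptions; the count covers only the gate and up vectors of a single layer, not their elementwise product or any framework overhead.

```python
def ffn_activation_bytes(intermediate, tokens=1, bytes_per_value=2, vectors=2):
    """Bytes for the gate/up intermediates of one FFN layer.

    bytes_per_value=2 corresponds to BF16; vectors=2 counts gate and up.
    """
    return vectors * intermediate * tokens * bytes_per_value


one_vector = ffn_activation_bytes(14336, vectors=1)  # a single size-F vector
pair = ffn_activation_bytes(14336)                   # gate + up for one token
batch = ffn_activation_bytes(14336, tokens=4096)     # hypothetical 4096 tokens

print(one_vector / 1024)   # 28.0  (KB)
print(pair / 1024)         # 56.0  (KB)
print(batch / 1024 ** 2)   # 224.0 (MB)
```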

Is the FFN why MoE models are huge?

Yes. MoE replaces each FFN with N expert copies and routes each token to only a few of them. In Mixtral 8x7B, attention stays dense (shared), but 8 FFN experts × 32 layers push the total from about 7B to 46.7B. The FFN is where parameters scale cheapest, because only the routed experts run for each token.
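
A back-of-the-envelope version of the Mixtral figure is sketched below. The 8 experts and 32 layers come from the answer above; H=4096, F=14336 and 2 active experts per token are assumptions of this sketch about Mixtral's configuration, and router weights are ignored.

```python
def moe_ffn_params(hidden, intermediate, layers, experts, active_experts):
    """Total and per-token-active SwiGLU FFN weights in an MoE stack."""
    per_expert = 3 * hidden * intermediate  # W_gate, W_up, W_down
    total = per_expert * experts * layers
    active = per_expert * active_experts * layers
    return total, active


# Assumed Mixtral-like shape: H=4096, F=14336, 2 of 8 experts per token.
total, active = moe_ffn_params(4096, 14336, layers=32, experts=8, active_experts=2)

print(total / 1e9)   # about 45.1 (B) of FFN expert weights
print(active / 1e9)  # about 11.3 (B) touched per token
```

Under these assumptions the experts alone account for about 45.1B of the 46.7B total; the remainder is attention, embeddings and routers.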
