
Mixed Precision Training Memory Calculator

FP32 vs FP16/BF16-mixed vs pure-BF16 vs FP8 training memory side by side — where the savings actually are.

Outputs per regime: Weights (GB) · Gradients (GB) · Optimizer (Adam) (GB) · Total states (GB)

The counter-intuitive truth this tool exists to show: BF16 'mixed precision' does NOT shrink persistent training state (16 B/param either way — the fp32 master and moments dominate). Its wins are halved ACTIVATIONS and 2–8× faster tensor cores.

Formula

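Bytes per parameter (weights + gradients + optimizer state):

  • FP32: 4 + 4 + 8 = 16
  • BF16-mixed: 2 + 2 + 12 = 16 (!), where the 12 is the fp32 master weights (4) plus the Adam moments (8)
  • pure BF16: 2 + 2 + 8 = 12
  • FP8 (TE): 1 + 2 + 12 = 15

States barely shrink; ACTIVATIONS and speed are the real wins.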
References: Micikevicius et al. (2018), Mixed Precision Training; NVIDIA Transformer Engine FP8 documentation
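The per-parameter accounting above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the site's own code: the function name `state_memory_gb` and the choice of decimal gigabytes (1e9 bytes) are assumptions, since the page does not say whether it reports GB or GiB.

```python
# Assumption: decimal gigabytes. The page does not specify GB vs GiB.
BYTES_PER_GB = 1e9

# (weights, gradients, optimizer state) in bytes per parameter,
# copied from the formula on this page.
REGIMES = {
    "FP32": (4, 4, 8),
    "BF16-mixed": (2, 2, 12),  # 12 = fp32 master weights (4) + Adam moments (8)
    "pure BF16": (2, 2, 8),
    "FP8 (TE)": (1, 2, 12),
}


def state_memory_gb(num_params: int, regime: str) -> dict:
    """Return persistent training-state memory in GB for one regime."""
    weights, grads, optim = REGIMES[regime]
    return {
        "weights": num_params * weights / BYTES_PER_GB,
        "gradients": num_params * grads / BYTES_PER_GB,
        "optimizer": num_params * optim / BYTES_PER_GB,
        "total": num_params * (weights + grads + optim) / BYTES_PER_GB,
    }


if __name__ == "__main__":
    # Example: a hypothetical 7-billion-parameter model.
    for name in REGIMES:
        row = state_memory_gb(7_000_000_000, name)
        print(f"{name:>10}: {row['total']:.0f} GB total")
```

For the hypothetical 7B-parameter model, FP32 and BF16-mixed both come out at 112 GB of persistent state, which is the point the tool is making.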

About Mixed Precision Training Memory Calculator

Ask why we train in BF16 and most answers are wrong: 'it halves memory' — yet the persistent state accounting shows mixed precision at the same 16 bytes per parameter as FP32, because the fp32 master weights and Adam moments never shrank. This calculator lays the four regimes side by side to make the real story visible: mixed precision wins on activations (which DO halve — see our activation calculator), on bandwidth, and on tensor-core throughput that is 2–16× higher at low precision. Knowing which budget line each trick touches is the difference between cargo-cult flags and an actually-fitting training run.

How to use Mixed Precision Training Memory Calculator

  1. Enter your values into Mixed Precision Training Memory Calculator. Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Mixed Precision Training Memory Calculator?

  • Computes Mixed Precision Training Memory instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: FP32: 4+4+8 = 16 B/param.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

If states don't shrink, why does AMP let me raise batch size?

Because activations — which scale with batch and sequence, not parameters — store at BF16 instead of FP32, halving the run's elastic memory component. For long sequences activations dwarf states, so the practical headroom gain is real even though this tool's persistent numbers match.
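A rough sketch of why halving the activation element size translates into batch headroom. The model below is a deliberately crude assumption (activation elements proportional to batch × sequence × hidden size × layers), and `tensors_per_layer` is a hypothetical placeholder constant, not a measured value and not the formula used by any particular activation calculator.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2}


def activation_gb(batch, seq_len, hidden, layers, dtype, tensors_per_layer=16):
    """Crude activation-memory estimate in decimal GB.

    Assumption: each layer keeps `tensors_per_layer` tensors of shape
    (batch, seq_len, hidden) for the backward pass. The constant is a
    placeholder chosen for illustration only.
    """
    elements = batch * seq_len * hidden * layers * tensors_per_layer
    return elements * BYTES_PER_ELEMENT[dtype] / 1e9


if __name__ == "__main__":
    fp32 = activation_gb(8, 4096, 4096, 32, "fp32")
    bf16 = activation_gb(8, 4096, 4096, 32, "bf16")
    bf16_double_batch = activation_gb(16, 4096, 4096, 32, "bf16")
    print(f"fp32, batch 8:  {fp32:.1f} GB")
    print(f"bf16, batch 8:  {bf16:.1f} GB")
    print(f"bf16, batch 16: {bf16_double_batch:.1f} GB")
```

Under this model, BF16 at twice the batch size costs the same activation memory as FP32 at the original batch size, while the parameter-dependent state is unchanged.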

BF16 vs FP16: why did BF16 win?

Range. BF16 keeps FP32's 8 exponent bits, so gradients spanning many orders of magnitude don't overflow or underflow, and no loss scaling is needed. FP16's extra mantissa bits rarely matter for training. All modern accelerators (from the A100 onward) made BF16 first-class, ending the debate.
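The range argument can be checked directly from the bit layouts. The sketch below assumes IEEE-style formats in which the highest exponent code is reserved for infinities and NaNs (true of FP32, FP16, and BF16); the helper name `float_format_limits` is illustrative.

```python
def float_format_limits(exp_bits: int, mant_bits: int):
    """Return (largest normal, smallest normal, epsilon) for an IEEE-style format.

    Assumes the top exponent code is reserved for inf/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -mant_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -mant_bits  # spacing between 1.0 and the next value
    return max_normal, min_normal, epsilon


FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

if __name__ == "__main__":
    for name, (e, m) in FORMATS.items():
        hi, lo, eps = float_format_limits(e, m)
        print(f"{name}: max={hi:.4g}  min normal={lo:.4g}  epsilon={eps:.4g}")
```

FP16 tops out at 65504, while BF16 shares FP32's exponent range and pays for it with a coarser epsilon (2^-7 versus 2^-10).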

When is pure-BF16 (no fp32 master) safe?

With large batch sizes and stochastic rounding or Kahan-style compensated updates, several labs train without masters, saving 4 B/param. Risk: tiny LR updates vanish below BF16's precision near convergence. Default to masters; drop them only with eval-backed evidence.
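The vanishing-update risk can be reproduced with a small simulation. This is a minimal sketch using only the standard library: `to_bf16` emulates BF16 storage by rounding an fp32 bit pattern to its top 16 bits (round-to-nearest-even, no special handling of NaN or infinity), and the master copy is an ordinary Python float standing in for an fp32 master. The weight value, update size, and step count are arbitrary illustrative choices.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest BF16-representable value, returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # fp32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)  # round half to even on the dropped bits
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def simulate(steps: int = 1000, update: float = 1e-3):
    """Apply the same small update repeatedly with and without a master copy."""
    w_pure = 1.0   # weight stored only in BF16
    master = 1.0   # higher-precision master (Python float as an fp32 stand-in)
    for _ in range(steps):
        w_pure = to_bf16(w_pure + update)  # result is rounded back to BF16
        master += update                   # master accumulates the update
    return w_pure, to_bf16(master)


if __name__ == "__main__":
    pure, with_master = simulate()
    print("pure BF16 weight:       ", pure)
    print("BF16 copy of the master:", with_master)
```

Near 1.0 the BF16 spacing is 2^-7, so an update of 1e-3 is below half a step and rounds away every time; the pure-BF16 weight never moves, while the master-backed weight reaches 2.0. Stochastic rounding or compensated summation, mentioned above, are ways to recover those lost updates without a master copy.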

Does FP8 training actually work?

On Hopper+ with Transformer Engine: matmul inputs in FP8 (E4M3/E5M2 with per-tensor scaling), accumulation in higher precision, masters in fp32 — frontier labs report ~30–40% throughput gains at matched quality. State memory barely moves (this tool shows it); it's a SPEED technology.
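Per-tensor scaling is the piece that makes the narrow FP8 range usable, and the idea can be sketched without any GPU library. The code below is a simplified illustration, not Transformer Engine's API or its exact scaling recipe: it picks a scale so the tensor's largest magnitude lands on the format's maximum, and it does not perform actual FP8 rounding. The maxima used are 448 for E4M3 and 57344 for E5M2; the function names are hypothetical.

```python
# Largest finite magnitudes of the two FP8 formats.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def per_tensor_scale(values, fmt: str = "E4M3") -> float:
    """Scale factor that maps the tensor's absolute maximum onto the FP8 maximum."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0  # all-zero tensor: any scale works
    return FP8_MAX[fmt] / amax


def scale_and_clamp(values, scale: float, fmt: str = "E4M3"):
    """Apply the scale and clamp to the representable range.

    Real FP8 casting would also round to the format's grid; that step is omitted.
    """
    limit = FP8_MAX[fmt]
    return [max(-limit, min(limit, v * scale)) for v in values]


if __name__ == "__main__":
    tensor = [0.5, -2.0, 1.0]
    scale = per_tensor_scale(tensor, "E4M3")
    scaled = scale_and_clamp(tensor, scale, "E4M3")
    print("scale:", scale)
    print("scaled:", scaled)
    # The matmul result would be divided by the scale(s) afterwards to undo this.
```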
