Optimizer State Memory Calculator

Memory of AdamW, SGD, Adafactor, 8-bit Adam and Lion states for any model size — the hidden 2–3× weights tax.

Outputs: optimizer states (GB), bytes per parameter, and states + BF16 weights + grads (GB).

AdamW's two fp32 moments cost 4× the BF16 weights they update — for a 7B model, 56 GB of states against 14 GB of weights. This single line item is why ZeRO shards optimizer state first.

Formula

AdamW: 8 B/param (fp32 m, v) · 8-bit Adam: 2 B/param · SGD+mom: 4 B/param · Adafactor: ~0 B/param (factored) · Lion: 4 B/param · +4 B/param fp32 master copy in mixed precision
References: Kingma & Ba (2015), Adam; Loshchilov & Hutter (2019), AdamW; Dettmers et al. (2022), 8-bit Optimizers via Block-wise Quantization; Shazeer & Stern (2018), Adafactor
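The per-parameter costs in the formula translate directly into a few lines of Python. This is a minimal sketch, not the page's actual implementation: the function name and dictionary keys are hypothetical, gradients are assumed to be stored in BF16 at 2 B/param, Adafactor's "~0" is treated as exactly 0, and 1 GB is taken as 10^9 bytes (which matches the 56 GB figure for a 7B model).

```python
# Optimizer-state bytes per trainable parameter, copied from the formula line.
STATE_BYTES = {
    "adamw": 8.0,         # two fp32 moments (m, v)
    "adam8bit": 2.0,      # two 8-bit moments
    "sgd_momentum": 4.0,  # one fp32 momentum buffer
    "adafactor": 0.0,     # "~0": factored statistics treated as negligible (assumption)
    "lion": 4.0,          # one fp32 momentum buffer
}
MASTER_COPY_BYTES = 4.0   # fp32 master weights in mixed precision
BF16_BYTES = 2.0          # assumed size of each weight and each gradient


def optimizer_memory_gb(num_params, optimizer, master_copy=False):
    """Return bytes/param, optimizer-state GB, and states + weights + grads GB."""
    per_param = STATE_BYTES[optimizer]
    if master_copy:
        per_param += MASTER_COPY_BYTES
    states_gb = num_params * per_param / 1e9
    weights_and_grads_gb = num_params * 2 * BF16_BYTES / 1e9
    return {
        "bytes_per_param": per_param,
        "states_gb": states_gb,
        "total_gb": states_gb + weights_and_grads_gb,
    }


if __name__ == "__main__":
    for name in STATE_BYTES:
        print(name, optimizer_memory_gb(7e9, name, master_copy=True))
```

Passing the trainable-parameter count rather than the full model size is what makes the same function usable for adapter-style fine-tunes.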

About Optimizer State Memory Calculator

The optimizer is training's silent memory hog: AdamW keeps two fp32 moments per parameter (8 bytes), and most mixed-precision recipes also hold a 4-byte fp32 master copy, for a total of 12 bytes of state for every 2-byte BF16 weight. This calculator prices the standard menu: full AdamW, bitsandbytes' 8-bit Adam (block-wise quantized moments, near-identical convergence), SGD with momentum, the nearly free factored Adafactor, and Lion's single momentum. It is the fastest way to see why QLoRA and ZeRO exist and which optimizer swap gets your fine-tune back onto one GPU.

How to use Optimizer State Memory Calculator

  1. Enter your values into Optimizer State Memory Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Optimizer State Memory Calculator?

  • Computes Optimizer State Memory instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: AdamW: 8 B/param (fp32 m, v).
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Does 8-bit Adam hurt convergence?

Measurably, no: Dettmers et al. showed that block-wise quantized moments with dynamic quantization match fp32 Adam across language and vision benchmarks. It is a widely used option in fine-tuning stacks such as axolotl and TRL, and it saves 6 bytes per parameter, or 42 GB on a 7B full fine-tune.
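To make "block-wise quantized" concrete, here is a simplified sketch of the idea: each block of values gets its own scale, so one outlier only costs precision inside its own block. It uses plain linear absmax quantization to signed 8-bit codes, which is a stand-in and not the dynamic quantization scheme from Dettmers et al. The function names and the block size of 2048 are assumptions for illustration.

```python
def quantize_blockwise(values, block_size=2048):
    """Quantize floats to 8-bit codes in [-127, 127], one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(x) for x in block)
        scale = absmax / 127.0 if absmax > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(x / scale) for x in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=2048):
    """Map each code back to a float using its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def bytes_per_param(block_size=2048, num_states=2):
    """1 byte per code plus one fp32 scale amortized over each block."""
    return num_states * (1 + 4 / block_size)


if __name__ == "__main__":
    # Small values in the first block, a large outlier in the second.
    vals = [0.001 * i for i in range(8)] + [100.0] + [0.0] * 7
    codes, scales = quantize_blockwise(vals, block_size=8)
    restored = dequantize_blockwise(codes, scales, block_size=8)
    print(max(abs(a - b) for a, b in zip(vals, restored)))
    print(bytes_per_param())
```

Under these assumptions the per-block scale adds well under 1% on top of the 2 B/param for the two moments, which is why the formula rounds 8-bit Adam to 2 B/param.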

How does Adafactor get away with ~zero memory?

It factorizes the second moment of each weight MATRIX into a row vector and a column vector (R + C numbers instead of R×C), reconstructing estimates by outer product. T5 was trained with it; PaLM also used Adafactor, though without factorization, so it did not get this memory saving. The cost is some optimizer-quality loss, often offset by bigger batches.
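The R + C versus R×C trade can be sketched in a few lines. This is an illustration of the storage and the outer-product reconstruction only: a real Adafactor implementation keeps running averages of these statistics and updates them every step. The function names are hypothetical.

```python
def state_counts(rows, cols):
    """Numbers stored for one weight matrix: (full second moment, factored)."""
    return rows * cols, rows + cols


def factor(v):
    """Row sums and column sums of a non-negative R x C matrix (list of rows)."""
    row_sums = [sum(row) for row in v]
    col_sums = [sum(col) for col in zip(*v)]
    return row_sums, col_sums


def reconstruct(row_sums, col_sums):
    """Rank-1 estimate: outer product of the sums, divided by the grand total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


if __name__ == "__main__":
    full, factored = state_counts(4096, 4096)
    print(full, factored, full // factored)

    # A rank-1 matrix is recovered exactly; general matrices only approximately.
    v = [[1.0, 2.0], [2.0, 4.0]]
    print(reconstruct(*factor(v)))
```

The approximation error on matrices that are not close to rank-1 is one source of the optimizer-quality loss mentioned above.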

When can I skip fp32 master weights?

With pure-BF16 training (stochastic rounding or high batch sizes) some labs drop the master copy — saving 4 B/param at some convergence risk. FP16 training cannot: its narrow range needs the fp32 anchor plus loss scaling. Toggle the option here to see the savings.
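A toy experiment shows where the convergence risk comes from. The sketch below emulates BF16 rounding with the standard library (round-to-nearest-even, no stochastic rounding) and applies many small updates to a single weight, with and without a higher-precision master copy. The helper names are hypothetical, and a Python float stands in for the fp32 master weight.

```python
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    NaN and infinity are not handled; this is meant for ordinary values only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def train(steps=1000, update=1e-3, master_copy=True):
    """Apply `steps` equal updates to a weight that starts at 1.0."""
    weight = to_bf16(1.0)
    master = 1.0  # stand-in for the fp32 master copy (assumption)
    for _ in range(steps):
        if master_copy:
            master += update            # accumulate in high precision
            weight = to_bf16(master)    # BF16 weight is just a rounded view
        else:
            weight = to_bf16(weight + update)  # update rounds away every step
    return weight


if __name__ == "__main__":
    print("with master copy:   ", train(master_copy=True))
    print("without master copy:", train(master_copy=False))
```

Near 1.0, adjacent BF16 values are 2^-7 apart, so an update of 0.001 is rounded away on every step unless something higher-precision accumulates it. Stochastic rounding attacks the same problem statistically and is not modeled here.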

Why does LoRA shrink this calculator's number so dramatically?

Optimizer state scales with TRAINABLE parameters. LoRA trains ~0.1–1% of weights, so AdamW's 12 B/param applies to millions of parameters, not billions: for a 7B model, roughly 84–840 MB of state instead of tens of GB. Set the parameter input to your adapter count to see it.
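To find an adapter count to enter, the arithmetic can be sketched as follows. The configuration is hypothetical (hidden size 4096, rank 16, adapters on two projection matrices in each of 32 layers) and the function names are illustrative; 1 MB is taken as 10^6 bytes.

```python
def lora_trainable_params(d_in, d_out, rank, num_matrices):
    """Each adapted d_in x d_out matrix gains A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out) * num_matrices


def adamw_state_mb(trainable_params, bytes_per_param=12):
    """8 B of fp32 moments plus a 4 B fp32 master copy per trainable parameter."""
    return trainable_params * bytes_per_param / 1e6


if __name__ == "__main__":
    # Hypothetical setup: 32 layers, 2 adapted square projections per layer.
    params = lora_trainable_params(d_in=4096, d_out=4096, rank=16, num_matrices=2 * 32)
    print(f"trainable parameters: {params:,}")
    print(f"share of a 7B model:  {params / 7e9:.2%}")
    print(f"AdamW state:          {adamw_state_mb(params):.1f} MB")
```

Raising the rank or adapting more matrices scales the count linearly, so the same two functions cover most adapter layouts.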
