Optimizer State Memory Calculator
Memory of AdamW, SGD, Adafactor, 8-bit Adam and Lion states for any model size: the hidden tax of 2–3× the weights in fp32, or 4–6× in BF16.
AdamW's two fp32 moments cost 4× the BF16 weights they update — for a 7B model, 56 GB of states against 14 GB of weights. This single line item is why ZeRO shards optimizer state first.
Formula
About Optimizer State Memory Calculator
The optimizer is training's silent memory hog. AdamW keeps two fp32 moments per parameter (8 bytes), and most mixed-precision recipes also hold a 4-byte fp32 master copy, for a total of 12 bytes of state per 2-byte BF16 weight. This calculator prices the standard menu: full AdamW, bitsandbytes' 8-bit Adam (block-wise quantized moments, near-identical convergence), SGD with momentum, the nearly free factored Adafactor, and Lion's single momentum. It is the fastest way to see why QLoRA and ZeRO exist and which optimizer swap gets your fine-tune back onto one GPU.
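The per-parameter accounting described above can be sketched in a few lines of Python. This is an illustrative sketch, not the page's actual implementation: the function and dictionary names are hypothetical, the SGD and Lion entries assume a single fp32 momentum buffer, the 8-bit Adam entry ignores the small per-block scale overhead, and Adafactor is left out because its footprint depends on matrix shapes rather than on the raw parameter count.

```python
# Bytes of optimizer state per trainable parameter.
# AdamW and 8-bit Adam follow the figures quoted in the text; the SGD and
# Lion entries are assumptions of this sketch (one fp32 buffer each).
STATE_BYTES_PER_PARAM = {
    "adamw": 8,         # two fp32 moments (m, v)
    "adam8bit": 2,      # two 8-bit moments, per-block scales ignored
    "sgd_momentum": 4,  # one fp32 momentum buffer (assumed)
    "lion": 4,          # one fp32 momentum buffer (assumed)
}
MASTER_WEIGHT_BYTES = 4  # optional fp32 master copy of the weights


def optimizer_state_bytes(trainable_params, optimizer="adamw", master_weights=True):
    """Return total bytes of optimizer state (plus master weights if enabled)."""
    per_param = STATE_BYTES_PER_PARAM[optimizer]
    if master_weights:
        per_param += MASTER_WEIGHT_BYTES
    return trainable_params * per_param


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    n = 7_000_000_000  # a 7B-parameter model
    for name in STATE_BYTES_PER_PARAM:
        gb = to_gb(optimizer_state_bytes(n, name, master_weights=False))
        print(f"{name:>13}: {gb:.0f} GB of state (no master copy)")
```

Calling `optimizer_state_bytes(7_000_000_000, "adamw", master_weights=False)` reproduces the 56 GB figure for a 7B model; enabling the master copy brings it to 12 bytes per parameter.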
How to use Optimizer State Memory Calculator
1. Enter your values into the Optimizer State Memory Calculator. Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use Optimizer State Memory Calculator?
- ✓ Computes optimizer state memory instantly in your browser: no sign-up, no upload, no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown: AdamW: 8 B/param (fp32 m, v).
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Does 8-bit Adam hurt convergence?
Not measurably. Dettmers et al. showed that block-wise dynamic quantization of the moments matches fp32 Adam across language and vision benchmarks. It is a widely used option in fine-tuning stacks (axolotl, TRL) and saves 6 bytes per parameter: 42 GB on a 7B full fine-tune.
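To see why quantizing per block matters, here is a deliberately simplified sketch. It uses linear absmax int8 quantization, which is an assumption made for brevity; the 8-bit optimizers described by Dettmers et al. use a non-linear dynamic data type and much larger blocks. The function names and the tiny block size are hypothetical.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8 codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map int8 codes back to floats using each block's scale."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    # Small moments sit next to large ones, as happens across layers.
    moments = [0.001, -0.002, 0.0015, 0.0005, 10.0, -5.0, 2.5, 7.5]

    fine = dequantize_blockwise(*quantize_blockwise(moments, 4), 4)
    coarse = dequantize_blockwise(*quantize_blockwise(moments, 8), 8)

    print("per-block scales:", fine[:4])    # small values survive
    print("one global scale:", coarse[:4])  # small values collapse to zero
```

With one scale for the whole tensor, the outlier 10.0 sets the step size and the small moments round to zero; with a scale per block they keep their relative precision.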
How does Adafactor get away with ~zero memory?
It factorizes the second moment of each weight matrix into a row vector and a column vector (R + C numbers instead of R×C), reconstructing estimates by outer product. T5 was trained with it; PaLM used an Adafactor variant without factorization. The cost is some optimizer-quality loss, often offset by bigger batches.
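The row/column factorization can be sketched as follows. This is a single-step illustration under stated assumptions: real Adafactor keeps exponential moving averages of the row and column statistics and adds further safeguards, and the function names here are hypothetical.

```python
def factor_second_moment(sq_grads):
    """Reduce an R x C matrix of squared gradients to row sums and column sums."""
    rows = [sum(row) for row in sq_grads]
    cols = [sum(col) for col in zip(*sq_grads)]
    return rows, cols


def reconstruct(rows, cols):
    """Rebuild the R x C estimate as the outer product rows * cols / total."""
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]


def stored_entries(num_rows, num_cols, factored=True):
    """Count the numbers kept for one matrix's second moment."""
    return num_rows + num_cols if factored else num_rows * num_cols


if __name__ == "__main__":
    # A rank-1 matrix is recovered exactly; general matrices only approximately.
    sq_grads = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
    rows, cols = factor_second_moment(sq_grads)
    print(reconstruct(rows, cols))

    print(stored_entries(4096, 4096, factored=False), "entries unfactored")
    print(stored_entries(4096, 4096, factored=True), "entries factored")
```

For a 4096×4096 matrix the factored form stores 8,192 numbers instead of 16,777,216, which is where the "nearly free" footprint comes from.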
When can I skip fp32 master weights?
With pure-BF16 training (stochastic rounding or high batch sizes) some labs drop the master copy — saving 4 B/param at some convergence risk. FP16 training cannot: its narrow range needs the fp32 anchor plus loss scaling. Toggle the option here to see the savings.
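The convergence risk comes from BF16's coarse spacing: near 1.0 the representable values are 2^-7 apart, so a small update can round away entirely. The sketch below emulates BF16 with round-to-nearest-even using only the standard library. It is a minimal illustration under stated assumptions: a Python float stands in for the fp32 master copy, stochastic rounding is not modeled, and the function names are hypothetical.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a value to the nearest bfloat16 (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def train(steps: int, update: float, use_master: bool) -> float:
    """Apply the same small update repeatedly and return the final BF16 weight."""
    weight = to_bf16(1.0)
    master = 1.0  # Python float standing in for the fp32 master copy
    for _ in range(steps):
        if use_master:
            master += update            # accumulate in high precision
            weight = to_bf16(master)    # cast down only for compute
        else:
            weight = to_bf16(weight + update)  # update rounds away each step
    return weight


if __name__ == "__main__":
    print("pure BF16:  ", train(100, 1e-3, use_master=False))
    print("with master:", train(100, 1e-3, use_master=True))
```

Without the master copy the weight never leaves 1.0, because each 0.001 step is smaller than half the BF16 spacing; with it, the accumulated updates eventually show up in the BF16 weight.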
Why does LoRA shrink this calculator's number so dramatically?
Optimizer state scales with trainable parameters. LoRA trains ~0.1–1% of weights, so AdamW's 12 B/param applies to millions of parameters, not billions: tens to hundreds of megabytes of state instead of tens of GB. Set the parameter input to your adapter count to see it.
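A rough adapter count for the parameter input can be estimated as below. A rank-r LoRA adapter on a d_out × d_in matrix adds r × (d_in + d_out) trainable parameters; the model shape used here (hidden size 4096, 32 layers, two adapted matrices per layer, rank 8) is a hypothetical example, not a specific model, and the function name is illustrative.

```python
def lora_trainable_params(d_in, d_out, rank, num_matrices):
    """Trainable parameters for rank-`rank` adapters on `num_matrices` matrices."""
    return num_matrices * rank * (d_in + d_out)


if __name__ == "__main__":
    BYTES_PER_PARAM = 12  # AdamW moments (8 B) plus fp32 master copy (4 B)

    # Hypothetical shape: 32 layers, 2 adapted 4096 x 4096 matrices each, rank 8.
    adapter_params = lora_trainable_params(4096, 4096, rank=8, num_matrices=64)
    full_params = 7_000_000_000

    print(f"adapter params: {adapter_params:,}")
    print(f"LoRA state:     {adapter_params * BYTES_PER_PARAM / 1e6:.1f} MB")
    print(f"full FT state:  {full_params * BYTES_PER_PARAM / 1e9:.0f} GB")
```

Under these assumptions the adapters hold about 4.2 million trainable parameters, so their optimizer state is measured in megabytes while the full fine-tune needs 84 GB.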