ZeRO / FSDP Per-GPU Memory Calculator
Per-GPU memory at ZeRO stages 0–3 (FSDP full-shard) for any model and cluster size — see what each stage shards.
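Defaults: a 13B model with AdamW carries 208 GB of model states (16 bytes per parameter), far more than a single 80 GB GPU can hold, but ZeRO-2 across 8 GPUs needs 48.75 GB each (26 GB of unsharded BF16 weights plus 22.75 GB of sharded gradients and optimizer state). Stage 3 / FSDP shards the weights too, paying an all-gather per layer per step.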
Formula
About ZeRO / FSDP Per-GPU Memory Calculator
ZeRO's insight was that data-parallel replicas waste memory: every GPU holding identical copies of 12-bytes-per-parameter optimizer state is redundancy you can shard away. This calculator walks the ladder — stage 1 shards optimizer states (the big win), stage 2 adds gradients, stage 3/FSDP shards the weights themselves — showing per-GPU memory for your model and cluster. The communication price rises with each stage (stage 3 all-gathers every layer's weights each step), which is why practitioners pick the LOWEST stage that fits rather than defaulting to maximum sharding.
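As a rough sketch, the stage ladder described above can be written as a short Python function. It assumes the page's byte counts (2 bytes per parameter for BF16 weights, 2 for BF16 gradients, 12 for FP32 Adam state plus master weights) and decimal gigabytes. The name `per_gpu_state_gb` and its arguments are illustrative, not the calculator's actual source.

```python
def per_gpu_state_gb(params_billion: float, n_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes), activations excluded.

    Assumed byte counts per parameter: 2 (BF16 weights), 2 (BF16 gradients),
    12 (FP32 Adam m, v and master weights).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")

    p = params_billion  # billions of parameters x bytes per parameter = GB
    weights = 2 * p / (n_gpus if stage >= 3 else 1)   # sharded only at stage 3
    grads = 2 * p / (n_gpus if stage >= 2 else 1)     # sharded from stage 2
    optim = 12 * p / (n_gpus if stage >= 1 else 1)    # sharded from stage 1
    return weights + grads + optim


# Walk the ladder for a 13B model on 8 GPUs.
for stage in range(4):
    print(f"stage {stage}: {per_gpu_state_gb(13, 8, stage):.2f} GB per GPU")
```

Changing `n_gpus` shows that only the sharded terms shrink as the cluster grows. At stages 1 and 2 the unsharded weights set a floor on per-GPU memory.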
How to use ZeRO / FSDP Per-GPU Memory Calculator
1. Enter your values into ZeRO / FSDP Per-GPU Memory Calculator. Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use ZeRO / FSDP Per-GPU Memory Calculator?
- Computes ZeRO / FSDP Per-GPU Memory instantly in your browser: no sign-up, no upload, no server round-trip.
- 100% free and unlimited, with the exact formula shown: per-GPU = 2P[÷N if stage≥3] + 2P[÷N if stage≥2] + 12P[÷N if stage≥1], where P is the parameter count and N the number of GPUs (BF16 weights and gradients; FP32 Adam m, v and master weights).
- Runs entirely client-side, so every value you enter stays private on your device.
- Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Which ZeRO stage should I use?
The lowest that fits after activations: stage 1 if optimizer state is your only problem (common for 7B on 8×80 GB), stage 2 as the usual sweet spot (negligible extra comms — gradients were being reduced anyway), stage 3 only when weights themselves overflow, accepting ~10–30% slowdown.
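The "lowest stage that fits" rule can be sketched as a small search over stages 0 to 3. This is a minimal illustration, assuming the page's 2 + 2 + 12 bytes-per-parameter formula and a per-GPU activation figure that you supply yourself. The function names and the 20 GB activation value are hypothetical placeholders.

```python
from typing import Optional


def per_gpu_state_gb(params_billion: float, n_gpus: int, stage: int) -> float:
    """Model-state memory per GPU in GB, using the page's 2 + 2 + 12 byte formula."""
    weights = 2 * params_billion / (n_gpus if stage >= 3 else 1)
    grads = 2 * params_billion / (n_gpus if stage >= 2 else 1)
    optim = 12 * params_billion / (n_gpus if stage >= 1 else 1)
    return weights + grads + optim


def lowest_stage_that_fits(
    params_billion: float,
    n_gpus: int,
    vram_gb: float,
    activation_gb: float = 0.0,
) -> Optional[int]:
    """Return the lowest ZeRO stage whose states plus activations fit, else None."""
    for stage in range(4):  # try the cheapest communication pattern first
        total = per_gpu_state_gb(params_billion, n_gpus, stage) + activation_gb
        if total <= vram_gb:
            return stage
    return None  # even full sharding does not fit on this cluster


# 7B on 8 x 80 GB, with an assumed (hypothetical) 20 GB of activations per GPU.
print(lowest_stage_that_fits(7, 8, 80, activation_gb=20))  # prints 1
```

A result of `None` means that sharding alone is not enough. You would then need more GPUs, offload, or another form of parallelism.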
Is FSDP the same as ZeRO-3?
Conceptually yes — fully sharded weights, gradients and optimizer states with on-demand all-gather. FSDP is PyTorch-native with finer control (wrapping policy, CPU offload, hybrid sharding = ZeRO-3 within a node + DDP across nodes); DeepSpeed ZeRO offers stages and NVMe offload.
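To make the correspondence concrete, here is a hedged sketch of how a ZeRO stage might be expressed in each framework. It assumes the `zero_optimization.stage` key of a DeepSpeed config and the member names of PyTorch's `torch.distributed.fsdp.ShardingStrategy` enum. The names are kept as plain strings so the snippet runs without either library installed, and the helper functions are illustrative only.

```python
from typing import Optional

# Names of torch.distributed.fsdp.ShardingStrategy members, stored as strings
# so this sketch has no PyTorch dependency. Stage 1 is omitted on purpose:
# to my knowledge the classic FSDP enum has no optimizer-state-only mode.
FSDP_STRATEGY_FOR_STAGE = {
    0: "NO_SHARD",        # plain data parallelism, nothing sharded
    2: "SHARD_GRAD_OP",   # gradients and optimizer state sharded
    3: "FULL_SHARD",      # weights, gradients and optimizer state sharded
}


def fsdp_strategy_name(stage: int, hybrid: bool = False) -> Optional[str]:
    """Return the closest FSDP strategy name for a ZeRO stage, or None."""
    if hybrid:
        if stage != 3:
            raise ValueError("this sketch only models hybrid sharding for stage 3")
        return "HYBRID_SHARD"  # full shard within a node, replicate across nodes
    return FSDP_STRATEGY_FOR_STAGE.get(stage)


def deepspeed_zero_config(stage: int) -> dict:
    """Minimal DeepSpeed config fragment selecting a ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    return {"zero_optimization": {"stage": stage}}


print(fsdp_strategy_name(3), deepspeed_zero_config(3))
```

Check the documentation for the versions you run before relying on these names, since both libraries evolve.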
Why doesn't this calculator include activations?
Activations don't shard with ZeRO — they scale with your micro-batch and sequence per GPU regardless of stage (tensor/sequence parallelism handles them instead). Use our activation-memory calculator and add the two; that sum is what must fit in VRAM.
What is hybrid sharding (HSDP)?
Shard fully within each node, where fast NVLink absorbs the all-gathers, and replicate across nodes, so the slower inter-node network only carries gradient reduction. It is the pragmatic large-cluster default: per-GPU memory is higher than with global ZeRO-3, but scaling is much better.
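The memory side of that trade can be sketched in a few lines. The example assumes the page's 16 bytes of model state per parameter and treats hybrid sharding as "divide by GPUs per node" rather than by the whole cluster. The function names and the 64-GPU, 8-per-node cluster are hypothetical.

```python
BYTES_PER_PARAM = 16  # 2 (BF16 weights) + 2 (BF16 grads) + 12 (FP32 Adam state)


def full_shard_gb(params_billion: float, world_size: int) -> float:
    """Global ZeRO-3 / FSDP full shard: states divided across every GPU."""
    return BYTES_PER_PARAM * params_billion / world_size


def hybrid_shard_gb(params_billion: float, gpus_per_node: int) -> float:
    """Hybrid sharding: states divided only across the GPUs within one node."""
    return BYTES_PER_PARAM * params_billion / gpus_per_node


# Hypothetical cluster: 13B model, 8 nodes of 8 GPUs each.
world_size, gpus_per_node = 64, 8
print(f"global full shard: {full_shard_gb(13, world_size):.2f} GB per GPU")
print(f"hybrid shard:      {hybrid_shard_gb(13, gpus_per_node):.2f} GB per GPU")
```

With hybrid sharding, per-GPU state memory stops shrinking once you add nodes. That is the memory cost paid for keeping all-gathers on the fast intra-node links.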