FP8 Model Size Calculator
Size models stored in FP8 (E4M3) — the H100-era serving format — with per-tensor scale overhead.
FP8 (usually E4M3 for weights) halves BF16 memory while keeping a floating-point format the H100/H200 tensor cores execute natively. Per-tensor or per-channel scales add well under 1% — ≈8.06 effective bits per weight.
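Formula
size(GB) = params × 8.06 ÷ 8 ÷ 10⁹, where 8.06 is the measured effective bits per weight for FP8, including scale overhead.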
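Here is a minimal Python sketch of the formula size(GB) = params × 8.06 ÷ 8 ÷ 10⁹. It assumes decimal gigabytes (10⁹ bytes), as the page does. The function name `model_size_gb` and the example parameter counts are illustrative, and the result covers weight storage only.

```python
# Effective bits per weight for FP8 (E4M3) as given on this page, including
# scale overhead; BF16 stores a plain 16 bits per weight.
FP8_EFFECTIVE_BITS = 8.06
BF16_BITS = 16.0


def model_size_gb(params: float, bits_per_weight: float = FP8_EFFECTIVE_BITS) -> float:
    """Weight storage in decimal GB: params * bits / 8 bits-per-byte / 1e9."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, n in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
        fp8 = model_size_gb(n)
        bf16 = model_size_gb(n, BF16_BITS)
        print(f"{name}: FP8 ~ {fp8:.1f} GB vs BF16 ~ {bf16:.1f} GB")
```

If you pass 16 bits, you get the BF16 baseline. This makes it easy to confirm the roughly 2× saving for any parameter count.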
About FP8 Model Size Calculator
This calculator turns any parameter count into a concrete file size and a realistic total-memory figure. You can use it to check whether a given checkpoint fits your GPU VRAM or system RAM before downloading tens of gigabytes. It uses the measured effective bits per weight of the format rather than the nominal 8-bit width. That figure includes scale factors and any tensors kept at higher precision.
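The sketch below shows why scales cost so little. It assumes one FP32 scale (32 bits) per group of FP8 values and a hypothetical 4096 × 4096 projection matrix. The helper `effective_bits` is illustrative and not part of any library.

```python
def effective_bits(weights_per_scale: int, value_bits: int = 8, scale_bits: int = 32) -> float:
    """Bits per weight when every `weights_per_scale` FP8 values share one scale."""
    return value_bits + scale_bits / weights_per_scale


if __name__ == "__main__":
    d_out, d_in = 4096, 4096  # hypothetical square projection matrix
    per_tensor = effective_bits(d_out * d_in)  # one scale for the whole matrix
    per_channel = effective_bits(d_in)         # one scale per output row
    for label, bits in [("per-tensor", per_tensor), ("per-channel", per_channel)]:
        print(f"{label}: {bits:.4f} bits/weight (+{(bits - 8) / 8:.3%})")
```

Both results land below the page's 8.06 figure. The remainder presumably comes from the other items the calculator includes, such as tensors left at higher precision.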
How to use FP8 Model Size Calculator
1. Enter your values into the FP8 Model Size Calculator. Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use FP8 Model Size Calculator?
- ✓ Computes FP8 model size instantly in your browser: no sign-up, no upload, no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown: size(GB) = params × 8.06 bits ÷ 8 ÷ 10⁹ (8.06 = measured effective bits per weight for this format, including scales).
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
FP8 vs INT8: why did serving move to FP8?
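FP8's exponent bits give it enough dynamic range to represent outlier activations alongside small values under a single scale. INT8 schemes need extra per-channel or outlier-splitting tricks to cope with those same outliers. On Hopper GPUs, FP8 tensor-core matmuls run at twice the BF16 rate (the same peak as INT8). You therefore get the memory saving and a throughput gain with simpler calibration.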
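The toy example below illustrates the outlier argument. It makes these assumptions:
- Both formats use a single per-tensor scale.
- The activation values are made up, with one large outlier.
- `round_to_e4m3` is a hand-written helper that models E4M3 rounding (round to nearest, ties to even, saturating at ±448). It ignores NaN handling, which real kernels do in hardware.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value (FN variant)


def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest E4M3 value (ties to even), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = abs(x)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # -6 is E4M3's smallest normal exponent; subnormals share its spacing.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per binade
    q = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def fp8_roundtrip(xs):
    """Per-tensor scale maps the largest |x| to 448 (assumes a non-zero input)."""
    scale = max(abs(v) for v in xs) / E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in xs]


def int8_roundtrip(xs):
    """Symmetric per-tensor INT8: the largest |x| maps to 127."""
    scale = max(abs(v) for v in xs) / 127
    return [max(-127, min(127, round(v / scale))) * scale for v in xs]


def rel_err(xs, ys):
    return [abs(x - y) / abs(x) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    acts = [0.01, -0.02, 0.05, 0.1, 100.0]  # hypothetical activations, one outlier
    for name, fn in [("INT8", int8_roundtrip), ("FP8 E4M3", fp8_roundtrip)]:
        errs = rel_err(acts, fn(acts))[:4]  # the four small values
        print(f"{name}: mean relative error on small values = {sum(errs) / len(errs):.1%}")
```

Because the outlier sets the INT8 scale, every small value rounds to zero. E4M3 keeps the same values within a few percent, because its spacing shrinks with magnitude.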
What is the difference between E4M3 and E5M2?
E4M3 (4 exponent, 3 mantissa bits) has finer precision and is used for weights and activations; E5M2 has more range and is mainly used for gradients in FP8 training. Serving stacks store weights as E4M3.
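A small decoder makes the range/precision trade-off concrete. It assumes the OCP 8-bit floating-point conventions:
- E4M3 uses exponent bias 7, has no infinities, and reserves only S.1111.111 for NaN.
- E5M2 uses bias 15 and IEEE 754-style Inf/NaN encodings.

`decode_fp8` and `summarize` are illustrative helpers.

```python
import math


def decode_fp8(code: int, exp_bits: int, man_bits: int, bias: int, ieee_specials: bool) -> float:
    """Decode one 8-bit FP8 code (sign | exponent | mantissa) to a Python float."""
    sign = -1.0 if code >> 7 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    man = code & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    if ieee_specials and exp == exp_max:  # E5M2: all-ones exponent is Inf/NaN
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == (1 << man_bits) - 1:
        return math.nan  # E4M3 (FN): only S.1111.111 is NaN, no Inf
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


FORMATS = {
    "E4M3": dict(exp_bits=4, man_bits=3, bias=7, ieee_specials=False),
    "E5M2": dict(exp_bits=5, man_bits=2, bias=15, ieee_specials=True),
}


def summarize(name: str) -> dict:
    """Enumerate all 256 codes and report range and precision figures."""
    vals = [decode_fp8(c, **FORMATS[name]) for c in range(256)]
    finite = [v for v in vals if math.isfinite(v)]
    positive = [v for v in finite if v > 0]
    return {
        "max": max(finite),
        "min_subnormal": min(positive),
        "step_at_1": min(v for v in finite if v > 1.0) - 1.0,  # gap just above 1.0
        "finite_codes": len(finite),
    }


if __name__ == "__main__":
    for name in FORMATS:
        print(name, summarize(name))
```

E4M3 tops out at 448, with a spacing of 0.125 just above 1.0. E5M2 reaches 57344 but doubles that spacing to 0.25.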
Does FP8 hurt model quality?
With per-channel weight scales and calibrated activation scales, FP8 serving typically stays within 0.1% of BF16 on standard benchmarks — tighter than most 4-bit schemes. DeepSeek-V3 even released its native weights in FP8.
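The toy comparison below contrasts per-tensor and per-channel weight scales. It makes these assumptions:
- `round_to_e4m3` is a hand-written helper that models E4M3 rounding with saturation at ±448.
- The 2-row weight matrix is hypothetical.
- Scales are kept as full-precision Python floats.
- Activations are not quantized.

It illustrates the mechanism only and does not reproduce any benchmark figure.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value (FN variant)


def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest E4M3 value (ties to even), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = abs(x)
    exp = max(math.frexp(mag)[1] - 1, -6)  # floor(log2|x|), floored at the subnormal range
    step = 2.0 ** (exp - 3)                # 3 mantissa bits -> 8 steps per binade
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def quantize_rows(w, per_channel: bool):
    """Round-trip a weight matrix (list of rows) through E4M3 with per-row or per-tensor scales."""
    if per_channel:
        scales = [max(abs(v) for v in row) / E4M3_MAX for row in w]
    else:
        s = max(abs(v) for row in w for v in row) / E4M3_MAX
        scales = [s] * len(w)
    return [[round_to_e4m3(v / s) * s for v in row] for row, s in zip(w, scales)]


def max_rel_err(w, wq):
    return max(abs(a - b) / abs(a) for row, qrow in zip(w, wq) for a, b in zip(row, qrow))


if __name__ == "__main__":
    # Hypothetical weights: one small-magnitude output channel, one large one.
    w = [[1e-4, -2e-4, 3e-4], [10.0, -20.0, 15.0]]
    for per_channel in (False, True):
        err = max_rel_err(w, quantize_rows(w, per_channel))
        print(f"per_channel={per_channel}: max relative error {err:.1%}")
```

With one shared scale, the small-magnitude row is pushed into E4M3's subnormal range and loses about 13% precision. Giving each row its own scale keeps the worst case under 5%.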
Can older GPUs run FP8 checkpoints?
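Pre-Ada NVIDIA GPUs (for example, the A100 and consumer RTX 30 cards) lack FP8 tensor cores. On these GPUs, frameworks either upcast FP8 weights to FP16/BF16 for compute or refuse to load the checkpoint. Upcasting keeps the size benefit but loses the speed benefit. Ada GPUs (RTX 40 series, L4, L40) do have FP8 tensor cores, so support there depends on the software stack, for example Transformer Engine.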
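As a rough rule of thumb, FP8 tensor cores first appear at CUDA compute capability 8.9 (Ada) and are present on 9.0 (Hopper). The sketch below assumes later architectures keep them. `has_fp8_tensor_cores` is a hypothetical helper. Hardware support alone does not guarantee that your framework ships FP8 kernels.

```python
# Assumed cut-off: FP8 tensor cores start at compute capability 8.9 (Ada Lovelace).
# Ampere parts (A100 = 8.0, RTX 30 series = 8.6) fall below it; Hopper is 9.0.
FP8_MIN_CAPABILITY = (8, 9)


def has_fp8_tensor_cores(capability: tuple[int, int]) -> bool:
    """True if a GPU with this (major, minor) compute capability has FP8 tensor cores."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


if __name__ == "__main__":
    examples = {"A100": (8, 0), "RTX 3090": (8, 6), "RTX 4090": (8, 9), "H100": (9, 0)}
    for gpu, cc in examples.items():
        print(f"{gpu} (sm_{cc[0]}{cc[1]}): FP8 tensor cores = {has_fp8_tensor_cores(cc)}")
    # With PyTorch and a CUDA device available, torch.cuda.get_device_capability()
    # returns the local GPU's (major, minor) tuple to pass in here.
```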
Related tools
- Attention Layer Parameter Calculator
- Feed-Forward (FFN/MLP) Parameter Calculator
- Embedding Parameter & Memory Calculator
- LLM FLOPs-per-Token Calculator
- Chinchilla Training FLOPs Calculator
- LLM Training GPU-Hours & Cost Calculator
- Image Classification (binary) — Confusion Matrix & Metrics Calculator
- Manufacturing Defect Detection — Confusion Matrix & Metrics Calculator
Related ML & AI tools
ROC-AUC Calculator (from TPR/FPR points)
Trapezoidal area under the ROC curve from your (FPR, TPR) operating points — the threshold-independent ranking score.
Classification Threshold Cost Calculator
Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.
Silhouette Score Calculator
Cluster cohesion vs separation for one point: the building block of the silhouette metric for choosing K.