
FP8 Model Size Calculator

Size models stored in FP8 (E4M3) — the H100-era serving format — with per-tensor scale overhead.

FP8 (usually E4M3 for weights) halves BF16 memory while keeping a floating-point format the H100/H200 tensor cores execute natively. Per-tensor or per-channel scales add well under 1% — ≈8.06 effective bits per weight.
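To see why scale overhead stays small, the sketch below counts bits for one FP8 weight matrix plus its scales. The matrix shape and the 32-bit scale width are illustrative assumptions, not values taken from this page, and the function name is hypothetical. It covers scale overhead only, so it is not expected to reproduce the 8.06 figure.

```python
def effective_bits_per_weight(out_features, in_features,
                              scale_bits=32, per_channel=True):
    """Effective bits per weight for one FP8 matrix plus its scales.

    Assumes one scale per output channel (per_channel=True) or a single
    scale for the whole tensor. scale_bits=32 is an assumed storage width.
    """
    n_weights = out_features * in_features
    n_scales = out_features if per_channel else 1
    total_bits = n_weights * 8 + n_scales * scale_bits
    return total_bits / n_weights


# Hypothetical 4096 x 4096 projection matrix.
bits = effective_bits_per_weight(4096, 4096)
overhead_pct = 100 * (bits / 8 - 1)
print(f"{bits:.4f} bits/weight, scale overhead {overhead_pct:.3f}%")
```

For this assumed shape the per-channel scales add roughly 0.1% on top of the 8 stored bits, and a single per-tensor scale adds far less.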

Formula

size(GB) = params × 8.06 bits ÷ 8 ÷ 10⁹ (8.06 = measured effective bits/weight for this format, incl. scales)
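As a worked example, the formula can be written as a small function. This is a minimal sketch: the function name and the two parameter counts are illustrative choices, while the 8.06 bits/weight constant is the figure stated above.

```python
EFFECTIVE_BITS = 8.06  # effective bits per weight stated on this page


def fp8_size_gb(params, bits_per_weight=EFFECTIVE_BITS):
    """size(GB) = params x bits / 8 / 10^9 (decimal gigabytes)."""
    return params * bits_per_weight / 8 / 1e9


# Hypothetical parameter counts, chosen only to illustrate the arithmetic.
for label, params in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{label}: {fp8_size_gb(params):.2f} GB")
```

With these inputs the result is about 7.05 GB for 7 billion parameters and about 70.5 GB for 70 billion.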
References: llama.cpp quantization documentation (k-quants); Frantar et al. (2022), GPTQ; Lin et al. (2023), AWQ; NVIDIA FP8 Transformer Engine docs

About FP8 Model Size Calculator

This calculator turns any parameter count into a concrete file size and a realistic total-memory figure, so you can check whether a given checkpoint fits your GPU VRAM or system RAM before downloading tens of gigabytes. It uses the measured effective bits-per-weight of the format (including scale overhead and mixed-precision tensor exceptions) rather than the nominal bit-width.
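A fit check of this kind might look like the sketch below. The page does not state how its total-memory figure is derived, so the `runtime_overhead` fraction here is a placeholder assumption (standing in for KV cache, activations, and framework buffers), and both function names are hypothetical.

```python
def fp8_weights_gb(params, bits_per_weight=8.06):
    """Weight size in decimal GB, using the page's 8.06 bits/weight."""
    return params * bits_per_weight / 8 / 1e9


def fits_in_memory(params, memory_gb, runtime_overhead=0.20):
    """Return (fits, needed_gb).

    runtime_overhead is an ASSUMED fraction added on top of the weights;
    it is not a value from this page and varies with batch size and
    context length.
    """
    needed_gb = fp8_weights_gb(params) * (1 + runtime_overhead)
    return needed_gb <= memory_gb, needed_gb


# Hypothetical scenarios: 70e9 parameters on 80 GB, 8e9 parameters on 24 GB.
for params, memory_gb in [(70e9, 80), (8e9, 24)]:
    fits, needed = fits_in_memory(params, memory_gb)
    print(f"{params:.0e} params on {memory_gb} GB: "
          f"need ~{needed:.1f} GB, fits={fits}")
```

Under the assumed 20% overhead, the 70-billion-parameter case does not fit in 80 GB even though the weights alone would, which is why the weights-only figure should not be read as the memory needed to run the model.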

How to use FP8 Model Size Calculator

  1. Enter your values into FP8 Model Size Calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use FP8 Model Size Calculator?

  • Computes FP8 Model Size instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown: size(GB) = params × 8.06 bits ÷ 8 ÷ 10⁹ (8.06 = measured effective bits/weight for this format, incl. scales).
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

FP8 vs INT8: why did serving move to FP8?

FP8's exponent bits absorb the outlier activations that force INT8 schemes into complex per-channel tricks. On Hopper GPUs, FP8 matmuls run at roughly twice the peak BF16 tensor-core throughput, so you get the memory saving and a throughput win with simpler calibration.

What is the difference between E4M3 and E5M2?

E4M3 (4 exponent, 3 mantissa bits) has finer precision and is used for weights and activations; E5M2 has more range and is mainly used for gradients in FP8 training. Serving stacks store weights as E4M3.
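The precision-versus-range trade-off can be computed directly from the bit layouts. The sketch below assumes the encodings of the OCP 8-bit floating-point specification, which this page does not spell out: E4M3 uses bias 7 and reserves only the all-ones bit pattern for NaN, while E5M2 uses bias 15 and reserves the whole top exponent code for infinity and NaN, as IEEE formats do. The function names are hypothetical.

```python
def fp8_max_finite(exp_bits, man_bits, bias, reserves_top_exponent):
    """Largest finite value representable in an FP8 layout."""
    top_code = 2**exp_bits - 1
    if reserves_top_exponent:
        # IEEE-style (E5M2): top exponent code is inf/NaN only.
        max_exp = (top_code - 1) - bias
        max_mantissa = 2**man_bits - 1
    else:
        # E4M3: top exponent is usable; only all-ones mantissa is NaN.
        max_exp = top_code - bias
        max_mantissa = 2**man_bits - 2
    return 2.0**max_exp * (1 + max_mantissa / 2**man_bits)


def relative_step(man_bits):
    """Spacing between adjacent values relative to a power of two."""
    return 2.0 ** -man_bits


E4M3 = dict(exp_bits=4, man_bits=3, bias=7, reserves_top_exponent=False)
E5M2 = dict(exp_bits=5, man_bits=2, bias=15, reserves_top_exponent=True)

for name, fmt in [("E4M3", E4M3), ("E5M2", E5M2)]:
    print(name, "max finite:", fp8_max_finite(**fmt),
          "relative step:", relative_step(fmt["man_bits"]))
```

Under these assumptions E4M3 tops out at 448 with a relative step of 1/8, while E5M2 reaches 57344 with a coarser step of 1/4, which matches the weights-versus-gradients split described above.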

Does FP8 hurt model quality?

With per-channel weight scales and calibrated activation scales, FP8 serving typically stays within 0.1% of BF16 on standard benchmarks — tighter than most 4-bit schemes. DeepSeek-V3 even released its native weights in FP8.

Can older GPUs run FP8 checkpoints?

Ampere and older NVIDIA cards (A100, consumer RTX 30) lack FP8 tensor cores; frameworks either upcast to FP16 (losing the speed benefit but keeping the size benefit) or refuse to load the checkpoint. Ada cards (RTX 40) do have FP8-capable tensor cores, and FP8 is supported via Transformer Engine on some stacks.
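One way to express this check is by CUDA compute capability. The sketch below assumes the NVIDIA numbering in which Ampere is 8.0 (A100) or 8.6 (RTX 30), Ada is 8.9, and Hopper is 9.0; the function name is hypothetical. It reports hardware capability only and says nothing about whether a particular framework enables FP8 on that GPU.

```python
def has_fp8_tensor_cores(major, minor):
    """True if the compute capability is Ada (8.9), Hopper (9.0), or newer.

    Hardware capability only; framework support is a separate question.
    """
    return (major, minor) >= (8, 9)


# Compute capabilities for a few representative architectures.
examples = {
    "A100 (Ampere)": (8, 0),
    "RTX 30 (Ampere)": (8, 6),
    "RTX 40 (Ada)": (8, 9),
    "H100 (Hopper)": (9, 0),
}
for name, capability in examples.items():
    print(f"{name}: FP8 tensor cores = {has_fp8_tensor_cores(*capability)}")
```

In PyTorch the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`; the sketch takes plain integers so it runs without a GPU.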
