Activation Memory Calculator (Training)

Forward-pass activations a transformer stores for backprop — per layer and total, with and without checkpointing.

Inputs: Layers, Hidden size, Sequence length (tokens), Micro-batch size, Gradient checkpointing

Outputs: Per layer (GB), Total activations (GB)

Defaults (7B-class, 4K sequence, micro-batch 2) show why training OOMs while inference doesn't: 36 GB of activations dwarf the 14 GB of weights. Checkpointing trades ~30% compute for an ~80% activation cut here.
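
The two headline comparisons above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch: the 7e9 parameter count, 32 layers, and 2 bytes per BF16 weight are assumptions chosen to resemble a 7B-class model, not values stated by the calculator, and the helper names are hypothetical.

```python
import math

# Assumed 7B-class shape (hypothetical; the page does not state these values).
PARAMS = 7e9          # parameter count
BYTES_PER_PARAM = 2   # BF16 weights
LAYERS = 32


def weight_gb(params: float = PARAMS, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Weight memory in decimal GB."""
    return params * bytes_per_param / 1e9


def checkpoint_cut(layers: int = LAYERS) -> float:
    """Fraction of activation memory removed when roughly sqrt(L) + 1
    layers' worth of activations is kept instead of all L layers."""
    kept = math.sqrt(layers) + 1
    return 1 - kept / layers


print(f"weights: {weight_gb():.0f} GB")
print(f"activation cut: {checkpoint_cut():.0%}")
```

Under these assumptions the weights come to 14 GB and the cut to a little under 80%, which lines up with the figures quoted for the defaults. The cut depends only on the layer count, since sequence length, batch size and hidden size cancel out of the ratio.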

Formula

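activations ≈ 34·s·b·H bytes/layer (BF16, SwiGLU block incl. attention intermediates) × layers (or × ~(√L + 1) with full checkpointing), where s = sequence length in tokens, b = micro-batch size, H = hidden size, and L = number of layers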
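
As a rough illustration, the formula can be written as a small function. This is a sketch rather than the page's actual implementation: the function name, the use of decimal GB (1e9 bytes), and the example shape of 32 layers with hidden size 4096 are assumptions.

```python
import math

BYTES_PER_TOKEN_PER_HIDDEN = 34  # the page's BF16 per-layer estimate


def activation_memory_gb(layers, hidden, seq_len, micro_batch, checkpointing=False):
    """Return (per_layer_gb, total_gb) using activations ≈ 34·s·b·H bytes/layer.

    With checkpointing, the total is scaled by about sqrt(L) + 1 layers
    instead of L, as in the formula above. GB here means 1e9 bytes.
    """
    per_layer_bytes = BYTES_PER_TOKEN_PER_HIDDEN * seq_len * micro_batch * hidden
    effective_layers = math.sqrt(layers) + 1 if checkpointing else layers
    total_bytes = per_layer_bytes * effective_layers
    return per_layer_bytes / 1e9, total_bytes / 1e9


# Assumed 7B-class shape: 32 layers, hidden 4096, 4K sequence, micro-batch 2.
per_layer, total = activation_memory_gb(32, 4096, 4096, 2)
_, total_ckpt = activation_memory_gb(32, 4096, 4096, 2, checkpointing=True)
print(f"per layer: {per_layer:.2f} GB, total: {total:.1f} GB, checkpointed: {total_ckpt:.1f} GB")
```

With that assumed shape the sketch gives about 1.14 GB per layer, 36.5 GB in total, and 7.6 GB with checkpointing.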

References: Korthikanti et al. (2022), Reducing Activation Recomputation in Large Transformer Models; Chen et al. (2016), Training Deep Nets with Sublinear Memory Cost


About Activation Memory Calculator (Training)

Everyone budgets weights and optimizer states; activations are what actually blow up training memory. Every intermediate tensor of the forward pass — attention scores, SwiGLU gates, layer outputs — must survive until backward consumes it, scaling with batch × sequence × hidden × layers. This calculator uses the standard ~34 bytes-per-token-per-hidden estimate for BF16 transformer blocks and shows the checkpointing alternative that stores only √L boundary activations, recomputing the rest. The defaults explain the eternal mystery of why a model that infers in 16 GB demands 80 GB to fine-tune.

How to use Activation Memory Calculator (Training)

1. Enter your values into Activation Memory Calculator (Training). Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Activation Memory Calculator (Training)?

✓ Computes activation memory instantly in your browser, with no sign-up, no upload, and no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: activations ≈ 34·s·b·H bytes/layer.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Where does the 34 bytes/token/hidden figure come from?

It comes from counting a standard block's live BF16 tensors: QKV projections, attention output, two LayerNorm inputs, SwiGLU's gate/up/down intermediates (the big item at F≈3.5H) and residuals. The Megatron activation-recomputation paper (Korthikanti et al., 2022) derives it in detail. The s² attention-score term is excluded here because FlashAttention avoids storing it.
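
For orientation, the constant can be tallied term by term. The sketch below follows the per-layer accounting in Korthikanti et al. (2022) as best recalled here. That accounting assumes 16-bit activations, a 4H-wide GELU MLP and 1-byte dropout masks rather than a SwiGLU block, so it shows where the constant originates and is not an exact SwiGLU count. The dictionary labels are informal descriptions, not names from the paper.

```python
# Per-layer activation terms in units of s·b·h bytes (16-bit activations).
# The s² attention-score terms are deliberately left out, matching the
# calculator's assumption that they are not stored.
ATTENTION = {
    "QKV projection input": 2,
    "Q and K kept for QK^T": 4,
    "V kept for attention over values": 2,
    "output projection input": 2,
    "attention dropout mask (1 byte/element)": 1,
}
MLP = {
    "first linear input": 2,
    "GELU input (4h wide)": 8,
    "second linear input (4h wide)": 8,
    "MLP dropout mask (1 byte/element)": 1,
}
LAYERNORM = {
    "pre-attention LayerNorm input": 2,
    "pre-MLP LayerNorm input": 2,
}


def coefficient() -> int:
    """Sum all terms to get the bytes-per-token-per-hidden constant."""
    return sum(ATTENTION.values()) + sum(MLP.values()) + sum(LAYERNORM.values())


print(sum(ATTENTION.values()), sum(MLP.values()), sum(LAYERNORM.values()), coefficient())
```

The attention, MLP and LayerNorm groups contribute 11, 19 and 4 respectively, which sum to 34.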

Why does sequence length hurt training memory more than batch size?

They multiply identically in this formula, but sequence length also grows attention compute quadratically, and you typically can't shard it the way you shard batch across GPUs. Once the micro-batch is already 1, doubling context at fixed tokens-per-step still doubles per-GPU activation memory.
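
A small sketch makes the asymmetry concrete. It assumes plain data parallelism with no sequence or context parallelism, a fixed per-GPU token budget per step, and a hypothetical 32-layer, hidden-4096 model. The function name and defaults are illustrative only.

```python
def per_gpu_activation_gb(seq_len, tokens_per_gpu_step, hidden=4096, layers=32):
    """Activation GB on one GPU when the micro-batch is chosen to hit a token budget.

    The micro-batch cannot drop below one sequence, so once seq_len exceeds
    the budget the sequence itself sets the memory floor.
    """
    micro_batch = max(1, tokens_per_gpu_step // seq_len)
    return 34 * seq_len * micro_batch * hidden * layers / 1e9


# Same 8192-token budget per GPU, increasing context length.
for seq_len in (4096, 8192, 16384):
    print(seq_len, round(per_gpu_activation_gb(seq_len, 8192), 1))
```

Going from 4096 to 8192 tokens leaves memory unchanged because the micro-batch halves from 2 to 1. Going from 8192 to 16384 doubles it, because there is no batch dimension left to shrink.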

What exactly does gradient checkpointing store?

Only the inputs at checkpoint boundaries (classically every layer, optimally every √L layers). During backward, each segment's forward is recomputed from its checkpoint to regenerate the needed intermediates. Cost: one extra partial forward ≈ 30% more compute.
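
The √L spacing can be seen by brute force. The sketch below uses the usual simplification from Chen et al. (2016): every stored boundary and every recomputed layer counts as one equal unit of memory. In practice a boundary input is much smaller than a full layer's activations, which the hypothetical `boundary_cost` parameter lets you model. All names here are illustrative.

```python
import math


def peak_units(num_layers, segment, boundary_cost=1.0):
    """Peak memory in layer-equivalents: one stored input per segment,
    plus one segment's worth of recomputed activations during backward."""
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints * boundary_cost + segment


def best_segment(num_layers, boundary_cost=1.0):
    """Segment length (in layers) that minimizes peak_units, found by exhaustive search."""
    return min(
        range(1, num_layers + 1),
        key=lambda k: peak_units(num_layers, k, boundary_cost),
    )


L = 64
k = best_segment(L)
print(k, peak_units(L, k), peak_units(L, L))
```

For 64 layers the search lands on segments of 8 layers and a peak of 16 units, against 65 units when the whole stack is one segment. Lowering `boundary_cost` shifts the optimum toward shorter segments, which is one reason per-layer checkpointing is common in practice.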

Does LoRA reduce activation memory?

Barely — the frozen base model's forward activations are still needed to backprop into the adapters. LoRA's savings are in optimizer states and gradients. To cut activations you need shorter sequences, smaller micro-batches, or checkpointing.
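
A back-of-envelope comparison shows where LoRA's savings land. Every number in this sketch is an assumption: BF16 gradients (2 bytes) and two FP32 Adam moments (8 bytes) per trainable parameter, rank-16 adapters on four hidden×hidden attention projections per layer, and a 32-layer, hidden-4096 model with 7e9 parameters. The small extra activations of the adapters themselves are ignored.

```python
def training_state_gb(trainable_params, grad_bytes=2, optimizer_bytes=8):
    """Gradients plus Adam moments for the trainable parameters only, in decimal GB."""
    return trainable_params * (grad_bytes + optimizer_bytes) / 1e9


def lora_params(hidden, layers, rank, matrices_per_layer=4):
    """Adapter parameter count, assuming square hidden x hidden targets:
    each target gets A (hidden x rank) and B (rank x hidden)."""
    return layers * matrices_per_layer * 2 * hidden * rank


def activation_gb(layers, hidden, seq_len, micro_batch):
    """Activation estimate from the page's 34·s·b·H bytes/layer formula."""
    return 34 * seq_len * micro_batch * hidden * layers / 1e9


full_ft = training_state_gb(7e9)                     # all weights trainable
lora = training_state_gb(lora_params(4096, 32, 16))  # adapters only
acts = activation_gb(32, 4096, 4096, 2)              # same in both cases

print(f"grads+optimizer: full {full_ft:.1f} GB vs LoRA {lora:.2f} GB")
print(f"activations in both cases: {acts:.1f} GB")
```

Under these assumptions gradient and optimizer state falls from about 70 GB to well under 1 GB, while the activation estimate stays at roughly 36.5 GB.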
