Warmup Steps Calculator

Right-size LR warmup from total steps, batch size and model scale — with the heuristics labs actually use.

Inputs: Total training steps · Run type · Tokens per step (M)

Outputs: Suggested warmup (steps) · Of total run (%) · Warmup tokens (B)

Too little warmup risks early-run divergence and loss spikes; too much wastes the highest-value early tokens at tiny LRs. For fine-tunes, recent work and several lab recipes trend toward warmups SHORTER than the folkloric 2000 steps.

Formula

pretraining: 1% of total steps, floor 1000, cap 5000, i.e. min(5000, max(1000, 0.01 × total steps)) · continued pretraining: ~0.5% of total steps (the model is already conditioned) · fine-tuning: 3% of total steps, floor 20
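A minimal Python sketch of the rules above, not the tool's actual source. The function name `suggest_warmup`, the run-type strings, the rounding, and the clamp to the run length are assumptions made for illustration; the floors, cap and percentages come from the formula line. Warmup tokens assume a constant tokens-per-step, entered in millions and reported in billions.

```python
def suggest_warmup(total_steps: int, run_type: str, tokens_per_step_m: float = 0.0) -> dict:
    """Return suggested warmup steps, share of the run (%), and warmup tokens (B)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    if run_type == "pretraining":
        # 1% of the run, floored at 1000 steps and capped at 5000.
        steps = min(5000, max(1000, round(0.01 * total_steps)))
    elif run_type == "continued":
        # Roughly 0.5%: the model is already conditioned.
        steps = round(0.005 * total_steps)
    elif run_type == "fine-tuning":
        # 3% of the run, floored at 20 steps.
        steps = max(20, round(0.03 * total_steps))
    else:
        raise ValueError(f"unknown run_type: {run_type!r}")

    # Assumption (not in the formula line): never suggest more warmup
    # than the run has steps.
    steps = min(steps, total_steps)

    return {
        "warmup_steps": steps,
        "percent_of_run": 100.0 * steps / total_steps,
        # millions of tokens per step * steps / 1000 = billions of tokens
        "warmup_tokens_b": steps * tokens_per_step_m / 1000.0,
    }
```

For example, `suggest_warmup(500, "fine-tuning")` returns 20 warmup steps (the floor, since 3% of 500 is only 15), while a 200,000-step pretraining run at 4M tokens per step gets 2000 steps, or 8B warmup tokens.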

References: Goyal et al. (2017), "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (gradual warmup); Llama and OLMo published training configurations

About Warmup Steps Calculator

Warmup is the seatbelt of LR schedules — universally worn, rarely sized. Start at full LR and early gradients through randomly-initialized layers can spike the loss irrecoverably; warm up too long and you spend your budget's freshest tokens learning at a crawl. This calculator encodes the working heuristics by run type: ~1% (floor 1000) for from-scratch pretraining, half that for continued pretraining where the network is already conditioned, and a short 3% for fine-tunes — where the folkloric 'warmup_steps=2000' copied from pretraining configs into 500-step LoRA runs means warming up forever, a real and common bug.

How to use Warmup Steps Calculator

1. Enter your values into Warmup Steps Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Warmup Steps Calculator?

✓ Computes warmup steps instantly in your browser: no sign-up, no upload, no server round-trip.
✓ 100% free and unlimited, with the exact formula shown: pretraining: max(1000, 1%) capped 5K.
✓ Runs entirely client-side, so every value you enter stays private on your device.
✓ Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

What actually goes wrong without warmup?

At initialization, Adam's second-moment estimates are uncalibrated and attention/LayerNorm gradients are disproportionately large; full-size first steps can blow activations into saturation, visible as an early loss spike that the run never fully recovers from. Warmup lets those statistics settle while the steps are still too small to do damage.

Is 2000 warmup steps a magic number?

It's an inherited default from the era of million-step pretraining configs (where it's a rounding error). For a 1000-step SFT run it means NEVER reaching peak LR: with a linear ramp, the LR tops out at half of what you set and averages about a quarter of it. Size warmup relative to YOUR total steps; that is this tool's whole purpose.
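The arithmetic for that 1000-step run can be checked with a short sketch. It assumes a linear ramp from zero and a constant LR after warmup; `lr_at_step` is an illustrative helper, not any library's scheduler, and the peak LR is a made-up value.

```python
def lr_at_step(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then constant (assumed shape)."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)

peak_lr = 2e-4        # hypothetical peak LR
total_steps = 1000    # short SFT run
warmup_steps = 2000   # inherited default, longer than the run itself

lrs = [lr_at_step(s, peak_lr, warmup_steps) for s in range(1, total_steps + 1)]

# Largest LR ever reached, as a fraction of the configured peak.
highest_fraction = max(lrs) / peak_lr
# Average LR over the whole run, as a fraction of the configured peak.
mean_fraction = sum(lrs) / len(lrs) / peak_lr

print(f"highest LR reached: {highest_fraction:.2f} x peak")
print(f"mean LR over run:   {mean_fraction:.2f} x peak")
```

Under these assumptions the run ends at 0.50 of the configured peak and averages about 0.25 of it.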

Linear, cosine or exponential warmup?

Shape barely matters; duration does. Linear is the universal default for its simplicity. Exotic ramps appear in papers but ablations consistently show duration (and the peak it ramps to) dominating any curvature effect.
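To make the comparison concrete, here is a sketch of two ramp shapes over the same duration. Both formulas are common textbook forms chosen for illustration, not taken from a specific paper or library, and the function names are hypothetical.

```python
import math

def linear_ramp(step: int, warmup_steps: int) -> float:
    """LR multiplier in [0, 1] rising linearly over warmup_steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def cosine_ramp(step: int, warmup_steps: int) -> float:
    """LR multiplier in [0, 1] following a half-cosine over warmup_steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 1.0
    return 0.5 * (1.0 - math.cos(math.pi * step / warmup_steps))

warmup_steps = 1000
for step in (0, 250, 500, 750, 1000):
    # Columns: step, linear multiplier, cosine multiplier
    print(step,
          round(linear_ramp(step, warmup_steps), 3),
          round(cosine_ramp(step, warmup_steps), 3))
```

Both ramps start at 0, cross 0.5 at the midpoint and reach 1.0 at the same step; they differ only in how the LR is distributed inside the warmup window, whereas changing `warmup_steps` moves the point at which the peak is reached.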

Does bigger batch need longer warmup?

Generally yes: larger batches pair with larger LRs (see our LR-scaling calculator), and bigger steps need a gentler approach. A practical rule: when you scale the batch size by k, scale warmup steps by a factor between √k and k, alongside the LR adjustment.
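The batch-scaling rule of thumb can be sketched as a small helper. The function name and the rounding are assumptions for illustration; the √k and k factors are the two ends of the range given in the answer.

```python
import math

def scale_warmup_for_batch(base_warmup_steps: int, k: float) -> tuple[int, int]:
    """Return (sqrt_rule, linear_rule) warmup steps after scaling batch size by k.

    sqrt_rule multiplies the baseline by sqrt(k); linear_rule multiplies it by k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    sqrt_rule = round(base_warmup_steps * math.sqrt(k))
    linear_rule = round(base_warmup_steps * k)
    return sqrt_rule, linear_rule

# Hypothetical baseline: 1000 warmup steps, batch size scaled 4x.
sqrt_rule, linear_rule = scale_warmup_for_batch(1000, 4)
print(f"warmup range after 4x batch: {sqrt_rule} to {linear_rule} steps")
```

With a 1000-step baseline and a 4x larger batch, this gives a range of 2000 to 4000 warmup steps; where to land inside that range is a judgment call the rule itself does not settle.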
