
Warmup Steps Calculator

Right-size LR warmup from total steps, batch size and model scale — with the heuristics labs actually use.

Outputs: suggested warmup (steps), share of total run (%), warmup tokens (B)

Too little warmup risks early-run divergence and loss spikes; too much wastes the highest-value early tokens at tiny LRs. Recent work (and several lab recipes) trends toward SHORTER warmups than the folkloric 2000 steps for fine-tunes.

Formula

pretraining: max(1000, 1% of total steps), capped at 5K · continued pretraining: ~0.5% of total steps (the model is already conditioned) · fine-tuning: 3% of total steps, floor 20
References: Goyal et al. (2017), "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (gradual warmup); Llama / OLMo published training configurations
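As a minimal sketch of the formula above, the three rules might be coded like this in Python. The function names, the `run_type` strings and the `tokens_per_step` parameter are illustrative assumptions, not the calculator's actual source. The page does not state a floor for continued pretraining, or what happens when a floor exceeds the run length, so the clamps below are guesses. Integer division is used, so percentages round down.

```python
def suggested_warmup_steps(total_steps: int, run_type: str) -> int:
    """Heuristic warmup length by run type (hypothetical helper)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    if run_type == "pretraining":
        # max(1000, 1% of total steps), capped at 5000
        steps = min(max(1000, total_steps // 100), 5000)
    elif run_type == "continued":
        # ~0.5% of total steps; the floor of 1 is an assumption
        steps = max(1, total_steps // 200)
    elif run_type == "fine-tuning":
        # 3% of total steps, floor 20
        steps = max(20, 3 * total_steps // 100)
    else:
        raise ValueError(f"unknown run_type: {run_type!r}")

    # Assumption: warmup is never longer than the run itself.
    return min(steps, total_steps)


def warmup_summary(total_steps: int, run_type: str, tokens_per_step: int) -> dict:
    """Warmup steps, share of the run, and warmup tokens in billions."""
    steps = suggested_warmup_steps(total_steps, run_type)
    return {
        "warmup_steps": steps,
        "percent_of_run": 100.0 * steps / total_steps,
        "warmup_tokens_billions": steps * tokens_per_step / 1e9,
    }
```

For example, `warmup_summary(200_000, "pretraining", 4_000_000)` gives 2000 warmup steps, 1% of the run and 8B warmup tokens.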

About Warmup Steps Calculator

Warmup is the seatbelt of LR schedules: universally worn, rarely sized. Start at full LR and early gradients through randomly-initialized layers can spike the loss irrecoverably; warm up too long and you spend your budget's freshest tokens learning at a crawl. This calculator encodes the working heuristics by run type: ~1% (floor 1000, cap 5K) for from-scratch pretraining, ~0.5% for continued pretraining where the network is already conditioned, and a short 3% (floor 20) for fine-tunes. That last case is where the folkloric 'warmup_steps=2000', copied from pretraining configs into 500-step LoRA runs, means the run ends a quarter of the way up the ramp: a real and common bug.

How to use Warmup Steps Calculator

  1. Enter your values into Warmup Steps Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Warmup Steps Calculator?

  • Computes Warmup Steps instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact formula shown (for pretraining: max(1000, 1% of total steps), capped at 5K).
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

What actually goes wrong without warmup?

At initialization, Adam's second-moment estimates are uncalibrated and attention/LayerNorm gradients are disproportionately large; full-size first steps can blow activations into saturation — visible as an early loss spike that the run never fully recovers from. Warmup lets statistics settle while steps are harmless.

Is 2000 warmup steps a magic number?

It's an inherited default from the era of million-step pretraining configs, where it's a rounding error. For a 1000-step SFT run it means NEVER reaching peak LR: with a linear ramp the LR tops out at half of what you set, and averages about a quarter of it over the run. Size warmup relative to YOUR total steps; that is this tool's whole purpose.
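The arithmetic behind this answer can be checked with a short sketch. It assumes a plain linear ramp that holds at peak once warmup ends (no decay phase), and the function names are made up for illustration.

```python
def linear_warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """LR at a 0-indexed step: linear ramp to peak_lr, then held constant."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


def warmup_report(total_steps: int, warmup_steps: int, peak_lr: float) -> dict:
    """Highest and average LR actually used, as fractions of peak_lr."""
    lrs = [linear_warmup_lr(s, peak_lr, warmup_steps) for s in range(total_steps)]
    return {
        "max_lr_fraction": max(lrs) / peak_lr,
        "mean_lr_fraction": sum(lrs) / len(lrs) / peak_lr,
    }


# A 1000-step run with the copied-over warmup_steps=2000:
report = warmup_report(total_steps=1000, warmup_steps=2000, peak_lr=2e-5)
print(report)
```

Under these assumptions the run ends halfway up the ramp, so `max_lr_fraction` comes out at 0.5 and `mean_lr_fraction` at roughly 0.25.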

Linear, cosine or exponential warmup?

Shape barely matters; duration does. Linear is the universal default for its simplicity. Exotic ramps appear in papers but ablations consistently show duration (and the peak it ramps to) dominating any curvature effect.
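To make the comparison concrete, here is one way the three ramps could be written as multipliers on the peak LR. This is only a sketch: the exponential form and its `sharpness` parameter are assumptions, since papers define that ramp in different ways. All three start at 0 and reach 1 at the same step, so only the curvature in between differs.

```python
import math


def warmup_factor(step: int, warmup_steps: int, shape: str = "linear",
                  sharpness: float = 5.0) -> float:
    """Multiplier in [0, 1] applied to the peak LR during warmup."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 1.0
    t = step / warmup_steps  # progress through warmup, in [0, 1)
    if shape == "linear":
        return t
    if shape == "cosine":
        # Half-cosine ramp: slow start, fast middle, slow finish.
        return 0.5 * (1.0 - math.cos(math.pi * t))
    if shape == "exponential":
        # Assumed form: normalized exponential, slow start and fast finish.
        return (math.exp(sharpness * t) - 1.0) / (math.exp(sharpness) - 1.0)
    raise ValueError(f"unknown shape: {shape!r}")


# Same duration, different curvature, at the midpoint of a 100-step warmup:
for shape in ("linear", "cosine", "exponential"):
    print(shape, warmup_factor(50, 100, shape))
```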

Does bigger batch need longer warmup?

Generally yes — larger batches pair with larger LRs (see our LR-scaling calculator), and bigger steps need a gentler approach. A practical rule: when you scale batch by k, scale warmup steps by √k to k alongside the LR adjustment.
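The rule of thumb above translates into a range rather than a single number. A small sketch, with a hypothetical helper name and simple rounding as an assumption:

```python
import math


def scaled_warmup_range(base_warmup_steps: int, batch_scale: float) -> tuple:
    """Warmup-step range after scaling the batch by batch_scale (k).

    Returns (low, high), spanning the sqrt(k) and k scalings of the
    original warmup length.
    """
    if batch_scale <= 0:
        raise ValueError("batch_scale must be positive")
    by_sqrt = round(base_warmup_steps * math.sqrt(batch_scale))
    by_linear = round(base_warmup_steps * batch_scale)
    # For k < 1 the sqrt(k) scaling is the larger one, so sort the pair.
    return (min(by_sqrt, by_linear), max(by_sqrt, by_linear))


# Scaling the batch 4x from a 1000-step warmup:
print(scaled_warmup_range(1000, 4))
```

With a 4x batch, the 1000-step warmup becomes somewhere between 2000 and 4000 steps; where to land inside that range is a judgment call this sketch does not make.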
