Learning Rate Scaling Calculator

Scale a known-good LR to a new batch size — linear and square-root rules, with warmup-steps suggestion.

Outputs: scaled LR, batch ratio (×), suggested warmup (steps)

Linear scaling comes from SGD analysis (Goyal et al.'s ImageNet-in-1-hour); for Adam-family optimizers, √-scaling matches both theory and large-scale practice better. Past ~8× increases, neither rule survives without retuning.

Formula

Linear: LR' = LR × (B'/B)
Square-root: LR' = LR × √(B'/B)
LR and B are the known-good learning rate and batch size; B' is the new batch size. Pair any increase with a longer warmup.
References: Goyal et al. (2017), Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour; Malladi et al. (2022), On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
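
The two rules can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name `scale_lr`, its parameters, and the base LR of 3e-4 are assumptions chosen for the example.

```python
import math


def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "sqrt") -> float:
    """Scale a known-good learning rate to a new batch size.

    rule="linear" applies LR' = LR * (B'/B); rule="sqrt" applies LR' = LR * sqrt(B'/B).
    """
    if base_lr <= 0 or base_batch <= 0 or new_batch <= 0:
        raise ValueError("base_lr, base_batch and new_batch must be positive")
    ratio = new_batch / base_batch  # the batch ratio B'/B
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Moving from batch 256 to batch 1024 is a 4x batch ratio.
base_lr = 3e-4  # hypothetical known-good LR, for illustration only
linear_lr = scale_lr(base_lr, 256, 1024, rule="linear")  # 4x the base LR
sqrt_lr = scale_lr(base_lr, 256, 1024, rule="sqrt")      # 2x the base LR
print(f"linear: {linear_lr:.2e}, sqrt: {sqrt_lr:.2e}")
```

Defaulting to the square-root rule reflects the page's advice that it is the smaller, safer adjustment when the optimizer is unknown.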

About Learning Rate Scaling Calculator

You found a learning rate that works at batch 256 and now you're training at 1024 — multiply by 4, by 2, or pray? This calculator applies the two principled answers: the linear rule (gradient noise scales inversely with batch, so step proportionally — solid for SGD) and the square-root rule that better matches Adam-family optimizers in both SDE analysis and large-model practice. It also suggests a warmup length, because every batch-size jump destabilizes early training and warmup is the standard insurance.

How to use Learning Rate Scaling Calculator

  1. Enter your values into the Learning Rate Scaling Calculator. Sensible, domain-typical defaults are pre-filled, so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Learning Rate Scaling Calculator?

  • Computes the scaled learning rate instantly in your browser, with no sign-up, no upload, and no server round-trip.
  • 100% free and unlimited, with the exact formulas shown: LR' = LR × (B'/B) for linear and LR' = LR × √(B'/B) for square-root.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Linear or square-root: how do I choose?

By optimizer. Momentum-SGD pipelines (classic vision) follow the linear rule well up to batch sizes of ~8K; AdamW (the default in most modern LLM work) empirically tracks √-scaling, and Malladi et al. derive why via the SDE limit. When in doubt, √ is the safer (smaller) adjustment.

Why does large-batch training need warmup at all?

Early in training, gradients are large and the loss surface is poorly conditioned; a full-size LR there can cause divergence or permanently damaged representations ('early overfitting'). Linear warmup over the first 1–3% of steps lets curvature stabilize before full step sizes arrive.
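
A linear warmup of this kind can be sketched as follows. The helper names `warmup_steps` and `lr_at_step`, the 2% default (picked from inside the 1–3% range), and the peak LR of 6e-4 are illustrative assumptions; the page does not publish how the calculator derives its warmup suggestion.

```python
def warmup_steps(total_steps: int, fraction: float = 0.02) -> int:
    """Warmup length as a fraction of total training steps (at least 1 step)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, round(total_steps * fraction))


def lr_at_step(step: int, peak_lr: float, n_warmup: int) -> float:
    """Ramp linearly from peak_lr / n_warmup up to peak_lr, then hold constant."""
    if step < n_warmup:
        return peak_lr * (step + 1) / n_warmup
    return peak_lr


total_steps = 10_000
n_warmup = warmup_steps(total_steps)  # 2% of the run (assumed default)
schedule = [lr_at_step(s, peak_lr=6e-4, n_warmup=n_warmup) for s in range(total_steps)]
print(n_warmup, schedule[0], schedule[-1])
```

The sketch holds the LR constant after warmup for brevity; real recipes usually follow warmup with a decay schedule such as cosine or linear decay.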

Is there a batch size beyond which scaling stops helping?

Yes — the critical batch size (McCandlish et al.'s gradient-noise scale): below it, bigger batches buy near-linear wall-clock speedup; above it, returns vanish and you burn compute for nothing. For LLM pretraining it grows during training, hence batch-size ramps in modern recipes.
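
As a rough illustration, the "simple" gradient-noise scale can be estimated from squared gradient norms measured at two batch sizes, using the identity E[|G_B|²] = |G|² + tr(Σ)/B. The sketch below assumes you have already averaged those squared norms over many steps, since single-step values are very noisy; the function name and the synthetic numbers are hypothetical.

```python
def simple_noise_scale(
    small_batch: int,
    small_sq_norm: float,
    big_batch: int,
    big_sq_norm: float,
) -> float:
    """Estimate tr(Sigma) / |G|^2 from squared gradient norms at two batch sizes.

    small_sq_norm and big_sq_norm are (ideally step-averaged) values of |G_B|^2
    measured with batch sizes small_batch and big_batch respectively.
    """
    if not 0 < small_batch < big_batch:
        raise ValueError("need 0 < small_batch < big_batch")
    # Unbiased estimate of the true squared gradient norm |G|^2.
    true_sq_norm = (big_batch * big_sq_norm - small_batch * small_sq_norm) / (
        big_batch - small_batch
    )
    # Unbiased estimate of the per-example gradient variance tr(Sigma).
    trace_cov = (small_sq_norm - big_sq_norm) / (1 / small_batch - 1 / big_batch)
    if true_sq_norm <= 0:
        raise ValueError("estimate too noisy; average the norms over more steps")
    return trace_cov / true_sq_norm


# Synthetic check: |G|^2 = 1 and tr(Sigma) = 512 give
# |G_64|^2 = 1 + 512/64 = 9 and |G_512|^2 = 1 + 512/512 = 2.
noise_scale = simple_noise_scale(64, 9.0, 512, 2.0)
print(noise_scale)
```

A batch size well below the estimate suggests that larger batches should still pay off; well above it, returns diminish.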

Do these rules apply to fine-tuning?

Loosely. Fine-tuning runs are short, learning rates are small (roughly 1e-5 to 2e-4, with LoRA toward the upper end), and the loss surface starts pre-conditioned. The rules give a sane starting point when you change batch size, but a quick LR sweep (3 values, half an epoch) beats any formula at fine-tuning scale.
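
A three-value sweep around the scaled LR might look like the sketch below. The spacing factor of 3, the function name `quick_lr_sweep`, and the `toy_trial` stand-in are assumptions for illustration; in practice `run_trial` would train for about half an epoch at the given LR and return a validation loss.

```python
import math
from typing import Callable


def quick_lr_sweep(
    center_lr: float,
    run_trial: Callable[[float], float],
    factor: float = 3.0,
) -> tuple[float, dict[float, float]]:
    """Try center_lr / factor, center_lr and center_lr * factor; return the best LR and all losses."""
    if center_lr <= 0 or factor <= 1:
        raise ValueError("center_lr must be positive and factor must exceed 1")
    candidates = [center_lr / factor, center_lr, center_lr * factor]
    losses = {lr: run_trial(lr) for lr in candidates}
    best_lr = min(losses, key=losses.get)  # lowest loss wins
    return best_lr, losses


def toy_trial(lr: float) -> float:
    """Stand-in for a short training run: a loss that bottoms out at lr = 2e-4."""
    return (math.log10(lr) - math.log10(2e-4)) ** 2


best_lr, losses = quick_lr_sweep(1e-4, toy_trial)
print(best_lr, losses)
```

Spacing candidates geometrically rather than additively matches how learning-rate sensitivity usually behaves, which is roughly uniform on a log scale.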
