Learning Rate Scaling Calculator
Scale a known-good LR to a new batch size — linear and square-root rules, with warmup-steps suggestion.
Linear scaling comes from SGD analysis (Goyal et al.'s ImageNet-in-1-hour); for Adam-family optimizers, √-scaling matches both theory and large-scale practice better. Past ~8× increases, neither rule survives without retuning.
Formula
Linear: LR' = LR × (B'/B). Square-root: LR' = LR × √(B'/B).
About Learning Rate Scaling Calculator
You found a learning rate that works at batch 256 and now you're training at 1024 — multiply by 4, by 2, or pray? This calculator applies the two principled answers: the linear rule (gradient noise scales inversely with batch, so step proportionally — solid for SGD) and the square-root rule that better matches Adam-family optimizers in both SDE analysis and large-model practice. It also suggests a warmup length, because every batch-size jump destabilizes early training and warmup is the standard insurance.
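The arithmetic described above can be sketched in a few lines of Python. This is an illustration only: the function name `scale_lr`, its parameters, and the 2% default warmup fraction are assumptions made for the example, not the calculator's actual implementation.

```python
import math


def scale_lr(base_lr, base_batch, new_batch, total_steps, warmup_frac=0.02):
    """Scale a known-good learning rate to a new batch size.

    warmup_frac is an assumed default (2% of total steps); tune it for your run.
    """
    if base_lr <= 0 or base_batch <= 0 or new_batch <= 0:
        raise ValueError("learning rate and batch sizes must be positive")
    ratio = new_batch / base_batch
    return {
        "ratio": ratio,
        "linear_lr": base_lr * ratio,           # LR' = LR * (B'/B)
        "sqrt_lr": base_lr * math.sqrt(ratio),  # LR' = LR * sqrt(B'/B)
        "warmup_steps": max(1, round(total_steps * warmup_frac)),
    }


# Batch 256 -> 1024 is a 4x increase: linear multiplies the LR by 4, square-root by 2.
result = scale_lr(base_lr=1e-3, base_batch=256, new_batch=1024, total_steps=10_000)
print(result)
```

With a base LR of 1e-3, the linear rule lands at 4e-3 and the square-root rule at 2e-3, which is the "multiply by 4 or by 2" choice from the paragraph above.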
How to use Learning Rate Scaling Calculator
1. Enter your values into the calculator; sensible, domain-typical defaults are pre-filled so you see a real result immediately.
2. The result recomputes live using the formula shown on the page; there is no button to press.
3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.
Why use Learning Rate Scaling Calculator?
- ✓ Computes the scaled learning rate instantly in your browser: no sign-up, no upload, no server round-trip.
- ✓ 100% free and unlimited, with the exact formula shown (linear: LR' = LR × (B'/B)).
- ✓ Runs entirely client-side, so every value you enter stays private on your device.
- ✓ Live recompute as you type, with a worked example and authoritative references for trust.
Frequently asked questions
Linear or square-root: how do I choose?
By optimizer: momentum-SGD pipelines (classic vision) follow the linear rule well up to batch sizes of ~8K; AdamW (the default in most modern LLM work) empirically tracks √-scaling, and Malladi et al. derive why via the SDE limit. When in doubt, √ is the safer (smaller) adjustment when the batch grows.
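A minimal sketch of that decision as code, assuming a hypothetical lookup table `RULE_BY_OPTIMIZER` and helper `suggest_lr` (neither is part of the calculator or of any library):

```python
import math

# Assumed mapping for illustration: which rule to try first per optimizer family.
RULE_BY_OPTIMIZER = {
    "sgd": "linear",
    "momentum_sgd": "linear",
    "adam": "sqrt",
    "adamw": "sqrt",
}


def suggest_lr(optimizer, base_lr, base_batch, new_batch):
    """Return (rule, scaled_lr) for the given optimizer name.

    Unknown optimizers fall back to the square-root rule, which is the
    smaller adjustment when the batch size increases.
    """
    rule = RULE_BY_OPTIMIZER.get(optimizer.lower(), "sqrt")
    ratio = new_batch / base_batch
    factor = ratio if rule == "linear" else math.sqrt(ratio)
    return rule, base_lr * factor


print(suggest_lr("sgd", 0.1, 256, 1024))     # linear rule, 4x the base LR
print(suggest_lr("AdamW", 3e-4, 256, 1024))  # square-root rule, 2x the base LR
```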
Why does large-batch training need warmup at all?
Early training has enormous gradients through a poorly-conditioned loss surface; a full-size LR there causes divergence or permanently damaged representations ('early overfitting'). Linear warmup over 1–3% of steps lets curvature stabilize before full step sizes arrive.
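For illustration, linear warmup can be written as a plain function of the step index. The name `warmup_lr` and the 2% default fraction are assumptions for this sketch; it holds the LR constant after warmup, whereas real recipes usually follow warmup with a decay schedule.

```python
def warmup_lr(step, peak_lr, total_steps, warmup_frac=0.02):
    """Learning rate at a 0-indexed step: linear ramp to peak_lr, then constant.

    warmup_frac is an assumed default inside the 1-3% range mentioned above.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp so that the last warmup step reaches exactly peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# 1000 total steps at 2% gives a 20-step warmup.
for s in (0, 9, 19, 500):
    print(s, warmup_lr(s, peak_lr=1e-3, total_steps=1000))
```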
Is there a batch size beyond which scaling stops helping?
Yes — the critical batch size (McCandlish et al.'s gradient-noise scale): below it, bigger batches buy near-linear wall-clock speedup; above it, returns vanish and you burn compute for nothing. For LLM pretraining it grows during training, hence batch-size ramps in modern recipes.
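One way to get a rough number for this is the "simple" noise scale, tr(Σ)/|G|², estimated from squared gradient norms measured at two batch sizes. The sketch below follows my reading of the two-batch estimator in McCandlish et al. and relies on the identity E[|G_B|²] = |G|² + tr(Σ)/B; treat the formulas as an assumption to check against the paper. The function name and arguments are hypothetical.

```python
def simple_noise_scale(gsq_small, gsq_big, b_small, b_big):
    """Estimate tr(Sigma) / |G|^2 from squared gradient norms at two batch sizes.

    gsq_small and gsq_big are squared L2 norms of the batch-averaged gradient
    at batch sizes b_small and b_big, ideally averaged over many batches.
    """
    if not 0 < b_small < b_big:
        raise ValueError("need 0 < b_small < b_big")
    # Solve E[|G_B|^2] = |G|^2 + tr(Sigma) / B for the two unknowns.
    g_sq = (b_big * gsq_big - b_small * gsq_small) / (b_big - b_small)
    trace_sigma = (gsq_small - gsq_big) / (1.0 / b_small - 1.0 / b_big)
    if g_sq <= 0:
        raise ValueError("estimate of |G|^2 is not positive; average over more batches")
    return trace_sigma / g_sq


# Synthetic check: |G|^2 = 4 and tr(Sigma) = 100 imply norms of 14 (B=10) and 5 (B=100).
print(simple_noise_scale(gsq_small=14.0, gsq_big=5.0, b_small=10, b_big=100))
```

Single-batch measurements are very noisy, so in practice the two squared norms would be smoothed over many steps before taking the ratio.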
Do these rules apply to fine-tuning?
Loosely. Fine-tuning runs are short, LRs are tiny (1e-5 to 2e-4 for LoRA), and the loss surface starts pre-conditioned. The rules give a sane starting point when you change batch, but a quick LR sweep (3 values, half an epoch) beats any formula at fine-tuning scale.
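A three-value sweep can be generated around whatever the scaling rule suggests. The helper name `lr_sweep` and the 3× spacing are assumptions for this sketch; any geometric spacing would serve.

```python
def lr_sweep(center_lr, factor=3.0):
    """Three candidate learning rates spaced geometrically around center_lr."""
    if center_lr <= 0 or factor <= 1:
        raise ValueError("center_lr must be positive and factor must exceed 1")
    return [center_lr / factor, center_lr, center_lr * factor]


# Candidates around a square-root-scaled starting point of 2e-4.
print(lr_sweep(2e-4))
```

Each candidate would then be trained for the short budget mentioned above (about half an epoch) and the one with the best validation loss kept.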
Related tools
- RTX 4060 Ti 16GB LLM Calculator — What Can It Run?
- RTX 3090 24GB LLM Calculator — What Can It Run?
- RTX 4090 24GB LLM Calculator — What Can It Run?
- RTX 5090 32GB LLM Calculator — What Can It Run?
- RTX A6000 48GB LLM Calculator — What Can It Run?
- Tesla T4 16GB LLM Calculator — What Can It Run?
- Log Loss / Cross-Entropy Calculator
- Brier Score Calculator
Related ML & AI tools
ROC-AUC Calculator (from TPR/FPR points)
Trapezoidal area under the ROC curve from your (FPR, TPR) operating points — the threshold-independent ranking score.
Classification Threshold Cost Calculator
Find the probability cutoff that minimizes expected cost given your false-positive and false-negative penalties.
Silhouette Score Calculator
Cluster cohesion vs separation for one point — the building block of the silhouette metric for choosing K.