Cosine LR Schedule Calculator

Learning rate at any step of a warmup-plus-cosine schedule — the default decay of modern LLM training.

Outputs: LR at this step, percentage of peak, and phase.

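Defaults mirror a Llama-style recipe: 2K warmup steps, then cosine decay from a 3e-4 peak to 10% of peak, over 100K total steps. With these defaults the half-cosine spends roughly 39% of the decay phase above 70% of peak LR, versus about 33% for linear decay to the same floor. It runs 'hotter' than linear through the first half of the decay and cooler through the second, which is one commonly cited reason it is preferred.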

Formula

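warmup (step < warmup): LR = peak·(step/warmup)
decay (step ≥ warmup): LR = peak·[min + (1−min)·½(1+cos(π·t))], where t = (step − warmup)/(total − warmup) is progress after warmup and min is the final fraction of peak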
References: Loshchilov & Hutter (2017), SGDR: Stochastic Gradient Descent with Warm Restarts; Touvron et al. (2023), LLaMA: Open and Efficient Foundation Language Models (training recipe)
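
The formula translates almost directly into code. The following is a minimal Python sketch, not the calculator's actual source: the function name `cosine_lr` and its parameter names are illustrative, the defaults are the ones quoted on this page, and clamping progress to 1 past the final step is an added assumption.

```python
import math

def cosine_lr(step, peak=3e-4, warmup=2_000, total=100_000, min_frac=0.1):
    """Learning rate at `step` for linear warmup followed by a half-cosine decay."""
    if step < warmup:
        # Linear ramp from 0 up to peak over the warmup steps.
        return peak * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1] (assumption:
    # steps past `total` simply stay at the floor).
    t = min(1.0, (step - warmup) / max(1, total - warmup))
    return peak * (min_frac + (1.0 - min_frac) * 0.5 * (1.0 + math.cos(math.pi * t)))

# Print the LR at a few steps across warmup, mid-decay and the final step.
for s in (0, 1_000, 2_000, 51_000, 100_000):
    print(s, f"{cosine_lr(s):.3e}")
```

Setting `min_frac=0.0` gives the anneal-to-zero variant mentioned on this page.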

About Cosine LR Schedule Calculator

Many modern LLM families (Llama, Mistral and Qwen are the usual examples) reportedly train on much the same schedule: linear warmup to a peak, then a single half-cosine down to about 10% of peak. This calculator evaluates that curve at any step, so you can resume runs at the right LR, sanity-check a logged value against the recipe, or compare how much 'heat' cosine keeps versus linear decay (with a 10% floor, cosine stays above 70% of peak for about 39% of the decay phase, linear for about 33%). Set the final fraction to zero for the original SGDR-style anneal, or query the warmup region to verify your ramp.
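
The 'heat' comparison can be checked numerically. The sketch below samples the decay phase (warmup excluded) and counts how much of it sits at or above a chosen fraction of peak, for a cosine decay and for a linear decay to the same floor. The helper names `frac_of_peak` and `share_above` are hypothetical and not part of this tool.

```python
import math

def frac_of_peak(t, min_frac, shape):
    """LR as a fraction of peak at decay progress t in [0, 1]."""
    if shape == "cosine":
        return min_frac + (1.0 - min_frac) * 0.5 * (1.0 + math.cos(math.pi * t))
    # Linear decay from peak down to the same floor.
    return min_frac + (1.0 - min_frac) * (1.0 - t)

def share_above(threshold, min_frac=0.1, shape="cosine", n=100_000):
    """Share of the decay phase whose LR is at or above threshold * peak."""
    # Sample the midpoint of each of n equal slices of the decay phase.
    hits = sum(
        frac_of_peak((i + 0.5) / n, min_frac, shape) >= threshold
        for i in range(n)
    )
    return hits / n

for shape in ("cosine", "linear"):
    print(shape, round(share_above(0.7, shape=shape), 3))
```

Changing `threshold` or `min_frac` shows how the gap between the two shapes depends on where the line is drawn.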

How to use Cosine LR Schedule Calculator

  1. Enter your values into Cosine LR Schedule Calculator. Sensible, domain-typical defaults are pre-filled so you see a real result immediately.
  2. The result recomputes live using the formula shown on the page; there is no button to press.
  3. Adjust any input to compare scenarios, then read the worked example to see the substituted numbers.

Why use Cosine LR Schedule Calculator?

  • Computes Cosine LR Schedule instantly in your browser — no sign-up, no upload, no server round-trip.
  • 100% free and unlimited, with the exact warmup and decay formulas shown on the page.
  • Runs entirely client-side, so every value you enter stays private on your device.
  • Live recompute as you type, with a worked example and authoritative references for trust.

Frequently asked questions

Why cosine instead of linear or step decay?

Empirically it keeps LR high through the productive middle of training and lands softly at the end, beating step schedules in most LLM ablations. Its single hyperparameter (final fraction) is also harder to mis-tune than step-decay milestones.

Why decay to 10% of peak instead of zero?

Decaying fully to zero spends the final tokens on near-frozen weights. Chinchilla and Llama both decayed to a 10% floor, which keeps the model learning until the end of a fixed token budget. Zero floors are more typical of warm-restart (SGDR) setups.

I'm resuming a crashed run: what LR do I set?

Enter the original peak, warmup, total steps and your resume step: the output is the LR your scheduler should emit. Mistakes here, such as resuming at peak instead of mid-decay, can quietly degrade a run's final quality, so this check is worth thirty seconds.
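
The same check can be scripted next to your training logs. This is a sketch under assumptions: `expected_lr` and `check_resume` are hypothetical helpers, the 0.1% tolerance is arbitrary, and frameworks differ on whether the step counter starts at 0 or 1, so a mismatch of exactly one step is worth ruling out before suspecting the schedule.

```python
import math

def expected_lr(step, peak, warmup, total, min_frac=0.1):
    """Warmup-plus-cosine LR at `step`, following the formula on this page."""
    if step < warmup:
        return peak * step / warmup
    t = min(1.0, (step - warmup) / max(1, total - warmup))
    return peak * (min_frac + (1.0 - min_frac) * 0.5 * (1.0 + math.cos(math.pi * t)))

def check_resume(logged_lr, step, peak, warmup, total, min_frac=0.1, rel_tol=1e-3):
    """Return (ok, expected): does the logged LR match the schedule at `step`?"""
    expected = expected_lr(step, peak, warmup, total, min_frac)
    return math.isclose(logged_lr, expected, rel_tol=rel_tol), expected

# Example: a run resumed at step 60,000 whose scheduler reset to peak LR.
ok, expected = check_resume(
    logged_lr=3e-4, step=60_000, peak=3e-4, warmup=2_000, total=100_000
)
print("match" if ok else "MISMATCH", f"expected {expected:.3e}")
```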

What about WSD (warmup-stable-decay) schedules?

WSD holds LR constant after warmup and decays only in a short final phase, letting you branch checkpoints at any point of the stable phase and 'finish' them cheaply. MiniCPM and DeepSeek have used it or close variants. Cosine remains the default when total steps are known up front.
