Conversation
No warmup needed, since you have a constant learning rate.
If you can actually use the validation data from T0, then I'd say this is better.
For that, I think either a) or c) is best - wdyt?
We probably need to use this: it's already implemented as an API https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c5b88fb92d4417f77d729c95ce95e3a740b47065/megatron/arguments.py#L822-L840. I'll update the T0 branch to have that feature.
train/t0/tr11f-6B3-ml-t0.slurm
Outdated
    --adam-eps 1e-8 \
    --lr 1e-3 \
    --lr-decay-style constant \
    --lr-warmup-samples $LR_WARMUP_SAMPLES \
We used Adafactor ... so technically I don't know which parameters matter (typically we used a decay argument, and I don't know how that translates to the Adam optimizer).
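For concreteness, a minimal sketch of how this block could look with the constant schedule and no warmup mentioned above; the OPTIMIZER_ARGS name, the 1e-5 value (taken from the notes below), and LR_WARMUP_SAMPLES=0 are assumptions here, not settled choices:

```bash
# Sketch only: constant LR with warmup disabled, mirroring the flags quoted above.
LR_WARMUP_SAMPLES=0   # no warmup needed with a constant schedule

# OPTIMIZER_ARGS is an assumed variable name; the LR value is still under discussion.
OPTIMIZER_ARGS=" \
    --adam-eps 1e-8 \
    --lr 1e-5 \
    --lr-decay-style constant \
    --lr-warmup-samples $LR_WARMUP_SAMPLES \
    "
```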
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
thomasw21 left a comment
Looked mostly at the 6B3 config. Seems alright. Thanks!
| " | ||
|
|
||
| export CMD=" \ | ||
| `pwd`/finetune_t0_non_causal_decoder.py \ |
Suggested change:
-    `pwd`/finetune_t0_non_causal_decoder.py \
+    `pwd`/finetune_t0_causal_decoder.py \
Right now all the scripts use is_causal=True, so we should rename this in the Meg-DS PR.
Added an arg here; let's merge that PR first before we merge here.
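Purely as an illustration of that direction, a hypothetical launch line that keeps the existing script name and selects causal behaviour via an argument; the --causal flag name is made up here, not the actual Meg-DS argument:

```bash
# Hypothetical sketch: the real flag is defined in the Meg-DS PR referenced above;
# --causal below is an invented placeholder name.
export CMD=" \
    `pwd`/finetune_t0_non_causal_decoder.py \
    --causal \
    "
```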
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
Already merged via the other PR.
Notes:
RE: Learning Rate
T0 & FLAN use Adafactor, which automatically adjusts the step size:
"Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves."

Due to this scaling, Adafactor may be more resistant to higher learning rates, and the step size adjusts automatically, so scheduling may be less needed (i.e. if you have weight decay with Adafactor, the step size will automatically decay because the parameters decay). For now I'm keeping a constant, conservative LR of 1e-5, but we may want to instead go higher and add warmup + scheduling. Thoughts?
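For reference, a sketch of the step-size rule as I recall it from the Adafactor paper (Shazeer & Stern, 2018), where $\rho_t$ is the relative step size and $\epsilon_2$ a small constant; this is my paraphrase rather than a verified transcription:

$$\alpha_t = \max\bigl(\epsilon_2,\ \operatorname{RMS}(\theta_{t-1})\bigr)\,\rho_t$$

The absolute step size $\alpha_t$ tracks the RMS of the parameters, so if weight decay shrinks the parameters, the effective step size shrinks with them.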