Conversation
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0]]]]
Suggested change:
- [0, 0, 0, 0, 0, 0, 0]]]]
+ [0, 0, 0, 0, 0, 0, 1]]]]
I don't think there is a 1, because the last row & column are 100% padding.
Hmm, I'm wondering if this doesn't screw something up. Essentially you're going to compute softmax on a row with only zeros...
The last row & last column are the attention scores of the last token with respect to the last token. Since the last token is masked out in our loss_mask, it doesn't matter, I think.
Also it's a row with only -inf, no?
No, you compute softmax; what should the result of the softmax of a row full of masked-out values be? It feels like that would return lots of NaNs.
Don't we fill it with -inf?
And the softmax of a row where all values are the same is just 1/n, no? Where would it cause NaNs?
You can try writing a test, but I'm pretty sure the actual result is 0 (with the current kernel).
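For reference, a quick check with plain PyTorch (not the fused masked-softmax kernel discussed above, which reportedly returns 0 for fully masked rows) shows where the NaNs would come from:

```python
import torch

# Softmax of a row where every score is the same finite value: uniform, 1/n.
uniform_row = torch.zeros(4)
print(torch.softmax(uniform_row, dim=-1))  # tensor([0.2500, 0.2500, 0.2500, 0.2500])

# Softmax of a fully masked row (all scores set to -inf): exp(-inf) sums to 0,
# so the normalization divides 0 by 0 and the plain PyTorch op returns NaNs.
masked_row = torch.full((4,), float('-inf'))
print(torch.softmax(masked_row, dim=-1))   # tensor([nan, nan, nan, nan])
```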
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
finetune_t0_non_causal_decoder.py (Outdated)
# Run non-causal decoder
is_causal=False,
causal_mask=~(causal_mask.bool()),
is_causal=True,
let's rename this file finetune_t0_causal_decoder then
What about just finetune_t0.py?
Right, but do we hardcode this every time? I'd rather have this one be the script for the causal decoder.
Added an argument prefixlm
* Tmp lossseq
* Efficient loss normalization
* Reuse variable
* Simplify division
* Add norm_target_loss arg
* Clarify loss on targets & remove kwarg
* Loss mask is already float
* Move norm to batch pipe
* Reshape loss mask
* Move view
thomasw21 left a comment:
Nice work! Some things I think shouldn't be in this PR.
'This is mostly used for prefix_lm training')
group.add_argument("--noise-density", type=float, default=None, help="Span corruption noise density")
group.add_argument("--mean-noise-span-length", type=int, default=None, help="Span corruption mean noise span length")
group.add_argument("--prefixlm", action='store_true', help="Whether to train a PrefixLM - To be used with finetune t0")
Yeah, actually let's remove that option. I don't think we've trained one successfully. We'll probably add it once people have shown that it works, but in another PR IMO.
if args.deepspeed:
    load_optimizer_states = False if args.no_load_optim else True
    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    load_optimizer_states = not args.no_load_optim
Just use no_load_optim directly in the method
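For illustration, a minimal sketch of the suggested inlining, reusing the call shown in the diff above (not necessarily the PR's final code):

```python
if args.deepspeed:
    # Pass the flag directly instead of going through an intermediate variable.
    loaded_dir, state_dict = model[0].load_checkpoint(
        load_dir, load_optimizer_states=not args.no_load_optim)
```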
# Set iteration.
if args.finetune or release:
if args.finetune or release or args.reset_progress:
Why is it that we didn't set finetune to True?
assert args.consumed_train_samples == 0
assert args.consumed_valid_samples == 0
if 'args' in state_dict:
if 'args' in state_dict and not args.reset_progress:
Can you add a comment? Typically this is only used because the metadata loading mechanism screws with us.
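A possible shape for the requested comment (the wording is a guess; only the condition itself comes from the diff above):

```python
# Skip the checkpoint's training-progress metadata when --reset-progress is set,
# because the metadata stored in the checkpoint would otherwise overwrite the
# iteration / consumed-sample counters we just reset.
if 'args' in state_dict and not args.reset_progress:
    ...
```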
# Build the indexed mapping if not exist.
if torch.distributed.get_rank() == 0:
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
AFAIK you added this code; I think it was for running tests or something.
Arf, probably because I wanted to use the data loader only... Maybe let's remove it for now, because we should assume that torch.distributed is always initialized, at least in Meg-DS IMO.
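If we do assume torch.distributed is always initialized in Meg-DS, the guard collapses back to a plain rank check; a sketch only, where build_index_mappings is a hypothetical placeholder for the actual builder:

```python
import torch

# Build the indexed mapping if it does not exist.
# Assumption: torch.distributed is always initialized in Meg-DS, so the
# is_initialized() check can be dropped and only rank 0 does the build.
if torch.distributed.get_rank() == 0:
    build_index_mappings()  # hypothetical placeholder for the actual builder
```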
assert counts[0].item() == (
    torch.distributed.get_world_size() //
    torch.distributed.get_world_size(group=mpu.get_tensor_model_parallel_group()))
if torch.distributed.is_initialized():
loss_mask = loss_mask.view(-1)
loss_mask = fast_normalize(loss_mask)
Maybe reshaping to the original structure is a better API? It's better to have the same shapes as the labels IMO (we will still flatten everything anyway).
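As a sketch of the shape-preserving variant being suggested (this is not the repo's fast_normalize; the name and exact behaviour here are assumptions):

```python
import torch

def normalize_loss_mask(loss_mask: torch.Tensor) -> torch.Tensor:
    """Normalize a [batch, seq_len] loss mask so each sequence's target
    positions sum to 1, keeping the same shape as the labels."""
    counts = loss_mask.sum(dim=-1, keepdim=True).clamp(min=1.0)  # avoid div by zero
    return loss_mask / counts

mask = torch.tensor([[0., 1., 1., 0.],
                     [0., 0., 1., 0.]])
print(normalize_loss_mask(mask))
# tensor([[0.0000, 0.5000, 0.5000, 0.0000],
#         [0.0000, 0.0000, 1.0000, 0.0000]])
```

The `.view(-1)` flatten could then happen right before the loss computation, so callers keep seeing label-shaped tensors.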
def test_finetune_t0_non_causal_decoder_get_batch_pipe(self):
def test_finetune_t0_get_batch_pipe(self):
Yeah, let's make it so that the script is causal-decoder specific. Let's figure out the non-causal decoder later on.
group.add_argument('--append-bos', action='store_true',
                   help='Append a bos token to the end of a document.')
group.add_argument('--prepend-space', action='store_true',
                   help='Prepends a space to the beginning of a document')
Add a mention of the context in which it's useful; typically it's when you compute targets.
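A hedged sketch of the expanded help text being asked for; the wording and the tokenization rationale are my assumptions, only the flag itself comes from the diff above:

```python
group.add_argument('--prepend-space', action='store_true',
                   help='Prepends a space to the beginning of a document. '
                        'Typically useful when computing targets, so the target '
                        'text is tokenized the same way it would be when it '
                        'follows the input.')
```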