Conversation
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0]]]]
Suggested change:
- [0, 0, 0, 0, 0, 0, 0]]]]
+ [0, 0, 0, 0, 0, 0, 1]]]]
I don't think there is a 1, because the last row & column are 100% padding.
Hmm, I'm wondering if this doesn't screw something up. Essentially you're going to compute softmax on a row with only zeros...
The last row & last column are the attention scores of the last token with respect to the last token. Since the last token is masked out in our loss_mask, it doesn't matter, I think.
Also it's a row with only -inf, no?
No, you compute softmax; what should the result of the softmax of a row full of masked-out values be? It feels like that would return lots of NaNs.
Don't we fill it with -inf?
And the softmax of a row where all values are the same is just 1/n, no? Where would it cause NaNs?
You can try writing a test, but I'm pretty sure the actual result is 0 (with the current kernel).
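For reference, a quick check with plain PyTorch (not the fused masked-softmax kernel discussed above, which reportedly returns 0 for fully masked rows) shows where the NaNs would come from:

```python
import torch

# Softmax of a row where every score is the same finite value: uniform, 1/n.
uniform_row = torch.zeros(4)
print(torch.softmax(uniform_row, dim=-1))  # tensor([0.2500, 0.2500, 0.2500, 0.2500])

# Softmax of a fully masked row (all scores set to -inf): exp(-inf) sums to 0,
# so the normalization divides 0 by 0 and the plain PyTorch op returns NaNs.
masked_row = torch.full((4,), float('-inf'))
print(torch.softmax(masked_row, dim=-1))   # tensor([nan, nan, nan, nan])
```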
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
finetune_t0_non_causal_decoder.py (Outdated)
# Run non-causal decoder
is_causal=False,
causal_mask=~(causal_mask.bool()),
is_causal=True,
let's rename this file finetune_t0_causal_decoder then
What about just finetune_t0.py?
Right, but do we hardcode this every time? I'd rather have this one be the script for the causal decoder.
Added an argument prefixlm
* Tmp lossseq
* Efficient loss normalization
* Reuse variable
* Simplify division
* Add norm_target_loss arg
* Clarify loss on targets & remove kwarg
* Loss mask is already float
* Move norm to batch pipe
* Reshape loss mask
* Move view
thomasw21 left a comment:
Nice work! Some things I think shouldn't be in this PR.
'This is mostly used for prefix_lm training')
group.add_argument("--noise-density", type=float, default=None, help="Span corruption noise density")
group.add_argument("--mean-noise-span-length", type=int, default=None, help="Span corruption mean noise span length")
group.add_argument("--prefixlm", action='store_true', help="Whether to train a PrefixLM - To be used with finetune t0")
Yeah, actually let's remove that option. I don't think we've trained one successfully. We'll probably add it once people have shown that it works, but in another PR IMO.
if args.deepspeed:
    load_optimizer_states = False if args.no_load_optim else True
    loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
    load_optimizer_states = not args.no_load_optim
Just use no_load_optim directly in the method
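For illustration, a minimal sketch of the suggested inlining, reusing the call shown in the diff above (not necessarily the PR's final code):

```python
if args.deepspeed:
    # Pass the flag directly instead of going through an intermediate variable.
    loaded_dir, state_dict = model[0].load_checkpoint(
        load_dir, load_optimizer_states=not args.no_load_optim)
```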
# Set iteration.
if args.finetune or release:
if args.finetune or release or args.reset_progress:
Why is it that we didn't set finetune to True?
assert args.consumed_train_samples == 0
assert args.consumed_valid_samples == 0
if 'args' in state_dict:
if 'args' in state_dict and not args.reset_progress:
Can you add a comment? Typically this is only used because the metadata loading mechanism screws with us.
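A possible shape for the requested comment (the wording is a guess; only the condition itself comes from the diff above):

```python
# Skip the checkpoint's training-progress metadata when --reset-progress is set,
# because the metadata stored in the checkpoint would otherwise overwrite the
# iteration / consumed-sample counters we just reset.
if 'args' in state_dict and not args.reset_progress:
    ...
```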
# Build the indexed mapping if not exist.
if torch.distributed.get_rank() == 0:
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
AFAIK you added this code; I think it was for running tests or something.
Arf, probably because I wanted to use the data loader only... Maybe let's remove it for now, because we should assume that torch.distributed is always initialized, at least in Meg-DS IMO.
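If we do assume torch.distributed is always initialized in Meg-DS, the guard collapses back to a plain rank check; a sketch only, where build_index_mappings is a hypothetical placeholder for the actual builder:

```python
import torch

# Build the indexed mapping if it does not exist.
# Assumption: torch.distributed is always initialized in Meg-DS, so the
# is_initialized() check can be dropped and only rank 0 does the build.
if torch.distributed.get_rank() == 0:
    build_index_mappings()  # hypothetical placeholder for the actual builder
```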
assert counts[0].item() == (
    torch.distributed.get_world_size() //
    torch.distributed.get_world_size(group=mpu.get_tensor_model_parallel_group()))
if torch.distributed.is_initialized():
loss_mask = loss_mask.view(-1)
loss_mask = fast_normalize(loss_mask)
Maybe reshaping to the original structure is a better API? It's better to have the same shapes as the labels IMO (we will still flatten everything anyway).
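As a sketch of the shape-preserving variant being suggested (this is not the repo's fast_normalize; the name and exact behaviour here are assumptions):

```python
import torch

def normalize_loss_mask(loss_mask: torch.Tensor) -> torch.Tensor:
    """Normalize a [batch, seq_len] loss mask so each sequence's target
    positions sum to 1, keeping the same shape as the labels."""
    counts = loss_mask.sum(dim=-1, keepdim=True).clamp(min=1.0)  # avoid div by zero
    return loss_mask / counts

mask = torch.tensor([[0., 1., 1., 0.],
                     [0., 0., 1., 0.]])
print(normalize_loss_mask(mask))
# tensor([[0.0000, 0.5000, 0.5000, 0.0000],
#         [0.0000, 0.0000, 1.0000, 0.0000]])
```

The `.view(-1)` flatten could then happen right before the loss computation, so callers keep seeing label-shaped tensors.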
def test_finetune_t0_non_causal_decoder_get_batch_pipe(self):
def test_finetune_t0_get_batch_pipe(self):
Yeah, let's make it so that the script is causal-decoder specific. Let's figure out the non-causal decoder later on.
group.add_argument('--append-bos', action='store_true',
                   help='Append a bos token to the end of a document.')
group.add_argument('--prepend-space', action='store_true',
                   help='Prepends a space to the beginning of a document')
Add a mention of the context in which it's useful; typically it's when you compute targets.
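A hedged sketch of the expanded help text being asked for; the wording and the tokenization rationale are my assumptions, only the flag itself comes from the diff above:

```python
group.add_argument('--prepend-space', action='store_true',
                   help='Prepends a space to the beginning of a document. '
                        'Typically useful when computing targets, so the target '
                        'text is tokenized the same way it would be when it '
                        'follows the input.')
```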