[Bugfix] Resolve Rank index out of range during BWD when sp_size < world_size in Ulysses #7809
+52 −0
Description
This PR addresses Issue #7672.
When `sequence_parallel_size` is smaller than `world_size` (e.g., `sp_size=2` on 4 GPUs), using `torch.distributed.nn.functional.all_gather` for loss aggregation triggers an `IndexError: tuple index out of range` during the backward pass. This occurs because the backward implementation indexes its gradient outputs with the global rank, which exceeds the bounds of the local sequence parallel group (which contains only `sp_size` elements).
Solution
I replaced the problematic `all_gather` aggregation with a mathematically equivalent and more robust `all_reduce` operation.
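A minimal sketch of the shape of the change (the helper name and the `sp_group`/`sp_size` arguments are illustrative, not the exact patch):

```python
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_reduce

def aggregate_sp_loss(loss: torch.Tensor, sp_group, sp_size: int) -> torch.Tensor:
    """Average the per-rank loss across the sequence-parallel group.

    all_reduce(SUM) is its own adjoint: autograd's backward pass is just
    another all_reduce, so no per-rank indexing occurs and the group size
    can never be confused with the world size.
    """
    return all_reduce(loss, op=dist.ReduceOp.SUM, group=sp_group) / sp_size
```

Since every rank ends up holding the same reduced value, dividing by `sp_size` reproduces exactly the mean that the gather-and-stack version computed.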
Verification
I added a new regression test, `TestUlyssesLossBackward`, in `tests/unit/sequence_parallelism/test_ulysses.py`.
1. Reproduction (Before Fix)

Confirmed the `IndexError` crash on ranks 2 and 3 with `sp_size=2` on a 4-GPU setup.
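For reference, a minimal standalone repro of the pre-fix failure mode (a hypothetical script, not the actual test; launch with `torchrun --nproc_per_node=4`):

```python
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_gather

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    # Two sequence-parallel groups of size 2 in a 4-GPU world: {0,1} and {2,3}.
    # new_group must be called by every rank for every group.
    groups = [dist.new_group([0, 1]), dist.new_group([2, 3])]
    sp_group = groups[rank // 2]
    loss = torch.ones(1, device="cuda", requires_grad=True)
    gathered = all_gather(loss, group=sp_group)  # tuple of sp_size tensors
    torch.stack(gathered).mean().backward()
    # Pre-fix: all_gather's backward indexes the gathered gradients with the
    # *global* rank, so ranks 2 and 3 hit IndexError: tuple index out of range.

if __name__ == "__main__":
    main()
```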
2. Verification (After Fix)

Verified the fix using the regression test logic on 4x RTX A6000 GPUs. The backward pass now completes on all ranks without error.
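The core assertion of the regression logic is roughly the following (a sketch; `sp_group` and `sp_size` come from the same two-group, 4-GPU setup as above):

```python
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_reduce

def check_loss_backward(sp_group, sp_size: int):
    loss = torch.ones(1, device="cuda", requires_grad=True)
    total = all_reduce(loss, op=dist.ReduceOp.SUM, group=sp_group) / sp_size
    total.backward()
    # The regression check: backward completes on *every* rank, including
    # ranks whose global index is >= sp_size.
    assert loss.grad is not None
```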