Describe the bug
Customers are seeing performance regressions when they pad `cu_seqlens` with zero-length sequences to keep it at a fixed length, which is needed to keep CUDA Graphs happy when working with sequence packing (THD format). The attention module itself handles zero-length sequences, but in the profiles the RoPE kernel takes a disproportionately large chunk of time (much larger than attention itself) and drives performance down by roughly 10x.
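For illustration, here is a minimal sketch (plain PyTorch; the shapes and sequence lengths are made up, not taken from the customer workload) of what such a fixed-length `cu_seqlens` looks like: the real sequence boundaries come first, and the tail repeats the last offset so the remaining entries describe zero-length sequences.

```python
import torch

# Illustrative only: 3 real sequences packed into one THD batch, padded out
# to a fixed num_seqs of 8 so the tensor shape stays static for CUDA Graphs.
real_seqlens = torch.tensor([128, 64, 32], dtype=torch.int32, device="cuda")
max_num_seqs = 8

cu_seqlens = torch.zeros(max_num_seqs + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1 : len(real_seqlens) + 1] = torch.cumsum(real_seqlens, dim=0)
# Repeat the final offset for the tail -> zero-length padding sequences.
cu_seqlens[len(real_seqlens) + 1 :] = cu_seqlens[len(real_seqlens)]
# cu_seqlens: [0, 128, 192, 224, 224, 224, 224, 224, 224]
```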
Steps/Code to reproduce bug
Customers reported this bug when using the NeMo-RL and Megatron-LM libraries.
They create a fixed-length `cu_seqlens` when using the THD format (where `num_seqs` can vary from batch to batch).
When there are too many zero-length padding sequences, performance drops significantly.
- [ ] Create a proper repro for this first (a rough sketch of the pattern follows below)
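Until a proper repro is attached, here is a rough sketch of the pattern. Assumptions that may not match the installed TE version or the customer workload: the fused RoPE path is reached through `apply_rotary_pos_emb` from `transformer_engine.pytorch.attention` with `tensor_format="thd"`, `fused=True`, and a `cu_seqlens` argument; the shapes, dtypes, and random `freqs` are purely illustrative.

```python
import torch
from transformer_engine.pytorch.attention import apply_rotary_pos_emb  # assumed import path

total_tokens, heads, head_dim = 224, 16, 128

# THD-format activations: [total_tokens, heads, head_dim]
t = torch.randn(total_tokens, heads, head_dim, device="cuda", dtype=torch.bfloat16)
# Rotary frequencies: [max_seq_len, 1, 1, head_dim] (random values; perf-only repro)
freqs = torch.randn(2048, 1, 1, head_dim, device="cuda", dtype=torch.float32)

# Fixed-length cu_seqlens: 3 real sequences followed by zero-length padding entries.
cu_seqlens = torch.tensor(
    [0, 128, 192, 224, 224, 224, 224, 224, 224],
    dtype=torch.int32, device="cuda",
)

# Expectation: runtime should be driven by the 224 real tokens, not by how many
# trailing zero-length sequences cu_seqlens carries.
out = apply_rotary_pos_emb(
    t, freqs, tensor_format="thd", fused=True, cu_seqlens=cu_seqlens
)
```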
Expected behavior
Padding `cu_seqlens` with zero-length sequences should add negligible overhead; the RoPE kernel should not become the dominant cost relative to attention.
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]
- Method of Transformer Engine install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide
`docker pull` & `docker run` commands used
Environment details
If an NVIDIA docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
- Transformer Engine version
- CUDA version
- CUDNN version
Device details
- GPU model
Additional context
Add any other context about the problem here.