
RoPE kernel performance drops when using zero-length sequences #2599

@sudhakarsingh27

Description

Describe the bug

Customers are seeing performance regressions when they pad cu_seqlens with zero-length sequences to keep it a fixed-length tensor (so it stays compatible with CUDA Graphs) while using sequence packing (THD format). The attention module itself reportedly supports zero-length sequences, but in the profiles the RoPE kernel takes a huge chunk of time (much larger than attention itself) and drives performance down roughly 10x.
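
For illustration, here is a minimal sketch (plain PyTorch, hypothetical sizes) of what such a padded cu_seqlens looks like: the cumulative offsets of the real sequences, with the last offset repeated so the tensor always has the same number of entries.

```python
import torch

# Hypothetical example: three real sequences of lengths 5, 3, 7 packed in THD format.
seq_lens = torch.tensor([5, 3, 7], dtype=torch.int32)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32),
                        torch.cumsum(seq_lens, 0, dtype=torch.int32)])

# Pad to a fixed number of sequence slots (8 here) by repeating the last offset,
# i.e. by appending zero-length sequences, so the tensor shape never changes
# and stays CUDA-Graph friendly.
max_num_seqs = 8
pad = cu_seqlens[-1].repeat(max_num_seqs - seq_lens.numel())
cu_seqlens_padded = torch.cat([cu_seqlens, pad])
print(cu_seqlens_padded)  # tensor([ 0,  5,  8, 15, 15, 15, 15, 15, 15], dtype=torch.int32)
```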

Steps/Code to reproduce bug
Customers reported this bug when using the NeMo-RL and Megatron-LM libraries.
They create a fixed-length cu_seqlens when using THD format (where the number of sequences can vary from batch to batch), filling the unused slots with zero-length sequences.
When there are too many zero-length sequences, performance drops sharply. A sketch of a possible repro is included after the checklist below.

  • [ ] Create a proper repro for this first
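
The following is only a sketch of what such a repro might look like. It assumes apply_rotary_pos_emb is importable from transformer_engine.pytorch.attention and accepts tensor_format/fused/cu_seqlens arguments (as in recent TE versions); the exact import path and signature should be checked against the installed version, and the sizes and the random freqs tensor are placeholders (a real repro would build the frequencies with TE's RotaryPositionEmbedding).

```python
import torch

# Assumption: this import path and signature match the installed Transformer Engine
# version; adjust if apply_rotary_pos_emb lives elsewhere in your TE release.
from transformer_engine.pytorch.attention import apply_rotary_pos_emb

num_heads, head_dim = 16, 128

# A few real sequences, then pad cu_seqlens out to a large fixed slot count so that
# most entries are zero-length sequences (the situation that triggers the slowdown).
seq_lens = torch.tensor([512, 384, 640], dtype=torch.int32)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32),
                        torch.cumsum(seq_lens, 0, dtype=torch.int32)])
max_num_seqs = 1024
pad = cu_seqlens[-1].repeat(max_num_seqs - seq_lens.numel())
cu_seqlens_padded = torch.cat([cu_seqlens, pad]).cuda()

total_tokens = int(cu_seqlens_padded[-1])
t = torch.randn(total_tokens, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
# Placeholder rotary frequencies; a real repro would obtain these from
# TE's RotaryPositionEmbedding(head_dim)(max_seq_len).
freqs = torch.randn(2048, 1, 1, head_dim, device="cuda", dtype=torch.float32)

# Time the RoPE call in isolation; the report is that runtime grows with the number
# of zero-length padding slots in cu_seqlens_padded rather than with total_tokens.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
out = apply_rotary_pos_emb(t, freqs, tensor_format="thd", fused=True,
                           cu_seqlens=cu_seqlens_padded)
end.record()
torch.cuda.synchronize()
print(f"fused RoPE (thd, {max_num_seqs} sequence slots): {start.elapsed_time(end):.3f} ms")
```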

Expected behavior

The RoPE kernel should handle zero-length (padding) sequences in cu_seqlens with negligible overhead, so that padding cu_seqlens to a fixed length for CUDA Graphs does not degrade end-to-end performance.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud (specify cloud provider - AWS, Azure, GCP, Colab)]
  • Method of Transformer Engine install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version
  • Transformer Engine version
  • CUDA version
  • CUDNN version

Device details

  • GPU model

Additional context

Add any other context about the problem here.
