
Conversation


@timmoon10 timmoon10 commented Jan 24, 2026

Description

This PR adds a grouped linear op, which can be used in the grouped MLP block in Mixture-of-Experts models. It also adds an experimental fused operation for a grouped MLP block, using a CuTe DSL kernel that computes an MXFP8 grouped GEMM and SwiGLU.
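
A minimal reference sketch, in plain PyTorch, of the semantics a grouped linear op provides: the input is split along its first dimension into per-group chunks and each chunk is transformed by its own weight matrix. The function name, signature, and shapes below are illustrative only, not the API added in this PR.

import torch

def grouped_linear_reference(x, weights, split_sizes):
    # x:           (total_tokens, in_features), rows grouped by expert
    # weights:     one (out_features, in_features) weight per group/expert
    # split_sizes: number of rows of x assigned to each group
    chunks = torch.split(x, split_sizes, dim=0)
    outputs = [chunk @ w.t() for chunk, w in zip(chunks, weights)]
    return torch.cat(outputs, dim=0)

# Example: 3 experts with 4 + 2 + 6 tokens, hidden size 8 -> 16
x = torch.randn(12, 8)
weights = [torch.randn(16, 8) for _ in range(3)]
y = grouped_linear_reference(x, weights, [4, 2, 6])
assert y.shape == (12, 16)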

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add a grouped linear operation
  • Add a post-scaled SwiGLU op and add support for interleaving SwiGLU gate and linear units (a reference sketch follows this list)
  • Add a fused operation for grouped MLP
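
To make the post-scaled SwiGLU item concrete, here is a plain-PyTorch sketch of a post-scaled SwiGLU applied to interleaved gate/linear units. The block layout (gate and linear units alternating in blocks of interleave_size along the last dimension) and the per-token broadcast of probs are assumptions for illustration; the actual op may lay the units out differently.

import torch
import torch.nn.functional as F

def post_scaled_swiglu_reference(fc1_out, probs, interleave_size=32):
    # Assumed layout: gate and linear units alternate along the last dim
    # in blocks of `interleave_size`: [g0, l0, g1, l1, ...]
    hidden = fc1_out.shape[-1]
    blocks = fc1_out.reshape(-1, hidden // (2 * interleave_size), 2, interleave_size)
    gate, linear = blocks.unbind(dim=-2)       # remove gate/linear interleaving
    gate = gate.reshape(fc1_out.shape[0], hidden // 2)
    linear = linear.reshape(fc1_out.shape[0], hidden // 2)
    out = F.silu(gate) * linear                # SwiGLU
    return out * probs.unsqueeze(-1)           # post-scale by routing probabilities

# Example: 6 tokens, 128 interleaved units -> 64 outputs per token
x = torch.randn(6, 128)
probs = torch.rand(6)
assert post_scaled_swiglu_reference(x, probs).shape == (6, 64)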

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 30 commits January 7, 2026 00:15
Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
The test is too permissive, since it should still be failing: the weights are not properly interleaved yet.

Signed-off-by: Tim Moon <[email protected]>
@timmoon10 timmoon10 added the performance label Jan 24, 2026
timmoon10 added a commit to timmoon10/TransformerEngine that referenced this pull request Jan 24, 2026
timmoon10 added a commit that referenced this pull request Jan 25, 2026
* Expose option for custom op fusions

Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <[email protected]>

* Add tests for custom ops

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warnings and numerical test failures

Signed-off-by: Tim Moon <[email protected]>

* Tweak pattern matching logic with fixed window sizes

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use TF32 tols in fused op tests

Signed-off-by: Tim Moon <[email protected]>

* Review suggestion from @greptile-apps

Signed-off-by: Tim Moon <[email protected]>

* Backpropagate fixes from #2622

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@timmoon10 timmoon10 mentioned this pull request Jan 25, 2026
@timmoon10 timmoon10 changed the title [PyTorch] Prototype of fused operation for grouped MLP [PyTorch] Add grouped linear op and experimental fusion for grouped MLP Jan 25, 2026
Signed-off-by: Tim Moon <[email protected]>
@timmoon10 timmoon10 marked this pull request as ready for review January 25, 2026 01:00
@timmoon10 timmoon10 (Collaborator, Author) commented:

/te-ci pytorch L1


greptile-apps bot commented Jan 25, 2026

Greptile Overview

Greptile Summary

Adds grouped linear operations and experimental MXFP8 fusion for Mixture-of-Experts grouped MLP blocks.

Key Changes:

  • Introduced a GroupedLinear operation that applies multiple linear transformations by splitting the input along its first dimension, enabling efficient expert parallelism in MoE models
  • Refactored SwiGLU operations from activation.py into a dedicated swiglu.py module, adding ScaledSwiGLU with post-scaling and optional gate/linear unit interleaving
  • Implemented an experimental ForwardGroupedMLP_CuTeGEMMSwiGLU_MXFP8 fusion using a CuTe DSL kernel from cuDNN (requires SM100+) that fuses grouped GEMM + SwiGLU + post-scale into a single kernel
  • Full FP8/MXFP8 quantization support with rowwise/columnwise quantizers throughout the operation chain
  • Comprehensive test coverage including quantization variants, gradient checking, and fusion verification

Minor Issue:

  • Missing f prefix on f-string at line 90 of forward_grouped_mlp.py

Confidence Score: 4.5/5

  • Safe to merge after fixing the f-string syntax issue on line 90
  • Well-architected implementation with comprehensive test coverage. All previously identified issues have been resolved except one minor f-string syntax error. The grouped linear and fusion logic is sound, with proper quantization handling and backward pass implementation.
  • transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py requires fix on line 90

Important Files Changed

  • transformer_engine/pytorch/ops/basic/grouped_linear.py: new file implementing grouped linear operations for MoE models with proper quantization support
  • transformer_engine/pytorch/ops/basic/swiglu.py: SwiGLU operations refactored out of activation.py, plus ScaledSwiGLU with interleaving support
  • transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py: experimental CuTe DSL kernel fusion for MXFP8 grouped MLP; one f-string syntax issue on line 90

Sequence Diagram

sequenceDiagram
    participant User
    participant GroupedMLP as Grouped MLP Module
    participant FC1 as GroupedLinear (FC1)
    participant SwiGLU as ScaledSwiGLU
    participant FC2 as GroupedLinear (FC2)
    participant FusedOp as ForwardGroupedMLP_CuTeGEMMSwiGLU_MXFP8
    participant Quantizer as FP8 Quantizers

    User->>GroupedMLP: forward(input, split_sizes, probs)
    
    alt Fusion Available (MXFP8 + SM100+)
        GroupedMLP->>FusedOp: fuser_forward(input, split_sizes, probs)
        FusedOp->>Quantizer: quantize inputs & weights (MXFP8)
        Quantizer-->>FusedOp: quantized tensors
        FusedOp->>FusedOp: grouped_gemm_swiglu_kernel()
        Note over FusedOp: CuTe DSL kernel fuses:<br/>FC1 GEMM + SwiGLU + scaling
        FusedOp->>FC2: grouped GEMM for FC2
        FC2-->>FusedOp: output
        FusedOp-->>GroupedMLP: final output
    else Standard Path
        GroupedMLP->>FC1: forward(input, split_sizes)
        FC1->>FC1: split input by groups
        FC1->>Quantizer: quantize inputs/weights if FP8
        FC1->>FC1: general_grouped_gemm()
        FC1-->>GroupedMLP: FC1 output
        
        GroupedMLP->>SwiGLU: forward(FC1_out, probs)
        SwiGLU->>SwiGLU: remove gate interleaving
        SwiGLU->>SwiGLU: swiglu(gate, linear)
        SwiGLU->>SwiGLU: multiply by probs (post-scale)
        SwiGLU-->>GroupedMLP: scaled output
        
        GroupedMLP->>FC2: forward(SwiGLU_out, split_sizes)
        FC2->>FC2: split input by groups
        FC2->>Quantizer: quantize inputs/weights if FP8
        FC2->>FC2: general_grouped_gemm()
        FC2-->>GroupedMLP: final output
    end
    
    GroupedMLP-->>User: output

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <[email protected]>

quantizer.optimize_for_gemm = True
fc1_xs = tex.split_quantize(fc1_x, split_sizes_cpu, fc1_input_quantizers)

# Pack data tensors
A reviewer (Member) asked:

Maybe a silly question: is this packing and unpacking code just for verification, or will it be in the final version?

timmoon10 (Collaborator, Author) replied:

I'm working on getting rid of the concatenations, but the permutes are no-ops. The kernel API expects tensors with non-contiguous dims: https://github.com/NVIDIA/cudnn-frontend/blob/main/python/cudnn/grouped_gemm/grouped_gemm_swiglu/api.py#L240-L245
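
A quick illustration of why the permutes are free in PyTorch: permute only rewrites strides on a view of the same storage, so it yields the non-contiguous layout the kernel API expects without copying any data.

import torch

x = torch.randn(128, 256)
xt = x.permute(1, 0)                    # metadata-only: strides change, no copy
assert xt.data_ptr() == x.data_ptr()    # same underlying storage
assert not xt.is_contiguous()           # non-contiguous dims, as the kernel API expects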

)

# Fused kernel for FC1 + SwiGLU + post-scale
fc1_kernel_out = self.grouped_gemm_swiglu_kernel()(
A reviewer (Contributor) asked:

After the SwiGLU, the output usually needs to be multiplied by permuted_probs. Is this weighted SwiGLU supported?

Signed-off-by: Tim Moon <[email protected]>
Review suggestions from @greptile-apps

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <[email protected]>
@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

if swiglu.glu_interleave_size != 32:
    raise ValueError(
        "Fused kernel requires 32-wide GLU interleaving, "
        "but got glu_interleave_size={swiglu.glu_interleave_size}."
greptile-apps bot (Contributor) commented:
missing f prefix for f-string interpolation

Suggested change
"but got glu_interleave_size={swiglu.glu_interleave_size}."
f"but got glu_interleave_size={swiglu.glu_interleave_size}."


Labels

performance (Performance issues)

3 participants