cutlass-skill

Write, debug, and optimize CUTLASS and CuTeDSL GPU kernels using local source code, examples, and header references. Use when the user mentions CUTLASS, CuTe, CuTeDSL, cute::Layout, cute::Tensor, TiledMMA, TiledCopy, CollectiveMainloop, CollectiveEpilogue, GEMM kernel, grouped GEMM, sparse GEMM, flash attention CUTLASS, Blackwell GEMM, Hopper GEMM, FP8 GEMM, FP4 GEMM, blockwise scaling, MoE GEMM, StreamK, warp specialization CUTLASS, TMA CUTLASS, epilogue fusion, EVT (Epilogue Visitor Tree), pycute, Layout algebra, Swizzle pattern, GemmUniversal, KernelSchedule, EpilogueSchedule, CUTLASS collective builder, CUTLASS pipeline, or asks about writing high-performance CUDA kernels with CUTLASS/CuTe templates. Also use when the user wants to understand CUTLASS source code structure, compile CUTLASS examples, or debug CUTLASS template errors.

Maintainer: slowlyC
Updated: 3/19/2026
Stars: 82
Forks: 7
Quick start

Installation and usage

Installation
$ install --globalskills.sh
Usage

Once installed, you can use this skill by running the following command in your terminal:

skills use cutlass-skill