home/categories/framework-internals/rocm-flydsl-claude-skills-gemm-optimization-skill-md
framework-internalsdevelopment

gemm-optimization

Comprehensive guide to optimizing GEMM (General Matrix Multiply) kernels in FlyDSL on AMD CDNA GPUs. Covers tiling strategy, LDS ping-pong double-buffer, XOR bank-conflict swizzle, A/B data prefetch pipeline, 2-stage software pipelining, MFMA instruction scheduling (hot_loop_scheduler), epilogue strategies (direct store vs CShuffle), TFLOPS/bandwidth calculation, main-loop instruction count analysis, and bottleneck identification from ATT traces. Based on the production preshuffle_gemm kernel. Usage: /gemm-optimization

ROCm
maintainer
ROCm
Actualizado 4/9/2026
Estrellas
148
Forks
36
quick start

Installation and usage

Comprehensive guide to optimizing GEMM (General Matrix Multiply) kernels in FlyDSL on AMD CDNA GPUs. Covers tiling strategy, LDS ping-pong double-buffer, XOR bank-conflict swizzle, A/B data prefetch pipeline, 2-stage software pipelining, MFMA instruction scheduling (hot_loop_scheduler), epilogue strategies (direct store vs CShuffle), TFLOPS/bandwidth calculation, main-loop instruction count analysis, and bottleneck identification from ATT traces. Based on the production preshuffle_gemm kernel. Usage: /gemm-optimization

Instalación
$ install --globalskills.sh
Uso

Después de instalarlo, puedes usar este skill ejecutando el siguiente comando en tu terminal:

skills use gemm-optimization