gemm-optimization

Name: gemm-optimization
Author: ROCm

Comprehensive guide to optimizing GEMM (General Matrix Multiply) kernels in FlyDSL on AMD CDNA GPUs. Covers tiling strategy, LDS ping-pong double buffering, XOR swizzling to avoid LDS bank conflicts, the A/B data prefetch pipeline, 2-stage software pipelining, MFMA instruction scheduling (hot_loop_scheduler), epilogue strategies (direct store vs. CShuffle), TFLOPS and bandwidth calculation, main-loop instruction-count analysis, and bottleneck identification from ATT traces. Based on the production preshuffle_gemm kernel. Usage: /gemm-optimization

View source framework-internals

maintainer

ROCm

Updated 4/9/2026

Stars

148

Forks

quick start

Installation and usage

Install

$ install --globalskills.sh

Usage

After installation, you can use this skill by running the following command in your terminal:

skills use gemm-optimization