gemm-optimization

Name: gemm-optimization
Author: ROCm

Comprehensive guide to optimizing GEMM (General Matrix Multiply) kernels in FlyDSL on AMD CDNA GPUs. Covers tiling strategy, LDS ping-pong double-buffer, XOR bank-conflict swizzle, A/B data prefetch pipeline, 2-stage software pipelining, MFMA instruction scheduling (hot_loop_scheduler), epilogue strategies (direct store vs CShuffle), TFLOPS/bandwidth calculation, main-loop instruction count analysis, and bottleneck identification from ATT traces. Based on the production preshuffle_gemm kernel. Usage: /gemm-optimization
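As a rough illustration of the TFLOPS/bandwidth calculation mentioned in the description, the sketch below uses the standard GEMM operation count of 2·M·N·K floating-point operations. The function names are hypothetical and are not part of FlyDSL. The bandwidth figure assumes 2-byte elements (for example fp16 or bf16) and that A and B are each read once and C is written once, so it is a lower bound on actual memory traffic.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an M x N x K GEMM (one multiply + one add per MAC)."""
    flops = 2.0 * m * n * k
    return flops / seconds / 1e12


def gemm_min_bandwidth_gbs(m: int, n: int, k: int, seconds: float,
                           bytes_a: int = 2, bytes_b: int = 2,
                           bytes_c: int = 2) -> float:
    """Minimum effective bandwidth in GB/s.

    Assumption: A (M x K) and B (K x N) are each read once and
    C (M x N) is written once; element sizes default to 2 bytes.
    """
    bytes_moved = m * k * bytes_a + k * n * bytes_b + m * n * bytes_c
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: a 4096^3 GEMM that took 1 ms.
    m = n = k = 4096
    elapsed = 1e-3
    print(f"{gemm_tflops(m, n, k, elapsed):.2f} TFLOPS")
    print(f"{gemm_min_bandwidth_gbs(m, n, k, elapsed):.2f} GB/s")
```

Comparing both numbers against the device's peak compute and peak memory bandwidth gives a first indication of whether a kernel is compute-bound or memory-bound.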

View source: framework-internals
Maintainer: ROCm
Updated: 4/9/2026
Stars: 148

quick start

Installation and usage

Install

$ install --globalskills.sh

Usage

After installation, you can use this skill by running the following command in your terminal:

skills use gemm-optimization