home/categories/debugging/sgl-project-sglang-claude-skills-debug-distributed-hang-skill-md
debuggingtools

debug-distributed-hang

Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

sgl-project
maintainer
sgl-project
更新日 4/9/2026
スター
25638
フォーク
5275
quick start

Installation and usage

Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

インストール
$ install --globalskills.sh
使い方

インストール後、ターミナルで以下のコマンドを実行してこのスキルを使用できます:

skills use debug-distributed-hang