debug-distributed-hang

Debug hangs in SGLang distributed inference (TP/PP/DP/EP). Covers identifying where a run hangs via py-spy, the watchdog, or a CUDA coredump; per-rank logging to find state divergence; a binary-search methodology for locating the first point of divergence; and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

Maintainer: sgl-project
Updated: 4/9/2026
Stars: 25638
Forks: 5275
quick start

Installation and usage

Installation
$ install --globalskills.sh
Usage

After installation, run the following command in your terminal to use this skill:

skills use debug-distributed-hang
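
As a rough illustration of the binary-search methodology named in the description, the sketch below finds the first event index at which per-rank logs disagree. It is not taken from the skill itself: the function name `first_divergence` and the log format (one list of event strings per rank) are hypothetical, and it assumes each rank has already written out the sequence of collective calls it issued.

```python
from typing import Optional, Sequence


def first_divergence(per_rank_events: Sequence[Sequence[str]]) -> Optional[int]:
    """Return the first event index where ranks disagree, or None if all match.

    per_rank_events[r] is the (hypothetical) ordered event log of rank r,
    for example the names of the collectives it issued.
    """

    def prefixes_agree(k: int) -> bool:
        # True when every rank logged the same first k events.
        reference = list(per_rank_events[0][:k])
        return all(list(events[:k]) == reference for events in per_rank_events)

    longest = max(len(events) for events in per_rank_events)
    if prefixes_agree(longest):
        return None  # identical logs, no divergence recorded

    # Invariant: prefixes_agree(lo) is True and prefixes_agree(hi) is False.
    # The predicate is monotone in k, which is what makes bisection valid.
    lo, hi = 0, longest
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefixes_agree(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Rank 1 issues a different second collective than rank 0.
logs = [
    ["all_reduce", "all_gather", "all_reduce"],
    ["all_reduce", "all_reduce"],
]
print(first_divergence(logs))
```

With in-memory logs a linear scan would do just as well; bisection pays off when each check is expensive, for instance when every probe means rerunning the job with logging enabled up to a given step.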