home/categories/debugging/sgl-project-sglang-claude-skills-debug-distributed-hang-skill-md
debuggingtools

debug-distributed-hang

Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

sgl-project
maintainer
sgl-project
Actualizado 4/9/2026
Estrellas
25638
Forks
5275
quick start

Installation and usage

Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

Instalación
$ install --globalskills.sh
Uso

Después de instalarlo, puedes usar este skill ejecutando el siguiente comando en tu terminal:

skills use debug-distributed-hang