home/categories/debugging/sgl-project-sglang-claude-skills-debug-distributed-hang-skill-md
debuggingtools

debug-distributed-hang

Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

sgl-project
maintainer
sgl-project
آخر تحديث 4/9/2026
النجوم
25638
التفرعات
5275
quick start

Installation and usage

Debug hanging issues in SGLang distributed inference (TP/PP/DP/EP). Covers identifying hang locations via py-spy/watchdog/cuda coredump, per-rank logging to find state divergence, binary-search methodology for locating the first diverge point, and fix patterns. Use when a multi-GPU SGLang run hangs, freezes, or times out during collective operations.

التثبيت
$ install --globalskills.sh
الاستخدام

بعد التثبيت، يمكنك استخدام هذه المهارة بتشغيل الأمر التالي في الطرفية:

skills use debug-distributed-hang