grpo
Group Relative Policy Optimization for reinforcement learning from human feedback. Covers GRPOTrainer, reward function design, policy optimization, and KL divergence constraints for stable RLHF training. Includes thinking-aware reward patterns.
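The description names three GRPO building blocks: a thinking-aware reward, group-relative policy optimization and a KL constraint. The dependency-free Python sketch below shows one plausible form of each. It is not taken from this skill. The `<think>...</think>` output format, the `answers` argument, the reward values (0.5 for format, 1.0 for a correct answer) and every function name are hypothetical.

```python
import math
import re
from statistics import mean, pstdev

# Hypothetical output format: reasoning inside <think>...</think>, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def thinking_aware_reward(completions, answers, **kwargs):
    """Hypothetical reward: 0.5 for a well-formed thinking block, +1.0 if the answer matches."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = THINK_PATTERN.match(completion.strip())
        if match is None:
            rewards.append(0.0)  # malformed output: no thinking block before the answer
            continue
        final_answer = match.group(2).strip()
        reward = 0.5  # format bonus for showing the reasoning
        if final_answer == answer.strip():
            reward += 1.0  # correctness bonus
        rewards.append(reward)
    return rewards


def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion relative to the others sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_estimate(logp_policy, logp_ref):
    """Per-token estimator of KL(policy || ref): exp(d) - d - 1, where d = logp_ref - logp_policy.

    It is always >= 0 and equals 0 when the policy and reference agree.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


if __name__ == "__main__":
    # One group of four completions sampled for the prompt "What is 2 + 2?"
    group = [
        "<think>2 + 2 = 4</think> 4",
        "<think>guessing</think> 5",
        "4",
        "<think>add them</think> 4",
    ]
    rewards = thinking_aware_reward(group, answers=["4"] * len(group))
    print(rewards)  # [1.5, 0.5, 0.0, 1.5]
    print(group_relative_advantages(rewards))
    print(kl_estimate(-1.0, -2.0))
```

Because advantages are normalized within each group, GRPO needs no learned value function. Each completion is pushed up or down only relative to its siblings for the same prompt. The KL term keeps the policy close to the reference model. The reward function loosely follows TRL's convention for custom reward functions passed to `GRPOTrainer` via `reward_funcs`: it returns one float per completion, and extra dataset columns arrive as keyword arguments. It assumes plain-text rather than conversational completions. Check the TRL documentation for the exact signature in your version.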
Maintainer: atrawog
Updated: 1/12/2026
Stars: 0
Forks: 0
Quick start
Installation and usage
Installation
$ install --globalskills.sh
Usage
After installation, use this skill by running the following command in your terminal:
skills use grpo