home/categories/llm-ai/atrawog-bazzite-ai-plugins-bazzite-ai-jupyter-skills-grpo-skill-md
llm-aidata-ai

grpo

Group Relative Policy Optimization for reinforcement learning from human feedback. Covers GRPOTrainer, reward function design, policy optimization, and KL divergence constraints for stable RLHF training. Includes thinking-aware reward patterns.

atrawog
maintainer
atrawog
์—…๋ฐ์ดํŠธ๋จ 1/12/2026
์Šคํƒ€
0
ํฌํฌ
0
quick start

Installation and usage

Group Relative Policy Optimization for reinforcement learning from human feedback. Covers GRPOTrainer, reward function design, policy optimization, and KL divergence constraints for stable RLHF training. Includes thinking-aware reward patterns.

์„ค์น˜
$ install --globalskills.sh
์‚ฌ์šฉ๋ฒ•

์„ค์น˜ ํ›„ ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์—ฌ ์ด ์Šคํ‚ฌ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

skills use grpo