Training on Wilson Wu

https://wilsonwu.me/en/tags/training/
Recent content in Training on Wilson Wu
Sun, 28 Jun 2026 00:00:00 +0000

PPO vs GRPO: Comparing and Choosing Between Two Mainstream LLM Reinforcement Learning Paradigms
https://wilsonwu.me/en/blog/2026/ppo-vs-grpo/
Sun, 28 Jun 2026 00:00:00 +0000

Introduction

In the post-training stage of large language models, reinforcement learning (RLHF / RLAIF) has become one of the key factors determining the upper bound of model capability. Recently, GLM-5.2 switched its training algorithm from the GRPO (Group Relative Policy Optimization) used in GLM-5.1 to the more classical PPO (Proximal Policy Optimization), bringing a clear improvement in results. This change is not a simple “algorithm replacement”; it is a systematic upgrade in stability, generalization, and training controllability.