PPO vs GRPO: Comparing and Choosing Between Two Mainstream LLM Reinforcement Learning Paradigms

Introduction

In the post-training stage of large language models, reinforcement learning (RLHF / RLAIF) has become one of the key factors determining the upper bound of model capability. Recently, GLM-5.2 switched its training algorithm from the GRPO (Group Relative Policy Optimization) used in GLM-5.1 to the more classical PPO (Proximal Policy Optimization), bringing a clear improvement in results. This change is not a simple “algorithm replacement”; it is a systematic upgrade in stability, generalization, and training controllability....

June 28, 2026 · 5 min