GRPO | Wilson Wu

Introduction

In the post-training stage of large language models, reinforcement learning (RLHF / RLAIF) has become one of the key factors that determine the upper bound of model capability. Recently, GLM-5.2 switched its training algorithm from the GRPO (Group Relative Policy Optimization) used in GLM-5.1 to the more classical PPO (Proximal Policy Optimization), bringing a clear improvement in results. This change is not a simple “algorithm replacement”; it is a systematic upgrade in stability, generalization, and training controllability....