<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Training on Wilson Wu</title><link>https://wilsonwu.me/en/tags/training/</link><description>Recent content in Training on Wilson Wu</description><generator>Hugo -- 0.127.0</generator><language>en-US</language><lastBuildDate>Sun, 28 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://wilsonwu.me/en/tags/training/index.xml" rel="self" type="application/rss+xml"/><item><title>PPO vs GRPO: Comparing and Choosing Between Two Mainstream LLM Reinforcement Learning Paradigms</title><link>https://wilsonwu.me/en/blog/2026/ppo-vs-grpo/</link><pubDate>Sun, 28 Jun 2026 00:00:00 +0000</pubDate><guid>https://wilsonwu.me/en/blog/2026/ppo-vs-grpo/</guid><description>Introduction In the post-training stage of large language models, reinforcement learning (RLHF / RLAIF) has become one of the key factors that determines the upper bound of model capability. Recently, GLM-5.2 switched its training algorithm from the GRPO (Group Relative Policy Optimization) used in GLM-5.1 to the more classical PPO (Proximal Policy Optimization), bringing a clear improvement in results.
This change is not a simple &amp;ldquo;algorithm replacement&amp;rdquo;. It is a systematic upgrade in stability, generalization, and training controllability.</description></item></channel></rss>