Jiaqi’s Tech Notes

RL: DeepSeek R1 vs Kimi K1.5

Tags: RL, LLM

Published: March 31, 2025

Modified: April 20, 2026

Suggested reading order:

  1. RL: DeepSeek R1 vs Kimi K1.5
  2. Policy Gradient, PPO, GRPO
  3. Kimi K1.5 Technical Report
  4. LLM RL Baseline

Summary

In my view, these methods differ in form but are similar in spirit. The RL optimization parts of K1.5 and R1 can still be understood under a broader policy-gradient-style framework.

GRPO can be viewed this way because PPO itself belongs to the policy gradient family, and one important difference between GRPO and PPO lies in how the advantage-like training signal is constructed. In particular, GRPO replaces the usual critic-based value estimate with a relative comparison among multiple sampled outputs for the same prompt.

As for K1.5, if we break down its optimization procedure, it still contains elements such as a surrogate-style objective, an advantage-like weighting term, and gradient-based parameter updates. These are all central ingredients in policy-gradient-style optimization. Although the K1.5 authors motivate the method from policy mirror descent, it is still reasonable, from an optimization perspective, to understand it together with policy-gradient methods within a broader common framework.

Multiple Samples for the Same Question

At the same time, both K1.5 and GRPO sample multiple outputs for the same prompt. These samples are then used to construct quantities such as relative advantages or baselines, which avoids the need to train and maintain a separate critic model (i.e., a value function).

This is one reason GRPO is often viewed as an innovation: instead of relying on a learned value baseline, it derives a baseline-like signal by comparing multiple sampled outputs for the same question.

*Figure: GRPO baseline illustration. Source: A vision researcher’s guide to some RL stuff: PPO & GRPO.*
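As a toy illustration of the group-baseline idea (numbers invented for the example), the mean reward over the group plays the role a learned value baseline would otherwise play:

```python
# Toy example (invented numbers): with several sampled answers to the same
# question, the group-mean reward serves as the baseline instead of a critic.
rewards = [1.0, 0.0, 0.0, 1.0]           # e.g. correctness of 4 sampled answers
baseline = sum(rewards) / len(rewards)   # group mean replaces V(s)
advantages = [r - baseline for r in rewards]
print(advantages)  # [0.5, -0.5, -0.5, 0.5]
```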

From the Perspective of Its Components

| | Policy Gradient | PPO | GRPO | Kimi K1.5 |
|---|---|---|---|---|
| policy model \(\pi_\theta\) | Required: the model being optimized. | Required: the model being optimized. | Required: the model being optimized. | Required: the model being optimized. |
| reference policy model \(\pi_{\theta_k}\) | Not required. | Required. | Required. | Required. |
| reward model | Required. | Required. | Required. | Required. |
| value function \(V_{\phi_k}\) (critic model) | Required: it estimates the expected future return, and methods such as GAE use it to compute the advantage; it must be continuously updated and kept accurate. | Required: same role as in policy gradient. | Not required: the advantage computation avoids this path. | Not required: since the gradient can be written explicitly, the baseline-like term can be approximated by the sample mean. |
| rewards-to-go \(\hat{R}_t\) | Required, for updating the value function. | Required, for updating the value function; exactly how the reward is distributed back to each token depends on the implementation. | Not needed, since there is no critic. | Not needed, since there is no critic. |
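For reference, the critic-based path that GRPO and K1.5 avoid looks roughly like the following; this is a generic GAE sketch, not code from either paper, and the names are illustrative:

```python
# Hedged sketch of Generalized Advantage Estimation (GAE): the critic's value
# estimates V(s_t) are combined with rewards to form advantage estimates.
def gae(rewards, values, gamma=0.99, lam=0.95):
    # `values` has one extra entry: V(s_0), ..., V(s_T), where the last entry
    # bootstraps the return beyond the sampled trajectory.
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted sum
        advantages[t] = running
    return advantages
```

This is exactly the machinery that requires the critic to be "continuously updated and kept accurate": if the \(V\) estimates drift, every advantage computed from them drifts too.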

From the Perspective of the Key Computations

**Policy Gradient**

- objective function: \(J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]\)
- surrogate objective function: none
- gradient: \(\nabla_\theta J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, A^{\pi_\theta}(s_t,a_t)\big]\)
- advantage: \(A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\)
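The score-function estimator above can be sketched numerically. The vectors below stand in for \(\nabla_\theta \log \pi_\theta(a_t\mid s_t)\) and are purely illustrative; in practice an autodiff framework produces them:

```python
# Hedged sketch of the REINFORCE-style estimator: the policy gradient is an
# average of advantage-weighted grad-log-probabilities.
def policy_gradient_estimate(grad_log_probs, advantages):
    dim = len(grad_log_probs[0])
    g = [0.0] * dim
    for glp, adv in zip(grad_log_probs, advantages):
        for i in range(dim):
            g[i] += adv * glp[i]          # weight each grad-log-prob by A
    n = len(grad_log_probs)
    return [gi / n for gi in g]           # Monte Carlo average
```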
**PPO**

- objective function: \(J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]\)
- surrogate objective function: \(L(\theta)=\mathbb{E}_t\left[\min\left(\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}A_t,\ \operatorname{clip}\left(\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)},1-\epsilon,1+\epsilon\right)A_t\right)\right]\)
- gradient: no separate closed-form expression; obtained by differentiating the surrogate (in practice, via automatic differentiation)
- advantage: \(A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)\)   (typically estimated with GAE)
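The clipped term of the surrogate can be written out for a single action; this is a minimal sketch, and the scalar interface and function name are my own:

```python
import math

# Hedged sketch of PPO's clipped surrogate term for one action.
def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)             # πθ / πθ_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip to [1-ε, 1+ε]
    return min(ratio * advantage, clipped * advantage)
```

Taking the `min` caps how much a single update can gain by pushing the probability ratio outside \([1-\epsilon,\ 1+\epsilon]\), which is what keeps the policy close to \(\pi_{\theta_{\mathrm{old}}}\).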
**GRPO**

- objective function: \(J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]\)
- surrogate objective function: \(L(\theta)=\mathbb{E}_{o}\left[\min\left(\frac{\pi_\theta(o\mid q)}{\pi_{\theta_{\mathrm{old}}}(o\mid q)}A_o,\ \operatorname{clip}\left(\frac{\pi_\theta(o\mid q)}{\pi_{\theta_{\mathrm{old}}}(o\mid q)},1-\epsilon,1+\epsilon\right)A_o\right)\right]\)   (ignoring the KL term)
- gradient: no separate closed-form expression; obtained by differentiating the surrogate (in practice, via automatic differentiation)
- advantage: \(\hat A_i=\dfrac{r_i-\bar r}{\operatorname{std}(\{r_1,\ldots,r_G\})}\)   (multiple samples for the same question; no critic model; no GAE)
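A minimal sketch of this group-normalized advantage follows; I use the population standard deviation with a small epsilon for numerical safety, and implementations may differ in these details:

```python
# Hedged sketch of GRPO's group-relative advantage: z-score each reward
# within the group of G samples drawn for the same question.
def grpo_advantages(rewards, eps=1e-8):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```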
**Kimi K1.5**

- objective function: \(\max_\theta\ \mathbb{E}_{(y,z)\sim\pi_\theta}[r(x,y,y^*)]-\tau\,\mathrm{KL}\big(\pi_\theta(x)\,\|\,\pi_{\theta_i}(x)\big)\)
- surrogate objective function: \(L(\theta)=\mathbb{E}_{(x,y^*)\sim\mathcal D}\left[\mathbb{E}_{(y,z)\sim\pi_{\theta_i}}\left[\left(r(x,y,y^*)-\tau\log Z-\tau\log\frac{\pi_\theta(y,z\mid x)}{\pi_{\theta_i}(y,z\mid x)}\right)^2\right]\right]\)
- gradient: \(\dfrac{1}{k}\displaystyle\sum_{j=1}^{k}\left[(r(x,y_j,y^*)-\bar r)\cdot\nabla_\theta\log\pi_\theta(y_j,z_j\mid x)-\frac{\tau}{2}\nabla_\theta\left(\log\frac{\pi_\theta(y_j,z_j\mid x)}{\pi_{\theta_i}(y_j,z_j\mid x)}\right)^2\right]\)   (where the sample-mean reward \(\bar r\) approximates \(\tau\log Z\))
- advantage: \(r-\bar r\)   (multiple samples for the same question; no critic model; no GAE)
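Since \(\nabla_\theta\big(\log\frac{\pi_\theta}{\pi_{\theta_i}}\big)^2 = 2\big(\log\frac{\pi_\theta}{\pi_{\theta_i}}\big)\nabla_\theta\log\pi_\theta\), each sample's contribution to the gradient reduces to a scalar coefficient on \(\nabla_\theta\log\pi_\theta(y_j,z_j\mid x)\). The sketch below computes only those coefficients; the function name and interface are my own, not from the report's code:

```python
# Hedged sketch: per-sample scalar weight on grad-log-prob in the K1.5 update,
# i.e. (r_j - r_bar) - tau * log(pi_theta / pi_theta_i), obtained by expanding
# the squared-log-ratio penalty term of the gradient.
def k15_logprob_coefficients(rewards, logp_theta, logp_theta_i, tau=0.1):
    mean_r = sum(rewards) / len(rewards)   # sample mean, approximating tau*log Z
    return [
        (r - mean_r) - tau * (lp - lp_i)   # advantage minus KL-penalty pull
        for r, lp, lp_i in zip(rewards, logp_theta, logp_theta_i)
    ]
```

Read this way, the K1.5 update looks like an advantage-weighted policy gradient with an extra term pulling \(\pi_\theta\) back toward \(\pi_{\theta_i}\), which is the common framework the Summary argues for.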
