ZPPO: Zone of Proximal Policy Optimization

Teacher in Prompts, Not Gradients

  1. Byung-Kwan Lee
  2. Ximing Lu
  3. Shizhe Diao
  4. Minki Kang
  5. Saurav Muralidharan
  6. Karan Sapra
  7. Andrew Tao
  8. Pavlo Molchanov
  9. Yejin Choi
  10. Yu-Chiang Frank Wang
  11. Ryo Hachiuma
Project Lead
Code: internal use only (coming soon to public)
Models: internal use only (coming soon to public)
TL;DR

Accuracy Gain (Δ pp)

Teacher Size 27B
Student Size
Method                    10 LLM Benchmarks   16 VLM Benchmarks   5 Video Benchmarks
Off-Policy Distill        0.0                 0.0                 0.0
On-Policy Distill         0.0                 0.0                 0.0
GRPO                      0.0                 0.0                 0.0
GRPO + Teacher response   0.0                 0.0                 0.0
ZPPO (Ours)               0.0                 0.0                 0.0

†: prompt replay buffer · all experiments run on Qwen3.5

1 / 3 Off-Policy Distill and On-Policy Distill

Distillation forces the student to imitate the teacher's logits, which induces memorization of the training samples and degrades generalization to unseen samples (overfitting to both the dataset and the teacher).
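To make "imitating teacher logits" concrete, here is a minimal sketch of a per-token distillation loss. It assumes a forward KL divergence, KL(teacher || student), over the softmax of each model's logits. The source does not say which divergence is used, and the helper names `softmax` and `token_kl` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed loss form)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 3-token vocabulary; the numbers are illustrative only.
teacher = [2.0, 0.5, -1.0]
kl_match = token_kl(teacher, [2.0, 0.5, -1.0])    # student copies the teacher
kl_diverge = token_kl(teacher, [-1.0, 0.5, 2.0])  # student disagrees
```

The loss is zero only when the student reproduces the teacher's distribution exactly, which is why the objective rewards imitation rather than exploration.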

2 / 3 GRPO

RL gives the model the freedom to keep answering a question until it solves it, encouraging reasoning exploration through self-reflection such as "Wait, that step looks wrong. Let me re-check." Because the model is not forced to imitate any response, generalization is preserved. However, RL cannot learn to solve hard questions whose rollout accuracy is near zero: they are discarded forever.
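The reason near-zero-accuracy questions yield no learning signal can be seen in GRPO's group-relative advantage, which normalizes each rollout's reward by the mean and standard deviation of its group. The sketch below assumes binary correctness rewards and a small `eps` for numerical stability. The function name `group_advantages` is hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hard question: every rollout is wrong, so all rewards are equal.
hard = group_advantages([0.0] * 8)

# Easier question: one rollout out of four is correct.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])
```

When every rollout in the group is wrong, all advantages are zero and the question contributes no gradient. A single correct rollout is enough to produce a positive advantage for it and negative advantages for the rest.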

3 / 3 GRPO + Teacher response

To solve hard questions, some RL methods naively inject the teacher's response into the student's training as if it were the student's own response. This breaks the on-policy assumption and degrades generalization again.

Insight
Research Question
For hard questions, how can we transfer the teacher's knowledge to the student without imitating the teacher's logits or injecting the teacher's response directly into the student's gradient? How can we make the student solve hard questions without policy drift (degrading generalization)?
Method

Technically, we use a Replay Buffer to store hard questions, so the model revisits each hard question many times — not just once, as in GRPO. Repeated exposure strengthens the BCQ/NCQ effect on each hard question, which we expect to lift its rollout accuracy.

ZPPO method overview: questions are sampled (new + replayed); the teacher and student roll out answers; the teacher-correct plus a student-wrong rollout form a BCQ prompt and the student's wrong rollouts form an NCQ prompt; hard questions are kept in the prompt replay buffer.
  1. Batch includes new questions, replayed questions, BCQ, and NCQ. Student is RL-trained on them.
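The prompt construction described in the overview can be sketched as follows. This is an illustration only: the `Rollout` type, both builder functions, and the prompt wording are hypothetical, since the source does not give the actual BCQ and NCQ templates. It assumes each rollout carries a correctness flag.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    text: str
    correct: bool

def build_bcq_prompt(question, teacher_rollouts, student_rollouts):
    """Pair one teacher-correct rollout with one student-wrong rollout."""
    teacher_ok = next((r for r in teacher_rollouts if r.correct), None)
    student_bad = next((r for r in student_rollouts if not r.correct), None)
    if teacher_ok is None or student_bad is None:
        return None  # nothing to pair for this question
    # Hypothetical template; the real wording is not given in the source.
    return (f"Question: {question}\n"
            f"Candidate 1: {teacher_ok.text}\n"
            f"Candidate 2: {student_bad.text}\n"
            "Solve the question.")

def build_ncq_prompt(question, student_rollouts):
    """Collect the student's wrong rollouts into a single prompt."""
    wrong = [r.text for r in student_rollouts if not r.correct]
    if not wrong:
        return None
    listed = "\n".join(f"Attempt {i}: {t}" for i, t in enumerate(wrong, 1))
    # Hypothetical template; the real wording is not given in the source.
    return f"Question: {question}\n{listed}\nSolve the question."

teacher = [Rollout("x = 4", True)]
student = [Rollout("x = 5", False), Rollout("x = 7", False)]
bcq = build_bcq_prompt("Solve 2x = 8.", teacher, student)
ncq = build_ncq_prompt("Solve 2x = 8.", student)
```

In both cases the teacher's knowledge enters through the prompt text, and the student's reply to that prompt is still its own rollout.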
Results

A question is admitted to the Replay Buffer when its rollout accuracy stays below 50%, and it graduates — leaving the buffer — once that accuracy reaches 50%. ZPPO graduates far more hard questions than GRPO, and the gap is widest where the initial accuracy starts near zero.
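The admission and graduation rule above can be sketched as a small buffer keyed by question id. This is a minimal sketch under stated assumptions: the class name `PromptReplayBuffer` and its fields are hypothetical, and admission is triggered by a single below-threshold measurement, which may be simpler than the rule the authors actually use.

```python
class PromptReplayBuffer:
    """Holds hard questions until their rollout accuracy reaches the threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.questions = {}     # question id -> latest rollout accuracy
        self.graduated = set()  # question ids that have left the buffer

    def update(self, qid, rollout_correct):
        """Record one group of rollouts (list of booleans) for a question."""
        acc = sum(rollout_correct) / len(rollout_correct)
        if acc < self.threshold:
            self.questions[qid] = acc  # admit, or keep with refreshed accuracy
        elif qid in self.questions:
            del self.questions[qid]    # accuracy reached the threshold
            self.graduated.add(qid)
        return acc

buf = PromptReplayBuffer()
buf.update("q1", [False, False, False, False])  # 0% accuracy: admitted
in_buffer_first = "q1" in buf.questions
buf.update("q1", [True, True, False, False])    # 50% accuracy: graduates
buf.update("q2", [True, True, True, True])      # never hard: not admitted
```

A question that was never below the threshold is neither stored nor counted as graduated, so the graduation count reflects only questions that started out hard.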

Qualitative

BCQ + NCQ on hard questions.