Agent eXplorative Policy Optimization for Multimodal Agentic Reasoning

1NVIDIA 2KAIST
*Work done during internship at NVIDIA. Project lead.

TL;DR

[Figure: bar chart of Tool Usage and Accuracy for GRPO 8B, AXPO 8B, and Agent Baseline 32B.]
Over the course of training, GRPO 8B's tool usage shrinks, while AXPO 8B raises both tool usage and accuracy; AXPO's accuracy even surpasses that of the larger Agent Baseline 32B (Qwen3-VL-32B-Thinking).

Problem: Standard GRPO with Agent leads to Tool Collapse.

Under standard GRPO, tool-using rollouts mostly fail while tool-free rollouts mostly succeed, so the model learns to avoid tools: tool calls receive no positive signal precisely when they would help.
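The collapse described above can be illustrated with the group-relative advantage that GRPO is commonly described as using, where each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below is a toy illustration under stated assumptions, not the authors' training code: the group of four rollouts, their rewards, and the helper name `group_advantages` are hypothetical, and real implementations differ in details such as the epsilon and the choice of standard deviation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (hypothetical helper).

    Assumes a GRPO-style advantage: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one question: (used_tool, reward).
# Tool-using rollouts fail (reward 0), tool-free rollouts succeed (reward 1).
rollouts = [(True, 0.0), (True, 0.0), (False, 1.0), (False, 1.0)]

advs = group_advantages([reward for _, reward in rollouts])

# Split the advantages by whether the rollout called a tool.
tool_adv = [a for (used, _), a in zip(rollouts, advs) if used]
no_tool_adv = [a for (used, _), a in zip(rollouts, advs) if not used]

print("tool-using advantages:", tool_adv)
print("tool-free advantages: ", no_tool_adv)
```

In this toy group every tool-using rollout receives a negative advantage and every tool-free rollout a positive one, so a policy-gradient update would shift probability away from calling the tool.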

Solution: Tool-call resampling to find correct tool-using responses.

AXPO (ours) resamples tool-using responses to find correct ones. As a result, tool usage recovers, the model learns to actually use tools, and performance climbs across nine multimodal benchmarks.

+7.9pp Pass@1 at 8B vs. baseline
+6.2pp Pass@4 at 8B vs. baseline
4× fewer params: beats 32B Base @ Pass@4
9 multimodal benchmarks (Reasoning · Perception · Search)

Method

[Figure: method diagram. Legend: positive reward, negative reward, correct, wrong, frozen prefix.
GRPO example: Question, tool-think, tool call, output; labeled "Tool Collapse".
AXPO Step 1: take the full GRPO trajectory (Question, tool-think, tool call, output).
AXPO Step 2: resample tool calls from the Question and tool-think prefix (tool call₁ / output₁, tool call₂ / output₂, tool call₃ / output₃); labeled "Tool-usage recovered".]
AXPO method diagram: tool-call resampling re-rolls a frozen question + tool-think prefix after a tool-collapsed GRPO rollout, recovering tool usage without sacrificing reward signal.
AXPO at a glance. When a GRPO rollout tool-collapses on the way to a wrong answer, AXPO freezes the Question + tool-think prefix and re-rolls only the tail, forcing the policy to actually emit a tool call. Step 1 and Step 2 above visualize this procedure.
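The resampling step described above can be sketched as a loop that keeps the prefix fixed and draws several new tails. This is a minimal sketch under assumptions, not the authors' implementation: `Rollout`, `resample_tool_calls`, the toy sampler, the toy correctness check, and the tool names in the strings are all hypothetical stand-ins for a real policy and reward function. How the resampled rollouts enter the policy update is not specified here and is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prefix: str    # question + tool-think, kept frozen
    tail: str      # tool call + tool output + final answer
    correct: bool  # whether the final answer was judged correct


def resample_tool_calls(prefix, sample_tail, is_correct, num_resamples=3):
    """Re-roll only the tail after a frozen prefix (hypothetical helper)."""
    rollouts = []
    for _ in range(num_resamples):
        tail = sample_tail(prefix)  # the prefix itself is never modified
        rollouts.append(Rollout(prefix, tail, is_correct(tail)))
    return rollouts


# Toy stand-ins for the policy and the reward; both are assumptions.
rng = random.Random(0)


def toy_sample_tail(prefix):
    return rng.choice([
        "crop(image) -> answer A",
        "zoom(image) -> answer B",
        "search(query) -> answer C",
    ])


def toy_is_correct(tail):
    return tail.endswith("answer B")


prefix = "Question: what is shown? <tool-think> I should inspect the image. "
resampled = resample_tool_calls(prefix, toy_sample_tail, toy_is_correct)

# Correct tool-using rollouts found by resampling, if any.
recovered = [r for r in resampled if r.correct]
print(f"{len(recovered)} of {len(resampled)} resampled rollouts are correct")
```

Because every resampled rollout shares the frozen prefix and contains a tool call by construction, any correct one supplies a positive example of tool use for that question.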

Highlight

(a) Only AXPO lifts both tool usage and accuracy

Tool-usage rate vs. accuracy at the 8B scale: GRPO trades tool usage for a small accuracy gain, while AXPO lifts both, matching the 4× larger Qwen3-VL-32B Thinking baseline.
Starting from Qwen3-VL-8B Thinking, GRPO sacrifices tool usage for a small accuracy gain, while AXPO improves both, matching the 4× larger Qwen3-VL-32B Thinking baseline. (Note that SFT is applied before both GRPO and AXPO training.)

(b) AXPO narrows the agentic gap to a 4× larger baseline

Figure 1: AXPO scaling on Pass@1 and Pass@4 across model sizes.

Average Pass@1 (left) and Pass@4 (right) over nine multimodal benchmarks across Qwen3-VL-Thinking scales. At 8B, SFT + AXPO surpasses the 32B Base on Pass@4 while outperforming GRPO at every scale. Lines: Base (no train, gray) · SFT (peach) · SFT + GRPO (red) · SFT + AXPO (blue, ours).
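For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled responses is correct. The page does not state how Pass@1 and Pass@4 are computed, so the sketch below is an assumption: it uses the unbiased estimator widely used in code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, and averages it over questions. The function names and the toy counts are hypothetical.

```python
from math import comb
from statistics import mean


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_pass_at_k(per_question, k):
    """Mean Pass@k over (n, c) pairs, one pair per question."""
    return mean(pass_at_k(n, c, k) for n, c in per_question)


# Toy data: three questions, four samples each, with 1, 0, and 4 correct.
toy_counts = [(4, 1), (4, 0), (4, 4)]

avg_pass1 = average_pass_at_k(toy_counts, 1)
avg_pass4 = average_pass_at_k(toy_counts, 4)
print(f"Pass@1 = {avg_pass1:.3f}, Pass@4 = {avg_pass4:.3f}")
```

Pass@4 is never lower than Pass@1 on the same samples, which is why the two panels are reported separately.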

Full Results


Qualitative Examples — GRPO vs. AXPO