NVIDIA-AXPO

TL;DR

[Animated bar chart: Tool Usage and Accuracy for GRPO 8B, AXPO 8B, and the Agent Baseline 32B.]

Over the course of training, GRPO 8B's tool usage shrinks, while AXPO 8B increases both tool usage and accuracy. AXPO's accuracy even surpasses that of the Agent Baseline 32B reference (Qwen3-VL-32B-Thinking, 4× larger).

Problem: Standard GRPO training of an agent leads to Tool Collapse.

Under standard GRPO, rollouts that use tools mostly fail while rollouts that skip tools mostly succeed, so the model learns to avoid tools: the tool call receives no positive signal precisely when it would help.
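To see why tool calls end up penalized, recall that GRPO scores each rollout relative to the other rollouts sampled for the same question. The sketch below is a minimal illustration, not the AXPO code: it assumes a binary correctness reward, standardization by the population standard deviation, and a hypothetical group of six rollouts. The helper `group_advantages` and the rollout data are invented for this example.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group for one question: (used_tool, reward).
# Tool-using rollouts all fail; most no-tool rollouts succeed.
rollouts = [
    (True, 0.0), (True, 0.0), (True, 0.0),
    (False, 1.0), (False, 1.0), (False, 0.0),
]

advantages = group_advantages([reward for _, reward in rollouts])
tool_adv = [a for (used, _), a in zip(rollouts, advantages) if used]
no_tool_adv = [a for (used, _), a in zip(rollouts, advantages) if not used]

print("mean advantage, tool rollouts:   ", round(mean(tool_adv), 3))
print("mean advantage, no-tool rollouts:", round(mean(no_tool_adv), 3))
```

In this toy group every tool-using rollout receives a negative advantage, so the policy gradient pushes probability mass away from emitting tool calls at all.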

Solution: Tool-call Resampling to find correct ones.

AXPO (ours) resamples tool-using responses to find correct ones. As a result, tool usage recovers, the model learns to actually use tools, and performance climbs across nine multimodal benchmarks.

Method

Legend: positive reward (+) · negative reward (−) · correct (✓) · wrong (✗) · frozen prefix

GRPO Example

Question →

tool-think − → tool call − → output − → ✗

Tool Collapse

AXPO Step 1 — Take the full GRPO trajectory

Question →

tool-think − + → tool call − → output − → ✗

Tool-usage recovered

Step 2 — Resample tool calls from Question and tool-think prefix

Question + tool-think +

tool call₁ − → output₁ − → ✗

tool call₂ − → output₂ − → ✗

tool call₃ + → output₃ + → ✓

Tool-usage recovered

Figure: AXPO method diagram. Tool-call resampling re-rolls a frozen question + tool-think prefix after a tool-collapsed GRPO rollout, recovering tool usage without sacrificing reward signal.

**AXPO at a glance.** When a GRPO rollout tool-collapses on the way to a wrong answer, AXPO freezes the *Question + Tool-think* prefix and re-rolls only the tail, forcing the policy to actually emit a tool call. Step 1 and Step 2 above visualize this procedure.
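The resampling step can be sketched as a small loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: `resample_tool_calls`, `toy_policy`, and `toy_verifier` are hypothetical names, the policy is replaced by canned tails that mirror the three attempts in the diagram, and the sketch covers only sampling, not how rewards are credited to the frozen prefix versus the re-rolled tail.

```python
from itertools import cycle

def resample_tool_calls(prefix, sample_tail, is_correct, k):
    """Keep `prefix` frozen and re-roll only the tail k times.

    prefix      -- Question + tool-think text, never regenerated
    sample_tail -- callable(prefix) -> tail (tool call, output, answer)
    is_correct  -- callable(tail) -> bool
    Returns a list of (tail, reward) pairs, one per attempt.
    """
    attempts = []
    for _ in range(k):
        tail = sample_tail(prefix)
        reward = 1.0 if is_correct(tail) else 0.0
        attempts.append((tail, reward))
    return attempts

# Hypothetical stand-in for the policy: canned tails in a fixed order.
_canned_tails = cycle([
    "tool_call_1 -> output_1 -> wrong",
    "tool_call_2 -> output_2 -> wrong",
    "tool_call_3 -> output_3 -> correct",
])

def toy_policy(prefix):
    return next(_canned_tails)

def toy_verifier(tail):
    return tail.endswith("correct")

frozen_prefix = "Question + tool-think"
attempts = resample_tool_calls(frozen_prefix, toy_policy, toy_verifier, k=3)
recovered = [tail for tail, reward in attempts if reward == 1.0]
print(recovered)
```

Because the prefix is reused verbatim, every attempt starts from the point where the model has already decided to call a tool, so a correct tail gives that decision a positive example to learn from.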

Highlight

(a) Only AXPO lifts both tool usage and accuracy

Figure: tool-usage rate vs. accuracy at the 8B scale. Starting from Qwen3-VL-8B-Thinking, GRPO sacrifices tool usage for a small accuracy gain, while **AXPO** improves both, matching the **4× larger** Qwen3-VL-32B-Thinking baseline. (SFT is applied before both GRPO and AXPO training.)
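The two axes of this plot can be computed from rollout logs in a few lines. The page does not define either metric precisely, so the sketch below assumes that tool-usage rate is the fraction of rollouts with at least one tool call and that accuracy is the fraction of rollouts with a correct final answer. The record fields `tool_calls` and `correct` and the sample data are hypothetical.

```python
def summarize(rollouts):
    """Return (tool_usage_rate, accuracy) for a list of rollout records."""
    n = len(rollouts)
    tool_rate = sum(1 for r in rollouts if r["tool_calls"] > 0) / n
    accuracy = sum(1 for r in rollouts if r["correct"]) / n
    return tool_rate, accuracy

# Hypothetical logs: only one rollout out of four calls a tool.
logs = [
    {"tool_calls": 0, "correct": True},
    {"tool_calls": 0, "correct": True},
    {"tool_calls": 1, "correct": True},
    {"tool_calls": 0, "correct": False},
]
tool_rate, accuracy = summarize(logs)
print(f"tool usage: {tool_rate:.2f}, accuracy: {accuracy:.2f}")
```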

(b) AXPO narrows the agentic gap to a 4× larger baseline

Figure 1: AXPO scaling on Pass@1 and Pass@4 across model sizes.

Average Pass@1 (left) and Pass@4 (right) over nine multimodal benchmarks across Qwen3-VL-Thinking scales. At 8B, SFT + AXPO surpasses the 32B Base on Pass@4 while outperforming GRPO at every scale. Lines: Base (no training, gray) · SFT (peach) · SFT + GRPO (red) · SFT + AXPO (blue, ours).
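For readers unfamiliar with Pass@k: it is the probability that at least one of k sampled responses is correct. The page does not say how Pass@1 and Pass@4 are computed here; the sketch below uses the widely used unbiased estimator 1 − C(n−c, k) / C(n, k) for n samples with c correct, which is an assumption about the evaluation protocol. The function name `pass_at_k` is chosen for this example.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) is 0 whenever fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With four rollouts and a single correct one:
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 4))  # 1.0
```

A benchmark-level score would then be the mean of this value over all questions, and the reported number the mean over benchmarks.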

Full Results

[Interactive table: full results, selectable by model size and benchmark.]

Qualitative Examples — GRPO vs. AXPO

Example 1 / 3 · Perception · VisualProbe-hard · GT: 2563

Q. "What is the number to the left of 1000?"

A busy Hong Kong street scene with tiny telephone-number signs.

GRPO 999 ✗

Step 0 — think. "… checking traffic lights … license plates … bus route numbers … none match 1000 … maybe the question is a trick. The number to the left of 1000 in numerical order is 999 … Let me try image_zoom_in … [reasoning continues without ever emitting a tool call] … Given that this is a common math question, the answer is likely 999."

No tool call was emitted.

Answer. 999

AXPO 2563 ✓

Step 0 — think. "… checking phone numbers and license plates … under SUNBEAM THEATRE there's a section with 'Tel: 2563 1000' … Let me confirm with image_zoom_in."

Step 0 — tool call.

image_zoom_in({"bbox_2d": [870, 450, 945, 517],
               "label": "Telephone number sign", "img_idx": 0})

Step 1 — think. "The cropped image clearly shows 'Tel: 2563 1000'. The number to the left of 1000 is 2563."

Answer. 2563

Both prefixes mention image_zoom_in as a candidate, but only AXPO commits. GRPO talks itself into a numerical-order guess and never invokes the tool — same pattern across all four GRPO rollouts.

Example 2 / 3 · Reasoning · MathVision · GT: 8

Q. "Three semi-circles as shown. ABEF is a rectangle and the radius of each semi-circle is 2 cm. E and F are the centers of the bottom semi-circles. The area of the shaded region (in cm²) is:"

Three overlapping semi-circles inscribed in rectangle ABEF.

GRPO 4 ✗

Step 0 — think. "… Top semicircle area = 2π. Each side gives a sector of π/2 minus a right triangle of area 2, so each segment is π−2. Subtracting both: 2π − 2(π−2) = 4 … so the shaded area is 4."

No tool calls were emitted.

Answer. 4

AXPO 8 ✓

Step 0 — think. "… Setting E=(0,0), F=(4,0), top semicircle has center (2,2) and radius 2. Sector minus triangle gives an overlap of π−2 … Shaded = 2π − 2(π−2) = 4 … Let me also visualise the configuration to confirm."

Step 0 — tool call.

python({"code": "import matplotlib.pyplot as plt;
import numpy as np; ax.plot(x1, np.sqrt(4-x1**2), 'b');
ax.plot(x3, 2-np.sqrt(4-(x3-2)**2), 'r'); …"})

Matplotlib output showing semicircle configuration.

Step 1 — think. "… the bottom semicircles curve upward (forming quarter-disks at corners E,F, total 2π), the top semicircle curves downward from AB (a half-disk of area 2π). Areas match — by area-swap the shaded region rearranges to the full rectangle ABEF: 4 × 2 = 8."

Answer. 8

The algebra is sound but the geometry is misread. AXPO's Python visualization makes the area-swap obvious; GRPO stays in algebra and commits to the wrong sub-region (all four rollouts are wrong: 4, 4, 4, 2π).

Example 3 / 3 · Search · HR-MMSearch · GT: 1984

Q. "In which year was the mayor, who will be serving in August 2025, born in the city where the fashion designer of this fashion show was born?"

Fashion show backdrop reading PAUL COSTELLOE.

GRPO 1945 ✗

Step 0 — think. "Designer is Paul Costelloe; born in Dublin in 1945. Mayor of Dublin in 2025 — the search results don't directly show this, but if the riddle is that the mayor is the same age as the designer, the answer would be 1945."

Step 0 — tool call.

tavily-search({"query": "Paul Costelloe birthplace", …})

Tool output. "… Born in Dublin, Ireland … Paul Costelloe (23 June 1945 – 21 November 2025) …"

Step 1 — think. "… search returned the designer's info but not the mayor's. I'll commit to 1945 since the designer was 80 in 2025 and the question may be a riddle."

No second-hop search.

Answer. 1945

AXPO 1984 ✓

Step 0 — think. "Backdrop reads PAUL COSTELLOE; designer is Paul Costelloe. I need (i) his birthplace, (ii) the mayor of that city in August 2025, (iii) the mayor's birth year."

Step 0 — tool call.

tavily-search({"query": "Paul Costelloe birthplace",
              "max_results": 5})

Tool output. "… Born in Dublin, Ireland … Paul Costelloe (23 June 1945 – 21 November 2025) …"

Step 1 — think. "Birthplace = Dublin. Now the Lord Mayor of Dublin in August 2025."

Step 1 — tool call.

tavily-search({"query": "current mayor of Dublin Ireland 2024",
              "max_results": 5})

Tool output. "Ray McAdam (born 1984) … has served as Lord Mayor of Dublin since June 2025."

Answer. 1984

The two-hop structure (designer → city → mayor → birth year) requires a follow-up tavily-search after the first hop. AXPO consistently issues it; all four GRPO rollouts stop after the first hop and fall back to riddle-style guesses (1945, 1963, 1968).
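The chained lookup can be sketched with a stubbed search tool. This is a toy illustration only: `stub_search` is a hypothetical stand-in for the real search tool, the canned snippets are abridged from the tool outputs quoted above, and the second-hop query string is invented for the example rather than taken from the model's trace.

```python
import re

# Canned results standing in for a live search backend (hypothetical).
CANNED = {
    "Paul Costelloe birthplace": "Born in Dublin, Ireland",
    "Lord Mayor of Dublin August 2025":
        "Ray McAdam (born 1984) has served as Lord Mayor of Dublin since June 2025.",
}

def stub_search(query):
    """Return a canned snippet for a known query, else an empty string."""
    return CANNED.get(query, "")

def answer_two_hop(designer):
    # Hop 1: designer -> birthplace city.
    hop1 = stub_search(f"{designer} birthplace")
    city = re.search(r"Born in (\w+)", hop1).group(1)
    # Hop 2: city -> mayor in August 2025 -> birth year.
    hop2 = stub_search(f"Lord Mayor of {city} August 2025")
    return int(re.search(r"born (\d{4})", hop2).group(1))

print(answer_two_hop("Paul Costelloe"))
```

Stopping after the first hop leaves only the designer's own birth year (1945) in context, which is exactly the value the GRPO rollout falls back to.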

Agent eXplorative Policy Optimization
