Unified Reinforcement and Imitation Learning
for Vision Language Models


NeurIPS 2025
Importance of RIL. Vision-language models (VLMs) have rapidly emerged from the success of instruction-tuned large language models, enabling multimodal understanding by generating human-like, visually grounded text responses. Yet most advances, whether scaling up model size, enlarging instruction datasets, or adding architectural and "think-answer" reasoning modules, dramatically increase inference latency and memory demands, making such powerful VLMs impractical for mobile and other resource-constrained environments. To overcome these challenges, we introduce Unified Reinforcement and Imitation Learning (RIL), a training algorithm that forgoes architectural changes and verbose reasoning steps. RIL teaches a compact "student" VLM to emulate a larger "teacher" through an LLM-based discriminator that issues similarity rewards, combined with reinforcement signals for factual accuracy, yielding lightweight models with state-of-the-art performance and low latency.

Figure 1: Performance improvements (%) of Qwen2.5-VL-7B [1] across vision-language evaluation benchmarks, together with the average score over the 14 evaluation benchmarks used in Table 1. Note that we run RL with GRPO and its improved variant, Dr.GRPO, using only answer rewards from an LLM-as-a-Judge, whereas RIL uses similarity rewards from one or more large teacher VLMs together with answer rewards.
Related Works. Imitation Learning (IL) in robotics aims to replicate expert behavior. Generative Adversarial Imitation Learning (GAIL) is a key framework in which a generator mimics expert actions/trajectories and a discriminator distinguishes them. GAIL uses adversarial training to align the generator's behavior with the expert's, evaluating performance via discriminator scores without explicit reward functions. Our approach, Reinforcement and Imitation Learning (RIL), adapts this idea to vision-language models (VLMs) with four key modifications: combining GRPO and GAIL through an explicit reward design, stabilizing the discriminator's output scores, using an LLM-as-a-Judge for reward assessment, and updating the student VLM via GRPO on text responses generated by teacher VLMs. RIL stabilizes training, allows student VLMs to potentially outperform their teachers, and does not require explicit think-answer processes.

Figure 2: Comparing RIL-applied VLMs trained with multiple large teacher VLMs against diverse open- and closed-source VLMs, in terms of average performance across numerous vision-language evaluation benchmarks.
Rewards to Compute: RIL uses two binary reward signals for each generated response o_i to a prompt q (see the sketch after this list):
  • Similarity Reward: 1(D(q, o_i) < 0.5), where D is the discriminator trained to distinguish student from teacher outputs; the reward is 1 when the discriminator's score falls below 0.5, i.e., when the student's response is judged similar to the teacher's.
  • Answer Reward: LLM-as-a-Judge(q, a, o_i), where the LLM judge evaluates the factual correctness of the response o_i given the question q and the ground-truth answer a, awarding 1 if correct.
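As a rough illustration of how these two signals combine into a scalar reward, the sketch below scores a batch of student responses. The `discriminator_score` and `llm_judge_is_correct` callables are hypothetical interfaces standing in for the trained discriminator and the LLM-as-a-Judge, not the paper's actual implementation.

```python
from typing import Callable, List

def compute_rewards(
    prompts: List[str],
    responses: List[str],
    answers: List[str],
    discriminator_score: Callable[[str, str], float],      # D(q, o_i) in [0, 1]; hypothetical interface
    llm_judge_is_correct: Callable[[str, str, str], bool],  # LLM-as-a-Judge(q, a, o_i); hypothetical interface
) -> List[float]:
    """Return the combined binary reward (similarity + answer) for each response o_i."""
    rewards = []
    for q, o_i, a in zip(prompts, responses, answers):
        # Similarity reward: 1 if the discriminator scores the student response
        # below 0.5, i.e., it looks like a teacher response; otherwise 0.
        r_sim = 1.0 if discriminator_score(q, o_i) < 0.5 else 0.0
        # Answer reward: 1 if the LLM judge deems o_i factually correct
        # given question q and ground-truth answer a; otherwise 0.
        r_ans = 1.0 if llm_judge_is_correct(q, a, o_i) else 0.0
        rewards.append(r_sim + r_ans)
    return rewards
```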

Figure 3: Training dynamics of small VLMs during RIL. (Left) Evolution of similarity rewards over training iterations. (Mid) Accuracy rewards obtained using the LLM-as-a-Judge, ensuring that generated responses are both contextually appropriate and factually correct. (Right) Overall average performance across the evaluation benchmarks.
Training Overview: RIL training begins by warm-starting the student VLM with supervised fine-tuning and pretraining a binary discriminator to distinguish student from teacher outputs. Thereafter, each training batch alternates between (1) rolling out a set of student responses alongside cached teacher responses, (2) updating the discriminator on these mixed examples to sharpen its ability to tell them apart, (3) computing a combined reward for each response, comprising the discriminator's similarity score plus an LLM-as-a-Judge's factuality assessment, and (4) updating the student policy via Dr.GRPO (a PPO-style algorithm with clipping and a KL penalty) to maximize expected cumulative reward. Everything runs with vLLM for fast generation on a multi-GPU DeepSpeed setup, endowing a compact student model with both teacher-like style and high answer accuracy without added inference latency. A sketch of the discriminator update in step (2) follows below.
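To make step (2) concrete, here is a minimal sketch of one discriminator update pass, assuming the discriminator is a small sequence classifier with a single-logit head over (prompt, response) text pairs; the interfaces and training details are assumptions for illustration, not the exact setup used in the paper.

```python
import torch
import torch.nn.functional as F

def update_discriminator(discriminator, tokenizer, optimizer,
                         prompts, student_responses, teacher_responses,
                         device="cuda"):
    """One binary-classification pass: label teacher outputs 1 and student outputs 0."""
    texts, labels = [], []
    for q, o_student, o_teacher in zip(prompts, student_responses, teacher_responses):
        texts += [q + "\n" + o_student, q + "\n" + o_teacher]
        labels += [0.0, 1.0]  # 0 = student, 1 = teacher

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    targets = torch.tensor(labels, device=device)

    logits = discriminator(**batch).logits.squeeze(-1)   # assumes a single-logit classification head
    loss = F.binary_cross_entropy_with_logits(logits, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With this convention, D(q, o_i) = sigmoid(logit), so a student response scoring below 0.5 is the one that fools the discriminator and earns the similarity reward defined above.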

Figure 4: Comparing RIL-applied VLMs with large open-source and closed-source VLMs.
Algorithm. The algorithm starts by loading a pre-trained student model and a pre-trained binary discriminator, along with a frozen cache of teacher-generated responses. For each batch of image-question pairs, it first makes a copy of the student model and uses it to generate a small set of candidate rollout answers, while retrieving the corresponding teacher answers from the cache. It then refines the discriminator over several passes, training it to tell the student's outputs apart from the teacher's. Once the discriminator is sharpened, the algorithm scores all student and teacher outputs using both the discriminator's judgment and an external language-model judge, turning these scores into reward signals. Finally, it updates the student model itself, again over several passes, using a reinforcement-style optimizer that seeks to maximize those combined rewards (a simplified sketch of this update step follows below).
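As a rough, simplified sketch of that final policy-update step, the snippet below computes group-relative advantages from the combined rewards and a PPO-style clipped surrogate loss for one group of rollouts. It omits the KL penalty, token masking, and Dr.GRPO's specific normalization details, and all names and tensor shapes are assumptions rather than the paper's exact implementation.

```python
import torch

def group_relative_policy_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    """
    Simplified group-relative, PPO-style clipped loss (KL penalty omitted).
    logprobs_new / logprobs_old: (G, T) per-token log-probs for G rollouts of one prompt.
    rewards: (G,) combined similarity + answer rewards for the group.
    """
    # Group-relative advantage: compare each rollout's reward to the group mean.
    advantages = (rewards - rewards.mean()).unsqueeze(-1)        # (G, 1), broadcast over tokens

    ratio = torch.exp(logprobs_new - logprobs_old)               # importance ratios, (G, T)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Maximize the clipped surrogate, i.e., minimize its negative.
    return -torch.min(unclipped, clipped).mean()
```

In the full algorithm this loss would be computed per prompt group, combined with the KL penalty against a reference policy mentioned in the training overview, and averaged over the batch.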
Limitation: RIL is a post-instruction-tuning alignment phase; it has not yet been integrated into the initial visual instruction-tuning stage. As a result, its discriminator-driven alignment benefits are not leveraged during that earlier, pretraining-like stage, where they could further improve instruction following.