Importance of RIL.
Vision-language models (VLMs) have rapidly emerged from the success of instruction-tuned large language models, enabling multimodal understanding by generating human-like, visually grounded text responses. Yet most advances, whether from scaling up model size, enlarging instruction datasets, or adding architectural and “think-answer” reasoning modules, dramatically increase inference latency and memory demands, making such powerful VLMs impractical for mobile and other resource-constrained environments. To overcome these challenges, we introduce Unified Reinforcement and Imitation Learning (RIL), a training algorithm that forgoes architectural changes and verbose reasoning steps. RIL teaches a compact “student” VLM to emulate a larger “teacher” through an LLM-based discriminator that issues similarity rewards, combined with reinforcement signals for factual accuracy, yielding lightweight models with state-of-the-art performance and low latency.
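To make the unified objective concrete, the sketch below shows one plausible way the two signals could be combined into a single scalar reward for a policy-gradient update on the student. All names, the exact-match accuracy check, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's actual implementation.

```python
# Conceptual sketch of a unified RIL-style reward (assumptions, not the
# paper's code): the LLM-based discriminator is assumed to return the
# probability that a response came from the teacher, and factual accuracy
# is approximated here by exact match against a reference answer.

def imitation_reward(discriminator_prob_teacher: float) -> float:
    """Similarity reward: higher when the discriminator judges the
    student's response to be indistinguishable from the teacher's."""
    return discriminator_prob_teacher

def accuracy_reward(student_answer: str, reference_answer: str) -> float:
    """Reinforcement signal for factual accuracy; exact match is a
    stand-in for whatever verifier the method actually uses."""
    return 1.0 if student_answer.strip() == reference_answer.strip() else 0.0

def ril_reward(discriminator_prob_teacher: float,
               student_answer: str,
               reference_answer: str,
               alpha: float = 0.5,   # imitation weight (assumed)
               beta: float = 0.5) -> float:  # reinforcement weight (assumed)
    """Unified scalar reward: a weighted sum of the imitation and
    reinforcement signals, suitable for driving a standard
    policy-gradient update on the student VLM."""
    return (alpha * imitation_reward(discriminator_prob_teacher)
            + beta * accuracy_reward(student_answer, reference_answer))

# Example: a teacher-like response (discriminator p = 0.9) that also
# matches the reference answer receives a near-maximal combined reward.
print(ril_reward(0.9, "a red cube", "a red cube"))  # 0.95
```

Under this framing, the imitation term pulls the student's response distribution toward the teacher's style and coverage, while the reinforcement term anchors it to verifiable correctness; how the two are actually weighted and optimized is specified by the method itself.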