VLsI: Verbalized Layers-to-Interactions
from Large to Small Vision Language Models

*Work Done during Internship
Corresponding Author
1NVIDIA,    2KAIST
Summary: The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
Motivation: Open-source vision-language models (VLMs) like LLaVA-OneVision and Qwen2-VL have shown performance improvements by scaling up to larger sizes such as 72B, but their computational demands hinder deployment in resource-constrained environments like mobile devices and robots. Designing efficient VLMs capable of complex tasks without heavy hardware requirements remains a key challenge. Previous methods, such as adding specialized modules or modifying architectures, increase engineering complexity and struggle with advanced visual reasoning tasks, as shown on recent challenging benchmarks such as MM-Vet and MMMU. This prompts the question of whether comparable or superior performance can be achieved without scaling or structural modifications.

Figure 1: Performance overview of VLsI on vision-language benchmarks. (a) Accuracy on MM-Vet for various model sizes, showing that VLsI (2B and 7B) achieves competitive performance compared to proprietary closed-source VLMs. (b) Comparative evaluation on multiple challenging benchmarks, where VLsI (green and blue) outperforms leading closed-source VLMs, including GPT-4V, Claude-3.5-Sonnet, and Gemini-1.5-Pro, highlighting its efficiency and effectiveness across diverse tasks.
Method: We present VLsI: Verbalized Layers-to-Interactions, a new VLM family that leverages an innovative, natural language-based distillation process to efficiently transfer knowledge from large to small VLMs. Unlike traditional distillation methods, which often directly imitate outputs from a larger model, VLsI introduces a layer-wise approach where each intermediate layer generates verbal responses in natural language space, enhancing interpretability and alignment with larger models. This is achieved through a three-step process: (1) the verbalization step, which uses "verbalizers" to project intermediate features into the language space, making them interpretable as text-based responses; (2) the interaction step, which performs adaptive layer matching to align the reasoning progression between large and small VLMs; and (3) the reinforcement step, which finetunes the distilled VLMs for task-specific instruction-following responsiveness.
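As a rough illustration of the verbalization step, the sketch below shows how a per-layer verbalizer might project intermediate hidden states into vocabulary space and be trained with an autoregressive loss against the target response. The names (Verbalizer, verbalization_loss) and the adapter-plus-head design are hypothetical choices for illustration, not the official VLsI implementation.

```python
# Minimal sketch of the verbalization step, assuming hypothetical module and
# function names; the actual VLsI implementation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Verbalizer(nn.Module):
    """Projects one intermediate layer's hidden states into vocabulary space
    so that the layer can 'speak' a natural-language response."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)         # lightweight adapter
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from one intermediate layer
        return self.lm_head(self.proj(hidden_states))         # (batch, seq_len, vocab_size)


def verbalization_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Autoregressive (next-token) loss aligning a layer's verbalized output
    with the target text response; prompt/padding positions carry label -100."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = target_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```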

Figure 2: Illustration of the training process in VLsI, showing (a) the verbalization step and (b) the interaction step. (a) In the verbalization step, intermediate layers in both the large- and small-backbone VLMs are equipped with a "verbalizer", allowing their outputs to be projected into natural language space. Autoregressive loss is applied to align these verbalized outputs with the target responses. (b) In the interaction step, each intermediate layer in the small-backbone VLM searches for a matching layer in the large backbone VLM within a specified range. For example, once the 2nd layer of the small VLM is matched with the 4th layer in the large VLM, the next matching search for the 3rd layer in the small VLM will proceed from the 5th to the 7th layers of the large VLM, ensuring progressive alignment.
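The progressive matching described in Figure 2(b) can be sketched as a simple forward-moving search: each small-VLM layer picks its best match within a window that starts right after the previously matched large-VLM layer. In the toy function below, the similarity criterion (a precomputed per-pair loss) and the window size of 3 are assumptions made for illustration.

```python
# Illustrative sketch of progressive layer matching in the interaction step;
# the paper's actual matching criterion may differ.
from typing import List


def progressive_layer_match(pair_losses: List[List[float]], window: int = 3) -> List[int]:
    """For each small-VLM layer, pick the best-matching large-VLM layer within a
    forward-moving window so that matched indices increase monotonically.

    pair_losses[i][j] is an (assumed precomputed) distillation loss between
    small-VLM layer i and large-VLM layer j; a lower value means a better match.
    Assumes the large VLM has enough remaining layers for every search window.
    """
    matches: List[int] = []
    start = 0
    num_large_layers = len(pair_losses[0])
    for losses in pair_losses:
        end = min(start + window, num_large_layers)
        candidates = losses[start:end]
        best = start + min(range(len(candidates)), key=candidates.__getitem__)
        matches.append(best)
        start = best + 1   # the next layer's search begins just after this match
    return matches


# Mirroring Figure 2(b) with 1-based layer numbering: if the 2nd small-VLM layer
# matches the 4th large-VLM layer, the 3rd small-VLM layer searches layers 5-7.
```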
Contribution: We validate VLsI's effectiveness across ten challenging benchmarks, demonstrating significant performance gains of 11.0% (2B model) and 17.4% (7B model) over GPT-4V. Notably, these improvements are achieved without increasing model size, merging modules, or modifying the architecture. Consequently, VLsI offers a practical, deployable solution for on-device applications in resource-constrained environments. Furthermore, VLsI is easy to implement and adaptable across different model architectures: it achieves a 19.7% improvement with Qwen2-VL backbones (2B and 7B model sizes) and a 34.5% improvement with LLaVA-OV backbones (0.5B and 7B model sizes) on challenging benchmarks like MMB, MM-Vet, and MMMU.
Verbalization Example: Figure 3 illustrates the verbal responses generated at each intermediate layer of the small-backbone VLM and of VLsI. By using the verbalized outputs to trace each layer's interpretive progression, this comparison highlights how both models gradually build understanding across layers. At the shallower layers, both models generate basic descriptions, focusing on large, simple shapes and colors. However, as VLsI progresses to mid-level layers, it begins to recognize and articulate more complex visual structures, such as labeled shapes and their relative positions. In contrast, the small-backbone VLM's verbal responses remain relatively vague or repetitive, often lacking specific relational details. By the deeper layers, VLsI demonstrates a clear advantage: its verbalizations shift towards identifying the correct pattern, explicitly referring to shapes and colors in alignment with the target response: "star with a dot". Meanwhile, the small-backbone VLM incorrectly predicts the missing image as a "diamond with a dot", failing to capture the specific pattern. This example underscores the effectiveness of VLsI's layer-wise verbalization, where each stage of verbal responses helps the small-backbone VLM align with the larger one.

Figure 3: Example of verbalized outputs from each intermediate target layer in an alternative small-backbone VLM (without VLsI enhancements) and in VLsI. The visual question prompts the VLM to predict the missing image in a sequence pattern. The outputs illustrate how each layer progressively interprets the visual cues, with VLsI accurately identifying the answer as 'a star with a dot' in the final layer, while the alternative small-backbone VLM incorrectly predicts 'a diamond with a dot'. This demonstrates the improved interpretative capability of VLsI through layer-wise, language-based distillation.
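To produce a trace like Figure 3, each intermediate layer's verbalized output is decoded into text. The sketch below assumes the per-layer hidden states and trained verbalizers are already available and uses a simple greedy readout; the function and argument names are illustrative rather than the paper's actual interface.

```python
# Sketch of tracing per-layer verbalized responses at inference time; names
# and the greedy readout are assumptions for illustration.
from typing import Callable, List
import torch


@torch.no_grad()
def trace_layerwise_responses(
    layer_hidden_states: List[torch.Tensor],           # one (1, seq_len, dim) tensor per target layer
    verbalizers: List[Callable[[torch.Tensor], torch.Tensor]],
    decode_fn: Callable[[List[int]], str],             # e.g. tokenizer.decode
) -> List[str]:
    """Greedily read out each intermediate layer's verbalized response so the
    layer-by-layer interpretive progression can be inspected as text."""
    responses: List[str] = []
    for hidden, verbalizer in zip(layer_hidden_states, verbalizers):
        logits = verbalizer(hidden)                    # (1, seq_len, vocab_size)
        token_ids = logits.argmax(dim=-1)[0].tolist()  # greedy per-position readout
        responses.append(decode_fn(token_ids))
    return responses
```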
Importance of Large VLM's Performance: Figure 4 reports performance on two challenging evaluation benchmarks: MM-Vet and MMMU. Each cell shows the result for one backbone configuration, where orange-colored values denote LLaVA-OV-based VLsI results and purple ones denote Qwen2-VL-based results. The figure reveals a consistent trend: using larger large- and small-backbone VLMs improves VLsI's performance across all configurations.

Figure 4: Comparison of performance on MM-Vet and MMMU across different model size combinations in large and small backbone VLMs. Each cell shows the evaluation results for various interaction configurations between 0.5B, 2B, and 7B small backbone VLMs trained with either Qwen2-VL or LLaVA-OV as the large-backbone VLM.
Matching Statistics: Figure 5 illustrates that, as the interaction step progresses, the small-backbone VLM gradually learns from the responses of deeper layers in the large-backbone VLM, which can be interpreted as accelerating its progression toward an answer.

Figure 5: Distribution changes of the matched indices between small-backbone and large-backbone VLMs at the interaction step. The left figure shows the distribution at the beginning of training, while the right figure shows it at the end.
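A distribution like Figure 5 can be obtained by recording which large-VLM layer each small-VLM layer matches at every logged training step and histogramming those indices. The helper below is an illustrative bookkeeping convention, not part of the VLsI method itself.

```python
# Sketch of gathering matched-layer statistics as in Figure 5; the logging
# format and toy numbers are assumptions for illustration.
from collections import Counter
from typing import Dict, List


def matched_index_histogram(matched_indices_per_step: List[List[int]]) -> Dict[int, int]:
    """Count how often each large-VLM layer index is selected as a match,
    aggregated over the recorded training steps."""
    counts: Counter = Counter()
    for step_matches in matched_indices_per_step:
        counts.update(step_matches)
    return dict(sorted(counts.items()))


# Comparing histograms over early vs. late steps shows whether the small VLM
# shifts toward matching deeper layers of the large VLM as training progresses.
early = matched_index_histogram([[2, 5, 9], [1, 4, 8]])     # toy early-training logs
late = matched_index_histogram([[6, 12, 20], [7, 13, 22]])  # toy late-training logs
```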
Limitation: The large- and small-backbone VLMs must share the same tokenizer and token index order when constructing VLsI. We will explore more general approaches that accommodate different tokenizers and token index orders, potentially expanding VLsI's applicability and scalability.