VLsI: Verbalized Layers-to-Interactions
We introduce VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy.
VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs.
This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise progression with that of the large ones.
We validate
VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
Performance of VLsI on vision-language benchmarks.
(a) Accuracy on MM-Vet for various model sizes, showing that
VLsI (2B and 7B)
achieves competitive performance compared to proprietary closed-source VLMs.
(b) Comparative evaluation on multiple challenging benchmarks,
where
VLsI (green and blue) outperforms leading closed-source VLMs,
including GPT-4V, Claude-3.5-Sonnet, and Gemini-1.5-Pro, highlighting its efficiency and effectiveness across diverse tasks.
We propose VLsI: Verbalized Layers-to-Interactions,
a new VLM family that leverages an innovative, natural language-based distillation process to efficiently transfer knowledge from large to small VLMs.
Unlike traditional distillation methods, which often directly imitate outputs from a larger model,
VLsI introduces a layer-wise approach where each intermediate layer generates verbal responses in natural language space, enhancing interpretability and alignment with larger models.
This is achieved through a three-step process:
(1) the verbalization step, which uses "verbalizers" to project intermediate features into the language space,
making them interpretable as text-based responses;
(2) the interaction step, which performs adaptive layer matching to align the reasoning progression between large and small VLMs;
and (3) the reinforcement step, which finetunes the distilled VLMs for task-specific instruction-following responsiveness.
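To make the verbalization step concrete, here is a minimal NumPy sketch (not the paper's implementation) of a "verbalizer": a small head that projects an intermediate layer's hidden states into vocabulary space so they can be read as text, trained with an autoregressive (cross-entropy) loss against the target response. The class name `Verbalizer`, the dimensions, and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: hidden dimension 16, vocabulary of 32 tokens.
HIDDEN, VOCAB = 16, 32

class Verbalizer:
    """Projects intermediate-layer features into the (natural) language space."""
    def __init__(self):
        self.proj = rng.normal(scale=0.1, size=(HIDDEN, VOCAB))

    def logits(self, hidden_states):
        # (seq_len, HIDDEN) -> (seq_len, VOCAB): one token distribution per position.
        return hidden_states @ self.proj

def autoregressive_loss(logits, target_ids):
    """Cross-entropy between verbalized token distributions and target tokens."""
    probs = softmax(logits)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return -np.mean(np.log(picked + 1e-12))

# Toy intermediate features for one layer and a 5-token target response.
hidden = rng.normal(size=(5, HIDDEN))
target = rng.integers(0, VOCAB, size=5)

v = Verbalizer()
loss = autoregressive_loss(v.logits(hidden), target)
print(loss)
```

In a real VLM, the projection would typically reuse or mirror the model's language-model head so that intermediate features decode into fluent text rather than arbitrary logits.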
Overview of VLsI,
showing (a) the verbalization step and (b) the interaction step.
(a) In the verbalization step, intermediate layers in both the large- and small-backbone VLMs are equipped with a "verbalizer",
allowing their outputs to be projected into natural language space.
Autoregressive loss is applied to align these verbalized outputs with the target responses.
(b) In the interaction step, each intermediate layer in the small-backbone VLM searches for a matching layer in the large backbone VLM
within a specified range. For example, once the 2nd layer of the small VLM is matched with the 4th layer in the large VLM,
the next matching search for the 3rd layer in the small VLM will proceed from the 5th to the 7th layers of the large VLM, ensuring progressive alignment.
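The progressive matching described above can be sketched as a simple forward-only search: each small-VLM layer picks its most similar large-VLM layer within a fixed window that starts just after the previously matched layer. This is an illustrative sketch under assumed cosine similarity and per-layer summary features, not the paper's exact matching criterion.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def progressive_layer_matching(small_feats, large_feats, window=3):
    """For each small-VLM layer, choose the most similar large-VLM layer
    within `window` layers after the previously matched one, so matched
    indices move strictly forward (progressive alignment)."""
    matches, start = [], 0
    for s in small_feats:
        end = min(start + window, len(large_feats))
        candidates = range(start, end)
        best = max(candidates, key=lambda j: cosine(s, large_feats[j]))
        matches.append(best)
        start = best + 1  # the next search begins after the matched layer
    return matches

# Toy per-layer summary features: 4 small-VLM layers, 12 large-VLM layers, dim 8.
small = rng.normal(size=(4, 8))
large = rng.normal(size=(12, 8))
print(progressive_layer_matching(small, large))
```

With `window=3`, this reproduces the example above: if the 2nd small-VLM layer matches the 4th large-VLM layer, the 3rd small-VLM layer is only compared against large-VLM layers 5 through 7.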
We validate VLsI's effectiveness across ten challenging benchmarks,
demonstrating significant performance gains of 11.0% (2B model) and 17.4% (7B model) over GPT-4V.
Notably, these improvements are achieved without increasing model size, merging modules, or modifying the architecture.
Consequently, VLsI offers a practical and deployable solution for on-device applications in resource-constrained environments.
Furthermore,
VLsI is easy to implement and adaptable across different model architectures, yielding significant gains with both backbone families: a 19.7% improvement with Qwen2-VL (2B and 7B model sizes) and a 34.5% improvement with LLaVA-OV (0.5B and 7B model sizes) on challenging benchmarks such as MMB, MM-Vet, and MMMU.
Layer-wise comparison between the small-backbone VLM and VLsI. Using the verbalized outputs to trace each layer's interpretive progression, this comparison highlights how both models gradually enhance understanding across layers.
At the shallower layers, both models generate basic descriptions, focusing on large, simple shapes and colors.
However, as
VLsI progresses to mid-level layers,
it begins to recognize and articulate more complex visual structures, such as labeled shapes and their relative positions.
In contrast, the small-backbone VLM's verbal responses remain relatively vague or repetitive, often lacking specific relational details.
By the deeper layers,
VLsI demonstrates a clear advantage: its verbalizations shift towards identifying the correct pattern,
explicitly referring to shapes and colors in alignment with the target response: "star with a dot".
Meanwhile, the small-backbone VLM incorrectly predicts the missing image as a "diamond with a dot",
failing to capture the specific pattern. This example underscores the effectiveness of
VLsI's layer-wise verbalization,
where each stage of verbal responses helps the small-backbone VLM align with the larger one.
Comparison of verbalized responses from the small-backbone VLM (without VLsI enhancements) and VLsI. The visual question prompts the VLM to predict the missing image in a sequence pattern.
The outputs illustrate how each layer progressively interprets the visual cues,
with
VLsI accurately identifying the answer as 'a star with a dot' in the final layer,
while the alternative small-backbone VLM incorrectly predicts 'a diamond with a dot'.
This demonstrates the improved interpretative capability of
VLsI through layer-wise,
language-based distillation.
VLsI's results, and the purple ones represent Qwen2-VL-based results. This figure reveals a consistent trend: using large- and small-backbone VLMs of larger sizes enhances VLsI's performance across all configurations, demonstrating VLsI's applicability and scalability.