GenRecal: Generation after Recalibration
from Large to Small Vision Language Models


Summary: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge here is the diversity of VLM architectures: they are built on different LLMs and employ different token types, which vary in vocabulary size, token splits, and token index ordering. To address this challenge of being restricted to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.

Figure 1: (Left) Visualization of the token indices for a given image and text prompt, showing which VLM pair combinations permit distillation and comparing traditional distillation with our proposed framework, GenRecal. Note that the parentheses denote each VLM's LLM tokenizer, '...' indicates the placement of image features, and the number of these features varies depending on the image embedding strategy. (Right) Performance on a challenging evaluation benchmark, MM-Vet, for [A] the baseline, [B] SFT on the baseline, [C] traditional distillation and [D] GenRecal from large VLMs with the same token types, and GenRecal with a more powerful [E] teacher and [F] student VLM.
Why is GenRecal really needed? GenRecal advances vision-language model (VLM) distillation by addressing a key limitation of existing methods: incompatibility between teacher and student models with different token types. Traditional distillation approaches often fail when the models do not share the same vocabulary, token-splitting strategy, or token index ordering. GenRecal overcomes this by introducing a Recalibrator that aligns and adapts the feature representations between heterogeneous VLMs, enabling effective knowledge transfer even when their token types differ. Experimental results show that GenRecal not only outperforms traditional distillation under matching token types but also allows greater flexibility in choosing more powerful teacher models, leading to stronger student VLMs. This compatibility and performance boost make GenRecal a scalable, general-purpose VLM distillation framework across diverse architectures.
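To make the token-type mismatch concrete, the sketch below tokenizes the same prompt with two different LLM tokenizers. The checkpoint names are illustrative stand-ins, not the backbones used in the paper; any pair of Hugging Face tokenizers shows the same effect.

```python
# Illustrative sketch of the token-type mismatch that blocks traditional
# distillation. Checkpoint names are examples only; substitute the
# tokenizers behind any teacher/student VLM pair.
from transformers import AutoTokenizer

teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
student_tok = AutoTokenizer.from_pretrained("gpt2")

prompt = "Describe the image in detail."

# The same text is split into different subwords, mapped to different
# token indices, and scored over different-sized vocabularies, so
# teacher and student logits cannot be compared position-by-position.
print(teacher_tok.tokenize(prompt), teacher_tok(prompt)["input_ids"])
print(student_tok.tokenize(prompt), student_tok(prompt)["input_ids"])
print(len(teacher_tok), len(student_tok))  # different vocabulary sizes
```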

Figure 2: (Left) Comparison of performance on the challenging benchmarks MMB, MM-Vet, MMMU, and MMMU-Pro when changing the teacher vision-language model (VLM) used to distill knowledge into small VLMs. Notably, the more powerful the teacher VLM we select, the greater the performance improvement we achieve. (Right) Performance on the challenging MMMU benchmark for GenRecal and various vision-language models across model sizes.
Contribution:
  • A New Efficient VLM Family: We introduce an efficient VLM family, Generation after Recalibration (GenRecal), which consistently outperforms both open- and closed-source VLMs on challenging benchmarks.
  • Token Type-compatible Recalibration: GenRecal employs a Recalibrator to align and adapt the feature representations of large and small VLMs, enabling general-purpose distillation across token types that differ in vocabulary size, token splits, and token index ordering (see the sketch after this list).
  • Broad Applicability: GenRecal is compatible with a wide range of VLM architectures and model sizes, overcoming the constraint of selecting only large VLMs with matching token types and demonstrating its practicality for real-world deployment in resource-constrained settings.
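The section above does not spell out the Recalibrator's internals, so the following is a minimal sketch under the assumption that it is a small transformer that projects student features into the teacher's feature space, where the frozen teacher head can score them. Every dimension and module choice here is a hypothetical placeholder, not the paper's exact design.

```python
# Minimal Recalibrator sketch, assuming it maps small-VLM hidden states
# into the large VLM's feature space so the frozen teacher head can read
# them out. Dimensions, depth, and head count are placeholders, and
# causal masking is omitted for brevity.
import torch
import torch.nn as nn

class Recalibrator(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int, num_layers: int = 2):
        super().__init__()
        # Project student features up to the teacher's hidden size.
        self.in_proj = nn.Linear(student_dim, teacher_dim)
        # A small transformer adapts the projected features so they can
        # stand in for teacher features despite different token types.
        layer = nn.TransformerEncoderLayer(
            d_model=teacher_dim, nhead=8, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, student_hidden: torch.Tensor) -> torch.Tensor:
        # student_hidden: (batch, seq_len, student_dim)
        return self.blocks(self.in_proj(student_hidden))

# Usage: recalibrated features are scored by the frozen teacher head.
recal = Recalibrator(student_dim=2048, teacher_dim=4096)
feats = recal(torch.randn(1, 16, 2048))             # -> (1, 16, 4096)
teacher_head = nn.Linear(4096, 152064, bias=False)  # placeholder LM head
logits = teacher_head(feats)                        # teacher-vocab logits
```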

Figure 3: We explore the range of distillation combinations between teacher and student VLMs using two approaches: (a) traditional distillation and (b) our proposed model, GenRecal. Unlike traditional distillation, which supports only a limited set of pairings, GenRecal offers the flexibility to select any model for distillation, thereby enabling a more versatile and comprehensive distillation framework.
Training Overview: The training of GenRecal follows a three-stage process designed to enable general-purpose distillation across heterogeneous vision-language models (VLMs). In the first stage, the Recalibrator is trained while all parameters of both the large and small VLMs are frozen. During this phase, the model minimizes an autoregressive loss on ground-truth answers and the KL divergence between the recalibrated logits and the original logits from the large VLM, thereby aligning feature representations across models. A key regularization strategy is applied here to prevent the features from deviating from those of the large VLM, which is crucial for effective distillation. In the second stage, distillation is further refined by jointly training the Recalibrator and the small VLM's body with both the original losses and an additional supervised fine-tuning (SFT) loss, helping the small VLM better learn the shared feature space. Finally, in the third stage, all components except the vision encoder are fine-tuned with continued SFT to enhance instruction-following capabilities and solidify the knowledge transferred from the large VLM.
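As a rough illustration of the stage-1 objective just described, the sketch below combines an autoregressive cross-entropy loss on the ground-truth answer with a KL term tying the recalibrated logits to the large VLM's original logits. The loss weighting, masking, and any additional regularization in GenRecal may differ; `beta` is a placeholder.

```python
# Sketch of the stage-1 objective: autoregressive cross-entropy on the
# ground-truth answer plus KL divergence between the recalibrated logits
# and the frozen large VLM's logits. Shift-by-one alignment of logits
# and labels is omitted for brevity.
import torch.nn.functional as F

def stage1_loss(recal_logits, teacher_logits, answer_ids, beta=1.0):
    """recal_logits / teacher_logits: (batch, seq, teacher_vocab);
    answer_ids: (batch, seq), non-answer positions set to -100."""
    # Autoregressive loss against the ground-truth answer tokens.
    ce = F.cross_entropy(
        recal_logits.transpose(1, 2), answer_ids, ignore_index=-100
    )
    # KL(teacher || recalibrated): keeps the recalibrated outputs from
    # deviating from the large VLM's distribution.
    kl = F.kl_div(
        F.log_softmax(recal_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce + beta * kl
```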

Figure 4: Overview of the GenRecal architecture and its training stages. We let q_s and a_s denote the small VLM's embedded tokens (i.e., image and text tokens together) for the question and answer in the visual instruction tuning dataset; likewise, q_l and a_l denote the large VLM's embedded tokens. Note that vision-related modules, such as the vision encoder and projector for image tokens, are omitted in this figure.
Experiments:
  • Effectiveness Across Model Sizes: As described in the manuscript, GenRecal consistently outperforms the baseline and smaller-scale models, demonstrating that it generalizes across student VLM sizes. Using a stronger student VLM such as InternVL2.5-8B notably enhances distillation performance, whereas smaller student VLMs tend to underperform even when the teacher VLM remains the same.
  • Impact of Teacher VLM Strength: Experimental results confirm that stronger teacher VLMs lead to better performance in the student models. This trend holds across various teacher-student combinations, suggesting that the strength of the teacher VLM is a crucial factor in successful knowledge distillation.
  • Role of Recalibrator in Feature Alignment: Visualizations show that the Recalibrator progressively aligns the feature representations of student VLMs with those of teacher VLMs over the course of training. Loss comparisons further confirm that the Recalibrator builds the shared feature representations necessary for general-purpose distillation.

Figure 5: An overview of our training pipeline, illustrating both the question prompt and the measurement/legend annotations (top), followed by t-SNE visualizations (bottom) of teacher and student VLM pairings at the initial and final training stages. The question prompt (upper-left) shows the format of the question, while the measurement and legend box (upper-right) identifies the key model components being measured. Each scatter plot in the lower panels corresponds to a different combination of teacher and student VLM sizes, capturing how the learned representations evolve from early to late training iterations.
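A visualization in the spirit of Figure 5 can be reproduced with off-the-shelf t-SNE. The sketch below assumes teacher and recalibrated student features have already been extracted as arrays; all variable names and shapes are hypothetical placeholders.

```python
# Sketch of a t-SNE feature-alignment plot like Figure 5. Assumes
# teacher_feats and student_feats are (num_tokens, hidden_dim) arrays
# collected from the two models on the same prompts.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

teacher_feats = np.random.randn(500, 4096)  # placeholder features
student_feats = np.random.randn(500, 4096)  # placeholder features

# Embed both feature sets jointly so distances are comparable.
joint = np.concatenate([teacher_feats, student_feats], axis=0)
xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(joint)

n = len(teacher_feats)
plt.scatter(xy[:n, 0], xy[:n, 1], s=5, label="teacher VLM")
plt.scatter(xy[n:, 0], xy[n:, 1], s=5, label="student VLM (recalibrated)")
plt.legend()
plt.title("t-SNE of teacher vs. student features")
plt.show()
```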
Limitation: While GenRecal demonstrates significant performance improvements over traditional distillation methods, even when teacher and student VLMs share the same token types, it still relies on the large VLM's VLM-head for knowledge transfer. This dependency may limit flexibility when access to the full large model is restricted, such as in partially open-source or API-based settings. Moreover, the current GenRecal framework focuses on distilling final-layer features, potentially missing finer-grained knowledge captured in intermediate layers. Although it supports general-purpose distillation across token types, further work is needed to extend its capabilities to multi-source teacher scenarios and sequential knowledge alignment for richer, hierarchical distillation.