Verbalization Example:
Figure 3 illustrates the verbal responses generated at each intermediate layer of the small-backbone VLM
and
VLsI. By using the verbalized outputs to trace each layer's interpretive progression,
this comparison highlights how each model's understanding develops across layers.
At the shallower layers, both models generate basic descriptions, focusing on large, simple shapes and colors.
However, as
VLsI progresses to mid-level layers,
it begins to recognize and articulate more complex visual structures, such as labeled shapes and their relative positions.
In contrast, the small-backbone VLM's verbal responses remain relatively vague or repetitive, often lacking specific relational details.
By the deeper layers,
VLsI demonstrates a clear advantage: its verbalizations shift towards identifying the correct pattern,
explicitly referring to shapes and colors in alignment with the target response: "star with a dot".
Meanwhile, the small-backbone VLM incorrectly predicts the missing image as a "diamond with a dot",
failing to capture the specific pattern. This example underscores the effectiveness of
VLsI's layer-wise verbalization,
in which the verbal responses at each stage help the small-backbone VLM align with the larger one.
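The layer-wise tracing described above can be sketched as a logit-lens-style probe: each intermediate layer's hidden states are projected through the model's shared LM head and greedily decoded, yielding one "verbalization" per layer. The sketch below is a minimal toy version with random weights; all names (`layer_weights`, `lm_head`, `verbalize_per_layer`) are illustrative assumptions, not the actual VLsI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, LAYERS, SEQ = 32, 16, 4, 5

# Toy stand-ins for a VLM language backbone: token embeddings, a stack of
# per-layer weights, and a shared LM head (vocabulary projection).
embeddings = rng.normal(0, 0.1, (VOCAB, DIM))
layer_weights = [rng.normal(0, 0.1, (DIM, DIM)) for _ in range(LAYERS)]
lm_head = rng.normal(0, 0.1, (DIM, VOCAB))

def verbalize_per_layer(token_ids):
    """Project each intermediate layer's hidden states through the shared
    LM head, returning one greedy token trace per layer."""
    h = embeddings[token_ids]
    traces = []
    for w in layer_weights:
        h = np.tanh(h @ w)       # stand-in for a transformer block
        logits = h @ lm_head     # decode this layer's states to vocabulary
        traces.append(logits.argmax(axis=-1))
    return traces

tokens = rng.integers(0, VOCAB, size=SEQ)
traces = verbalize_per_layer(tokens)
print(len(traces), traces[0].shape)  # → 4 (5,)
```

Comparing `traces[i]` across layers is the analogue of reading Figure 3 column by column: shallow layers yield coarse token sequences, deeper layers converge toward the target response.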