Test-time Adaptation for Cross-modal Retrieval with Query Shift

Haobin Li, Peng Hu, Qianjun Zhang, Xi Peng, Xiting Liu, Mouxing Yang
ICLR 2025 (Spotlight)
Figure 1

Figure 1: (a) Dominant Paradigm: pre-trained models possess powerful zero-shot retrieval capability and can be fine-tuned on domain-specific data for customization, which has emerged as the dominant paradigm for cross-modal retrieval. (b) Query Shift: the performance of this paradigm degrades significantly when encountering the query shift problem. On the one hand, collecting sufficient data to tailor pre-trained models for scarce domains is daunting or even impossible. On the other hand, as the saying goes, "Different strokes for different folks": even fine-tuned models cannot accommodate all personalized domains. (c) Observations: we study the query shift problem for cross-modal retrieval and reveal the following observations. Namely, query shift not only diminishes the uniformity of the query modality but also amplifies the modality gap between the query and gallery modalities, undermining the well-structured common space inherited from pre-trained models.

Abstract

The success of most existing cross-modal retrieval methods heavily relies on the assumption that the given queries follow the same distribution as the source domain. However, such an assumption is easily violated in real-world scenarios due to the complexity and diversity of queries, thus leading to the query shift problem. Specifically, query shift refers to the online query stream originating from a domain that follows a distribution different from the source one. In this paper, we observe that query shift not only diminishes the uniformity (namely, within-modality scatter) of the query modality but also amplifies the gap between the query and gallery modalities. Based on these observations, we propose a novel method dubbed Test-time adaptation for Cross-modal Retrieval (TCR). In brief, TCR employs a novel module to refine the query predictions (namely, the retrieval results of the query) and a joint objective to prevent query shift from disturbing the common space, thus achieving online adaptation of cross-modal retrieval models under query shift. Extensive experiments demonstrate the effectiveness of the proposed TCR against query shift.

Observation

Figure obs
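
As shown above (and in Figure 1), query shift diminishes the within-modality scatter of the query embeddings and amplifies the query-gallery modality gap. Both quantities can be probed directly on the embeddings. Below is a minimal sketch in PyTorch, assuming the common log-Gaussian-potential formulation for uniformity and the centroid distance for the modality gap; the exact definitions used in the paper may differ.

import torch
import torch.nn.functional as F

def uniformity(feats, t=2.0):
    # Within-modality scatter: log of the mean pairwise Gaussian potential
    # over L2-normalized embeddings (more negative = more uniform).
    feats = F.normalize(feats, dim=-1)
    sq_dists = torch.pdist(feats, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

def modality_gap(query_feats, gallery_feats):
    # Gap between the two modalities, measured as the distance
    # between their embedding centroids on the unit sphere.
    q_center = F.normalize(query_feats, dim=-1).mean(dim=0)
    g_center = F.normalize(gallery_feats, dim=-1).mean(dim=0)
    return (q_center - g_center).norm(p=2)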

Method

Figure 2

Overview of the proposed TCR. For the given online queries, modality-specific encoders are employed to project the query and gallery samples into the latent space established by the source model. The obtained query and gallery embeddings are passed into the query prediction refinement module. In this module, TCR first selects the most similar gallery sample for each query to obtain query-gallery pairs. After that, the pairs with higher uniformity and lower modality gap are chosen to estimate the filtering threshold for query predictions and the modality gap of the source model, which serve as constraints for the adaptation. Finally, three loss functions are employed to achieve robust adaptation for cross-modal retrieval with query shift.
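
To make the pipeline more concrete, the following is a minimal sketch of one online adaptation step, using entropy minimization on the filtered query predictions together with a uniformity term and a gap-alignment term as stand-ins for the three objectives. The names query_encoder, gallery_feats, source_gap, and the threshold tau are assumptions for illustration; the official TCR objectives, threshold estimation, and refinement rule may differ.

import torch
import torch.nn.functional as F

def tcr_style_step(query_encoder, optimizer, query_batch, gallery_feats,
                   source_gap, tau=0.5, temp=0.01):
    # One illustrative test-time adaptation step (not the official TCR code).
    q = F.normalize(query_encoder(query_batch), dim=-1)   # online query embeddings
    g = F.normalize(gallery_feats, dim=-1)                # fixed gallery embeddings

    sims = q @ g.t()                                      # query-gallery similarities
    preds = F.softmax(sims / temp, dim=-1)                # soft retrieval predictions

    # Keep only confident query-gallery pairs (prediction filtering).
    conf = preds.max(dim=-1).values
    keep = conf > tau
    if keep.sum() == 0:                                   # fallback: keep the most confident query
        keep = conf >= conf.max()

    # (1) Sharpen the retrieval predictions of the kept queries.
    loss_ent = -(preds[keep] * preds[keep].clamp_min(1e-8).log()).sum(dim=-1).mean()

    # (2) Encourage within-modality scatter (uniformity) of the query embeddings.
    loss_unif = torch.pdist(q, p=2).pow(2).mul(-2.0).exp().mean().log()

    # (3) Keep the query-gallery gap close to that of the source model.
    gap = (q.mean(dim=0) - g.mean(dim=0)).norm(p=2)
    loss_gap = (gap - source_gap).abs()

    loss = loss_ent + loss_unif + loss_gap
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()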


Datasets

To investigate the influence of query shift on cross-modal retrieval, we employ the following two settings for extensive evaluations:

Figure benchmark

Notably, we introduce corruptions only to the query modality in the QS setting, e.g., for image-to-text retrieval, the distribution shift occurs on the image modality. Examples of the 16 image corruptions and 15 text corruptions are visualized in the following figures.
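
For illustration, the sketch below corrupts only the query side of an image-to-text batch; gaussian_noise stands in for any of the 16 image corruptions, and its severity levels are assumed values rather than the benchmark's exact settings.

import torch

def gaussian_noise(images, severity=3):
    # Stand-in for one of the image corruptions (assumed signature and severities).
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)

def build_qs_batch(images, captions, severity=3):
    # Image-to-text retrieval under query shift: corrupt only the query (image)
    # modality and leave the gallery (text) modality untouched.
    return gaussian_noise(images, severity), captions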


Comparison

Figure exp1

Visualization Result

Here are some visualization results. The proposed method not only enlarges intra-modality uniformity but also reduces the modality gap, thereby enhancing cross-modal retrieval performance.

Figure Visualization

Real-world Examples

We also conducted experiments on personalized queries, as mentioned in the motivation. Specifically, we perform TTA for different user needs in the e-commerce domain. The results demonstrate that our approach achieves consistent performance improvements in both Image-to-Text and Text-to-Image retrieval.

Figure real

Poster

BibTeX

@inproceedings{li2025test,
  title={Test-time Adaptation for Cross-modal Retrieval with Query Shift},
  author={Li, Haobin and Hu, Peng and Zhang, Qianjun and Peng, Xi and Liu, Xiting and Yang, Mouxing},
  booktitle={International Conference on Learning Representations},
  year={2025}
}