I would like to talk about trivial applications whose strongly subjective evaluation criteria turn them into non-trivial problems. My team recently started working on a project that builds a search engine over a set of house pictures: the user enters a query (like “kitchen with blue walls”) and the page returns a series of pictures that fit that phrase. One might think this is not much of a challenge, as the engineering part only requires:

  • Turning the pictures into vector representations, using multimodal embedding models such as CLIP, SigLIP or BLIP.
  • Ingesting these vectors into a vector database, where you can look up the most similar image vectors given the embedding of the search query. As the embedding model is multimodal, the encodings of pictures and texts are compatible.

My first iteration was quick: I created a project in Visual Studio (with the help of Copilot), built a vector index with FAISS, and encoded the pictures with an OpenCLIP model (I used a medium-sized one for accuracy’s sake, specifically the 1024-dimensional “xlm-roberta-large-ViT-H-14”). By the way, I also tested SigLIP 2 afterwards and liked it more. For the front end I created a Streamlit application. So far everything was nice: the application was working and the search engine worked well, but …
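To make that first iteration concrete, here is a minimal sketch of the indexing and search loop (the pretrained tag and file names below are illustrative; adapt them to whichever OpenCLIP checkpoint you actually use):

import faiss
import open_clip
import torch
from PIL import Image

MODEL_NAME = "xlm-roberta-large-ViT-H-14"  # 1024-dimensional embeddings
model, _, preprocess = open_clip.create_model_and_transforms(
    MODEL_NAME, pretrained="frozen_laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer(MODEL_NAME)
model.eval()

@torch.no_grad()
def embed_images(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

@torch.no_grad()
def embed_text(query):
    feats = model.encode_text(tokenizer([query]))
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

# Build the index once; cosine similarity via inner product on normalized vectors.
image_paths = ["house_001.jpg", "house_002.jpg"]  # your picture collection
index = faiss.IndexFlatIP(1024)
index.add(embed_images(image_paths))

# Search: encode the query text and retrieve the closest pictures.
scores, ids = index.search(embed_text("kitchen with blue walls"), 10)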

But for this application, grading the search results is quite subjective. When I search for “A dining room with red walls and three cozy sofas”, what if no picture matches all the features? What if there are only pictures with red walls but no sofas, or vice versa? How can I rank these results to meet the end user’s preferences? Well, this issue is solved with another model called a “reranker”.

Reranker

Reranker models are solved quite well for the single-modal (text) domain, as you can see in the Sentence Transformers docs; for the multimodal case there are also approaches, but our problem demanded a custom reranking model. So we started down two paths to solve the question:

  1. Fine-tune the original CLIP model
  2. Create a reranker model on top of the CLIP model

The second solution is preferable to the first, as we can store the picture embeddings in their original form and transform them into new representations later (whether for this project or a future one). In both cases the result is a modified vector representation.

Human labor (for labeling) is always scarce

If you don’t have enough people (or money), use LLMs to create synthetic datasets, but remember to devote at least some human quality time to building the validation set (as a friend of mine would say, do not leave your model in the hands of endogamy).

The first step was to create a “human feedback” dataset. As we were in the PoC phase of the project, we decided to use an LLM as a tutor model to simulate the human feedback. We used GPT-4o to bootstrap a set of 12k queries like “bedroom with a red carpet”, and then we used the original vector index to retrieve several candidates per query for another LLM prompt that evaluated how well each picture fit the query. We followed a strategy of selecting the 10 most similar pictures for each query (discarding duplicates using a perceptual hash) plus another 10 pictures taken at random, in order to also have hard negatives. The result was a feedback dataset in which each row had a text query, a picture and a grade from 0 to 100 indicating how well they matched.
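For reference, the candidate-selection step looked roughly like this, reusing embed_text and the FAISS index from the earlier sketch (grade_with_llm is a hypothetical stand-in for the GPT-4o grading prompt, and the helper names are mine):

import random
import imagehash
from PIL import Image

def candidates_for(query, index, image_paths, k=10):
    # Over-fetch from the index, then drop near-duplicates via perceptual hash.
    _, ids = index.search(embed_text(query), 3 * k)
    seen, top = set(), []
    for i in ids[0]:
        h = str(imagehash.phash(Image.open(image_paths[i])))
        if h not in seen:
            seen.add(h)
            top.append(image_paths[i])
        if len(top) == k:
            break
    # Add k random pictures as negatives.
    return top + random.sample(image_paths, k)

# One dataset row per (query, picture) pair, graded 0-100 by the LLM:
# rows = [(q, p, grade_with_llm(q, p)) for q in queries for p in candidates_for(q, index, image_paths)]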

Based on this dataset we fine-tuned the model using the following loss functions:

  1. Cross entropy (XE), which is the original loss function of the CLIP model; roughly speaking, it minimizes the discrepancy between the text and picture embedding representations.
  2. InfoNCE, a noise-contrastive loss, not as popular as XE but also common in contrastive learning problems.
  3. A DPO-like fine-tune, with a loss function I created myself. The model is trained so that, given a pair of pictures and a search query, it prefers the same picture (over the other one) that the human did. During training, the loss corrects the weights when the preferences don’t match, using a regularization term to keep the model’s output distribution as close as possible to the original CLIP’s. For simplicity’s sake I move all the technicalities to another post: Application of DPO to multimodal classification. A rough sketch of the first and third losses follows this list.
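For concreteness, here are sketches of the first and third losses. The symmetric cross-entropy is the standard CLIP objective; the DPO-style loss is only a rough approximation of the one detailed in the other post (the variable names are mine):

import torch
import torch.nn.functional as F

def clip_symmetric_xe(image_emb, text_emb, temperature=0.07):
    # Standard CLIP loss: each image should match its own text and vice versa.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def dpo_style_loss(sim_win, sim_lose, ref_sim_win, ref_sim_lose, beta=0.1):
    # Given a query and a (preferred, rejected) picture pair, widen the
    # winner/loser similarity margin relative to the frozen reference CLIP;
    # the reference term is what keeps the model close to the original distribution.
    margin = (sim_win - sim_lose) - (ref_sim_win - ref_sim_lose)
    return -F.logsigmoid(beta * margin).mean()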

The DPO-like approach was inspired by the latest alignment methods for LLMs. It worked well in terms of replicating the shape of our distribution function, but when tested on an actual “human evaluation set”, composed of 300 picture-text query instances graded by myself, the performance of cross entropy was significantly better.

Well, at this point I had a model able to generate an improved embedding representation for my problem, and the search engine based on it did indeed work better than the original OpenCLIP one. So, if I know a better representation, why not learn a model able to replicate the transformation from the original CLIP vector to this better-performing one?

Alignment is a very subjective matter

All the technical bits aside, please remember that alignment is a very subjective matter. To do it properly you should take the users’ preferences into consideration (which complicates the solution quite a lot).

Final optimization

For the new model I created a 3-layer MLP (32-bit weights) that was able to perform such a transformation without losing much precision. In terms of model size, my new model was 40 MB instead of the 4 GB of the original OpenCLIP model. So, mission accomplished? Not yet. My second step, once I had validated the feasibility of the method, was to check whether an 8-bit quantized version of the MLP could also work well. The answer was yes: after a four-fold reduction in model size (to 10 MB) the output vectors didn’t degrade the overall performance of the search engine. That was great news, as I could apply this small-footprint model dynamically to each query to improve the original model (a quantization sketch follows the architecture code below).

I can give you some technical details on the architecture, which by the way ended up being quite simple (after a bunch of tries, though). It is a three-layer MLP that operates on the concatenated embeddings, with the following architecture:

  • Input: concatenation of image and query embeddings.
  • Loss function: symmetric cosine.

Layers:

Layer     | Dimensionality | Activation
Linear 1  | 2048 → 2048    | GELU
Linear 2  | 2048 → 2048    | GELU
Linear 3  | 2048 → 2048    | GELU + Dropout 0.1
LayerNorm | 2048           | —
In torch code:

import torch
import torch.nn as nn

class EmbeddingRefiner(nn.Module):
    """Three-layer MLP that refines the concatenated (image, query) CLIP embeddings."""

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        D = feat_dim * 2  # concatenated image + query embeddings

        self.layer1 = nn.Sequential(
            nn.Linear(D, D),
            nn.GELU(),
        )
        self.layer2 = nn.Sequential(
            nn.Linear(D, D),
            nn.GELU(),
        )
        self.out = nn.Sequential(
            nn.Linear(D, D),
            nn.GELU(),        # non-linearity
            nn.Dropout(0.1),  # regularisation, as in the table above
        )
        self.norm = nn.LayerNorm(D)  # final stabiliser

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.out(x)
        return self.norm(x)
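As for the 8-bit version mentioned earlier, one simple way to get there (a representative sketch, not necessarily the exact procedure we used) is PyTorch’s dynamic quantization of the Linear layers:

import torch

refiner = EmbeddingRefiner(feat_dim=1024)
# Quantize only the Linear layers to int8; activations stay in float.
refiner_int8 = torch.quantization.quantize_dynamic(
    refiner, {torch.nn.Linear}, dtype=torch.qint8
)

# At query time: concatenate the original CLIP image and query embeddings
# and refine them with the small model (placeholder tensors below).
image_emb = torch.randn(1, 1024)
query_emb = torch.randn(1, 1024)
refined = refiner_int8(torch.cat([image_emb, query_emb], dim=-1))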