A new training approach for multimodal embedding models has shown dramatic improvements in document retrieval tasks. Researchers at Hugging Face demonstrated this by fine-tuning a vision-language model for Visual Document Retrieval (VDR), where the goal is to match text queries with relevant document images containing charts, tables, and layouts.
The team used the Qwen/Qwen3-VL-Embedding-2B model, a general-purpose multimodal embedding system trained on diverse data. While such models perform well across many tasks, they are not optimized for specific applications. Fine-tuning on domain-specific data significantly improves performance. In this case, the fine-tuned model achieved an NDCG@10 score of 0.947, compared to 0.888 for the base model. It also outperformed larger models up to four times its size in tests.
The training process involved several key components. The SentenceTransformerTrainer framework was used, which supports both text-only and multimodal models. The main difference is that multimodal datasets include images or other modalities alongside text. The processor automatically handles image preprocessing, simplifying the workflow.
Researchers fine-tuned the model using a Visual Document Retrieval dataset, where text queries like "What was the company's Q3 revenue?" were matched against document screenshots. The training pipeline included a loss function to guide optimization and an evaluator to track performance. The fine-tuned model, named tomaarsen/Qwen3-VL-Embedding-2B-vdr, is now available on GitHub for public use.
This method highlights the importance of domain-specific fine-tuning for specialized tasks. While general-purpose models are versatile, they often fall short in precision compared to models tailored to specific use cases.
Source: huggingface.co