A team of researchers from Apple has outlined a new approach to improve how AI models learn from mixed data sources. Their work, accepted for presentation at the Workshop on Navigating and Addressing Data Problems for Foundation Models at ICLR 2026, focuses on data-mixture optimization for multimodal pretraining.
Current methods often adjust training mixtures based on limited criteria such as data format or task type. This narrow focus can limit the model’s ability to generalize across different domains. The researchers argue that more principled domain reweighting could significantly enhance sample efficiency and downstream performance.
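To make the idea of domain reweighting concrete, here is a minimal Python sketch of how a set of per-source weights determines how often each source is sampled during training. It is not MixAtlas's method, which this article does not describe in detail. The domain names, weight values, and the helper functions normalize_weights and sample_domains are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def normalize_weights(raw_weights):
    """Scale non-negative per-domain weights so they sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one domain weight must be positive")
    return {name: w / total for name, w in raw_weights.items()}


def sample_domains(weights, num_samples, seed=0):
    """Pick the source domain of each training example according to the mixture."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_samples)


# Hypothetical domains and raw weights (assumptions, not taken from the paper).
raw = {"image_caption": 3.0, "interleaved_web": 2.0, "ocr_documents": 1.0}
mixture = normalize_weights(raw)

# Count how many of 6000 simulated draws come from each domain.
counts = Counter(sample_domains(mixture, num_samples=6000))
```

Raising one source's weight makes it appear more often in each training batch. Mixture optimization is the search for weights that give the best downstream results for a fixed training budget.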
The proposed framework, called MixAtlas, introduces a structured way to select and combine data sources. Unlike existing approaches that rely on single-perspective tuning, MixAtlas evaluates multiple factors simultaneously. This compute-efficient method aims to balance the contribution of each data source during pretraining.
The paper argues that mixture design deserves more attention in multimodal training. Existing recipes, tuned along a single dimension such as data format or task type, often fail to account for the complexity of real-world data mixtures. MixAtlas is intended to close this gap.
The research will be discussed at the upcoming ICLR 2026 workshop in Rio de Janeiro. The authors present it as a step toward more sample-efficient multimodal pretraining.
Source: machinelearning.apple.com