MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
Mar 2, 2026
Bingbing Wen
Sirajul Salekin
Feiyang Kang
Bill Howe
Lucy Lu Wang
Javier Movellan
Manjot Bilkhu

Abstract
Principled domain reweighting can substantially improve sample efficiency and downstream generalization; however, data-mixture optimization for multimodal pretraining remains underexplored. We introduce MixAtlas, a principled framework for compute-efficient multimodal mixture optimization that uses systematic domain decomposition and smaller proxy models, factorizing training data along two axes: image concepts and task supervision.
Type
Publication
ICLR 2026 Workshop DATA-FM
MixAtlas factorizes training data along two axes, image concepts and task supervision, enabling interpretable mixture control. Using small proxy models and a Gaussian-process surrogate, we explore the mixture space at roughly 1/100th the cost of full-scale training, yielding up to 3× faster convergence and consistent gains of 2–5% across diverse benchmarks.
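The surrogate-driven search described above can be illustrated with a minimal sketch: evaluate a handful of candidate mixtures with cheap proxy runs, fit a Gaussian process mapping mixture weights to proxy loss, and select the mixture the surrogate predicts to be best. The `proxy_loss` function, domain count, and all variable names below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a Gaussian-process surrogate search over data-mixture
# weights, in the spirit of MixAtlas. The toy proxy_loss stands in for a
# small proxy model's validation loss; it is an assumption for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
n_domains = 4  # e.g. cells of an (image concept) x (task supervision) grid

def sample_simplex(n, k, rng):
    """Draw n mixture-weight vectors uniformly from the k-simplex."""
    return rng.dirichlet(np.ones(k), size=n)

def proxy_loss(w):
    """Stand-in for a proxy model's validation loss at mixture w (assumed)."""
    target = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical 'good' mixture
    return np.sum((w - target) ** 2, axis=-1)

# 1) Evaluate a small budget of mixtures with cheap proxy runs.
W_train = sample_simplex(32, n_domains, rng)
y_train = proxy_loss(W_train)

# 2) Fit a GP surrogate mapping mixture weights -> proxy loss.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(W_train, y_train)

# 3) Score many candidate mixtures and pick the lowest predicted loss.
W_cand = sample_simplex(2000, n_domains, rng)
mu = gp.predict(W_cand)
best = W_cand[np.argmin(mu)]
print("selected mixture:", np.round(best, 3))
```

In practice the selected mixture would then be validated with a slightly larger proxy run before committing full-scale compute; a proper acquisition function (e.g. expected improvement) could replace the posterior-mean argmin used here for simplicity.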