MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

Mar 2, 2026·
Bingbing Wen
, Sirajul Salekin, Feiyang Kang, Bill Howe, Lucy Lu Wang, Javier Movellan, Manjot Bilkhu · 1 min read
Abstract
Principled domain reweighting can substantially improve sample efficiency and downstream generalization; however, data-mixture optimization for multimodal pretraining remains underexplored. We introduce MixAtlas, a principled framework for compute-efficient multimodal mixture optimization via systematic domain decomposition and smaller proxy models, factorizing training data along image concepts and task supervision axes.
Type
Publication
ICLR 2026 Workshop DATA-FM

We introduce MixAtlas, a principled framework for compute-efficient multimodal mixture optimization via systematic domain decomposition and smaller proxy models. MixAtlas factorizes training data along two interpretable axes, image concepts and task supervision, enabling fine-grained mixture control. Using small proxy models and a Gaussian-process surrogate, we explore the mixture space at ~1/100th the cost of full-scale training, yielding up to 3× faster convergence and consistent gains of 2–5% across diverse benchmarks.
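The core loop described above, training small proxy models at sampled mixture weights and fitting a Gaussian-process surrogate over the mixture simplex to propose better mixtures, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `proxy_loss` is a hypothetical stand-in for an actual proxy-model training run, the domain count, kernel, and candidate-sampling scheme are all assumptions, and a real setup would use a full Bayesian-optimization acquisition rule rather than the plain posterior mean shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, length_scale=0.3):
    # Squared-exponential kernel between rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale**2))

def gp_posterior_mean(X, y, Xq, noise=1e-4):
    # Posterior mean of a zero-mean GP with RBF kernel, queried at Xq.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    return rbf_kernel(Xq, X) @ np.linalg.solve(K, y)

def proxy_loss(w):
    # HYPOTHETICAL stand-in for training a small proxy model at mixture w:
    # a smooth loss minimized at some unknown-to-the-optimizer mixture.
    w_star = np.array([0.5, 0.3, 0.2])
    return float(((w - w_star) ** 2).sum())

# Evaluate a handful of mixtures (points on the 3-domain simplex) with
# cheap proxy runs; these are the surrogate's training data.
X = rng.dirichlet(np.ones(3), size=8)
y = np.array([proxy_loss(w) for w in X])

# Score many candidate mixtures with the surrogate instead of real
# training runs, and keep the one with the lowest predicted loss.
cand = rng.dirichlet(np.ones(3), size=2000)
mu = gp_posterior_mean(X, y, cand)
best = cand[mu.argmin()]
```

The surrogate replaces thousands of full training runs with cheap kernel-regression queries, which is where the roughly 1/100th-cost exploration comes from; only the final, surrogate-selected mixture needs a full-scale run.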