How to train LLMs with small batch sizes?
Here, I show you how to train large language models with small batch sizes,
something the broader ML community tends to regard as unnatural. The key lies
in adjusting the β values of your Adam optimizer as the batch size changes,
specifically β2, the decay rate of the exponential moving average used for the
second-moment preconditioner. As it turns out, decreasing the batch size buys
you robustness to hyperparameter misspecification, and (obviously) reduced
memory consumption. Accepted at NeurIPS 2025.
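To make the kind of β2 adjustment concrete, here is a minimal Python sketch of one common heuristic: rescale β2 so that the half-life of its EMA, measured in tokens rather than optimizer steps, stays constant as the batch size changes. This is an illustrative assumption, not necessarily the exact rule from the paper, and the names `rescale_beta`, `beta2_ref`, `batch_ref`, and the reference values are hypothetical.

```python
def rescale_beta(beta_ref: float, batch_ref: int, batch_new: int) -> float:
    """Rescale an Adam EMA coefficient so that its half-life, measured in
    tokens (steps * batch size), is preserved when the batch size changes.

    The half-life in steps is t = ln(0.5) / ln(beta), so the half-life in
    tokens is t * batch. Keeping the token half-life fixed gives
        beta_new = beta_ref ** (batch_new / batch_ref).
    """
    return beta_ref ** (batch_new / batch_ref)

# Hypothetical example: a recipe tuned at batch size 512 with beta2 = 0.999,
# moved to a small batch size of 32.
beta2_ref, batch_ref = 0.999, 512
batch_new = 32
beta2_new = rescale_beta(beta2_ref, batch_ref, batch_new)
print(f"beta2 at batch {batch_new}: {beta2_new:.6f}")  # ~0.999938
```

Note that under this heuristic a smaller batch size pushes β2 closer to 1, so the preconditioner averages over more optimizer steps while covering the same number of tokens.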