Aditya Somasundaram

as7458 [at] columbia [dot] edu

Blog


How to train LLMs with small batch sizes

Here, I show you how to train large language models with small batch sizes, something widely considered unnatural in the ML community. The key lies in how you select Adam's β values as the batch size changes, specifically β2, the decay rate of the exponential moving average (EMA) that estimates the second moment used for preconditioning. As it turns out, as you decrease the batch size, you gain robustness to hyperparameter misspecification, along with (obviously) reduced memory consumption. Accepted at NeurIPS 2025.
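
To make the β2 adjustment concrete, here is a minimal sketch of one natural scaling rule: hold the half-life of the second-moment EMA constant when measured in tokens rather than optimizer steps. The function name `rescale_beta2` and the reference values (β2 = 0.999 tuned at batch size 512) are my own illustration, not taken from the post; see the paper for the exact prescription.

```python
def rescale_beta2(beta2_ref: float, batch_ref: int, batch_new: int) -> float:
    """Pick beta2 for a new batch size so that Adam's second-moment EMA
    averages over (roughly) the same number of tokens, not optimizer steps.

    Sketch of one plausible rule, not necessarily the paper's exact one:
    the EMA half-life in steps is ln(1/2) / ln(beta2); multiplying by the
    batch size gives a half-life in tokens. Setting the token half-lives
    equal before and after the change and solving gives
        beta2_new = beta2_ref ** (batch_new / batch_ref).
    Smaller batches therefore push beta2 closer to 1 (slower per-step
    decay, same effective token window).
    """
    assert 0.0 < beta2_ref < 1.0
    return beta2_ref ** (batch_new / batch_ref)


if __name__ == "__main__":
    # Hypothetical reference point: beta2 = 0.999 tuned at batch size 512.
    beta2_ref, batch_ref = 0.999, 512
    for batch_new in (512, 64, 8, 1):
        print(batch_new, rescale_beta2(beta2_ref, batch_ref, batch_new))
```

Using the result is then a one-line change at optimizer construction, e.g. passing `betas=(0.9, rescale_beta2(0.999, 512, batch_new))` to `torch.optim.AdamW`.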