How to train LLMs with small batch sizes?
Here, I show you how to train large language models with small batch sizes,
something the broader ML community tends to regard as unnatural. The key lies
in adjusting the β values of your Adam optimizer as the batch size changes,
specifically β2, the decay rate of the exponential moving average used for the
second-moment preconditioner. As it turns out, decreasing the batch size buys
you robustness to hyperparameter misspecification, and (obviously) reduced
memory consumption. Accepted at NeurIPS 2025.
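To make the kind of β2 adjustment concrete, here is a minimal Python sketch of one common heuristic: rescale β2 so that the half-life of its EMA, measured in tokens rather than optimizer steps, stays constant as the batch size changes. This is an illustrative assumption, not necessarily the exact rule from the paper, and the names `rescale_beta`, `beta2_ref`, `batch_ref`, and the reference values are hypothetical.

```python
def rescale_beta(beta_ref: float, batch_ref: int, batch_new: int) -> float:
    """Rescale an Adam EMA coefficient so that its half-life, measured in
    tokens (steps * batch size), is preserved when the batch size changes.

    The half-life in steps is t = ln(0.5) / ln(beta), so the half-life in
    tokens is t * batch. Keeping the token half-life fixed gives
        beta_new = beta_ref ** (batch_new / batch_ref).
    """
    return beta_ref ** (batch_new / batch_ref)

# Hypothetical example: a recipe tuned at batch size 512 with beta2 = 0.999,
# moved to a small batch size of 32.
beta2_ref, batch_ref = 0.999, 512
batch_new = 32
beta2_new = rescale_beta(beta2_ref, batch_ref, batch_new)
print(f"beta2 at batch {batch_new}: {beta2_new:.6f}")  # ~0.999938
```

Note that under this heuristic a smaller batch size pushes β2 closer to 1, so the preconditioner averages over more optimizer steps while covering the same number of tokens.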