
Theory-driven Scalable Optimization for GenAI Model Training
Designing theory-grounded optimization methods that slash memory and compute for training large language models, diffusion models, and 3D Gaussians.
01
APOLLO: SGD-Like Memory, AdamW-Level Performance
-
MLSys 2025 Outstanding Paper Honorable Mention
-
Integrated into HuggingFace, LLaMA-Factory, Axolotl, FluxML, etc.
Most of the memory used when training large language models actually sits in the AdamW optimizer, not the model itself; APOLLO swaps AdamW’s element-wise learning-rate tracking for a simple low-rank approximation so compact that it cuts optimizer memory to near-SGD levels while still matching (and often beating) AdamW’s accuracy. These savings translate into real wins: teams can triple throughput on an 8× A100-80 GB cluster, fit Llama-13B on a single 80 GB GPU with vanilla DDP, and, using the ultra-light APOLLO-Mini plus 8-bit weights, even pre-train Llama-7B in just 12 GB of GPU memory.
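
A minimal PyTorch sketch of the core idea: track Adam-style moments only in a random low-rank projection of the gradient, then use them to rescale the full gradient channel-wise. The hyperparameters and the exact scaling rule below are illustrative assumptions, not the released APOLLO implementation.

```python
# Sketch of a low-rank, channel-wise gradient-scaling step in the spirit of APOLLO
# (illustrative only; `rank`, `beta1`, `beta2` are placeholder hyperparameters).
import torch

def apollo_like_step(grad, state, rank=1, beta1=0.9, beta2=0.999, eps=1e-8, lr=1e-3):
    """grad: (m, n) weight gradient. Returns the scaled full-rank update while
    storing optimizer state only in an (m, rank) projected space."""
    m, n = grad.shape
    if "proj" not in state:
        # Fixed random projection: n columns -> `rank` columns (memory ~ m * rank).
        state["proj"] = torch.randn(n, rank, device=grad.device) / rank**0.5
        state["exp_avg"] = torch.zeros(m, rank, device=grad.device)
        state["exp_avg_sq"] = torch.zeros(m, rank, device=grad.device)
        state["step"] = 0
    state["step"] += 1

    r = grad @ state["proj"]                      # project gradient to low rank
    state["exp_avg"].mul_(beta1).add_(r, alpha=1 - beta1)
    state["exp_avg_sq"].mul_(beta2).addcmul_(r, r, value=1 - beta2)
    m_hat = state["exp_avg"] / (1 - beta1 ** state["step"])
    v_hat = state["exp_avg_sq"] / (1 - beta2 ** state["step"])
    r_tilde = m_hat / (v_hat.sqrt() + eps)        # Adam-style update, in low rank

    # Per-row scaling factor: roughly how much Adam would rescale this channel.
    scale = r_tilde.norm(dim=1, keepdim=True) / (r.norm(dim=1, keepdim=True) + eps)
    return -lr * scale * grad                     # SGD-like memory, adaptive scaling
```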

02
HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training
-
ICML 2025
Training LLMs across geo-distributed data centers normally crawls because every GPU has to wait for the slowest link, but our proposed HALoS algorithm sidesteps that traffic jam by letting each region’s “mini-server” do most of the updating locally and send only occasional, compressed summaries to a global server: no blocking, no stragglers. This hierarchical, fully asynchronous design comes with a rigorous convergence proof and, in realistic geo-distributed simulations, cuts total training time by up to 7.5× while matching the accuracy of fully synchronous runs, so scattered GPUs behave almost like a single super-cluster.
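
A toy, single-process sketch of the hierarchical asynchronous pattern described above: workers push deltas to a regional server, which applies them immediately and only occasionally reconciles with a global server. The class names, update rules, and sync schedule are illustrative assumptions, not the paper's exact algorithm.

```python
# Toy sketch of hierarchical asynchronous parameter averaging (no real networking).
import torch

class GlobalServer:
    def __init__(self, params, global_lr=0.5):
        self.params = params.clone()
        self.global_lr = global_lr

    def receive_local_delta(self, delta):
        # Fully asynchronous: incorporate whichever region reports first.
        self.params += self.global_lr * delta

class LocalServer:
    def __init__(self, global_server, sync_every=4, local_lr=1.0):
        self.global_server = global_server
        self.params = global_server.params.clone()
        self.sync_every = sync_every
        self.local_lr = local_lr
        self.updates_seen = 0

    def receive_worker_delta(self, delta):
        # Apply a worker's local-SGD delta immediately, without waiting for peers.
        self.params += self.local_lr * delta
        self.updates_seen += 1
        if self.updates_seen % self.sync_every == 0:
            # Occasionally push a summary upward and pull the latest global view.
            self.global_server.receive_local_delta(self.params - self.global_server.params)
            self.params = self.global_server.params.clone()

# Usage: one global server, one regional server absorbing worker updates.
gs = GlobalServer(torch.zeros(10))
region_a = LocalServer(gs)
region_a.receive_worker_delta(0.01 * torch.randn(10))
```

The key point of the design is that no call above blocks on a peer: slow cross-region links only ever carry the infrequent local-to-global summaries.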

03
Steepest Descent Density Control for Compact 3D Gaussian Splatting
-
CVPR 2025
SteepGS gives 3D Gaussian Splatting a theory-driven makeover: instead of blindly duplicating points, it pinpoints Gaussians that have stalled at optimization “saddle points,” then splits each one just once, along the steepest downhill direction and at half opacity, to keep learning moving without runaway growth. That simple rule cuts the point cloud by about 50% while matching or even improving view quality and frame rate, so high-fidelity scenes that once demanded heavyweight GPUs can now run comfortably on a single card or even mobile-class hardware.
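
A minimal sketch of the splitting rule in spirit: given a per-Gaussian curvature proxy (assumed precomputed), flag Gaussians whose position gradient has vanished but whose curvature still has a clearly negative direction, and split each once along that direction at half opacity. The thresholds and offset magnitude are illustrative.

```python
# Sketch of saddle-aware Gaussian splitting (simplified, not the SteepGS code).
import torch

def split_stalled_gaussians(positions, opacities, grads, split_matrices,
                            grad_tol=1e-4, eps=1e-3):
    """positions: (N,3), opacities: (N,1), grads: (N,3) position gradients,
    split_matrices: (N,3,3) per-Gaussian Hessian-like curvature proxies."""
    eigvals, eigvecs = torch.linalg.eigh(split_matrices)   # ascending eigenvalues
    min_eval = eigvals[:, 0]
    min_evec = eigvecs[:, :, 0]                            # steepest-descent direction

    # "Stalled at a saddle": tiny gradient, but a clearly negative curvature direction.
    stalled = (grads.norm(dim=1) < grad_tol) & (min_eval < -eps)

    # Split each stalled Gaussian exactly once: two children offset along the most
    # negative eigenvector, each at half the parent's opacity.
    step = 0.5 * min_evec[stalled]
    child_a = positions[stalled] + step
    child_b = positions[stalled] - step
    child_opacity = 0.5 * opacities[stalled]

    new_positions = torch.cat([positions[~stalled], child_a, child_b], dim=0)
    new_opacities = torch.cat([opacities[~stalled], child_opacity, child_opacity], dim=0)
    return new_positions, new_opacities
```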

04
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
-
ICML 2024 Oral
-
Integrated into HuggingFace, PyTorch, LLaMA-Factory, FedML, Axolotl, etc.
GaLore squeezes the “hidden bulk” out of large-model training by noticing that each weight-update matrix is mostly redundant; it projects those gradients onto a tiny low-rank core before the optimizer ever stores them, slashing optimizer-state memory by up to 82% and total GPU footprint by roughly two-thirds while still learning every parameter. That efficiency jump means you can pre-train or fine-tune models as big as Llama-7B on a single 24 GB RTX 4090, or run bigger batches on multi-GPU rigs without activation checkpointing or offloading, turning hardware once reserved for hobbyists into an honest research platform.
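
A minimal sketch of gradient low-rank projection wrapped around a standard Adam update: the projection basis is refreshed periodically from an SVD of the current gradient, the moments live in the low-rank space, and the update is projected back to the full weight shape. `rank` and `update_proj_every` are illustrative choices, not GaLore's defaults.

```python
# Sketch of a GaLore-style low-rank-projected Adam step (illustrative only).
import torch

def galore_like_step(grad, state, rank=8, update_proj_every=200,
                     beta1=0.9, beta2=0.999, eps=1e-8, lr=1e-3):
    """grad: (m, n) weight gradient; optimizer state is stored at rank x n."""
    m, n = grad.shape
    if state.get("step", 0) % update_proj_every == 0:
        # Periodically refresh the subspace from the gradient's top singular vectors.
        u, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = u[:, :rank]                                   # (m, rank)
        state.setdefault("exp_avg", torch.zeros(rank, n, device=grad.device))
        state.setdefault("exp_avg_sq", torch.zeros(rank, n, device=grad.device))
    state["step"] = state.get("step", 0) + 1

    r = state["P"].T @ grad                       # low-rank gradient, (rank, n)
    state["exp_avg"].mul_(beta1).add_(r, alpha=1 - beta1)
    state["exp_avg_sq"].mul_(beta2).addcmul_(r, r, value=1 - beta2)
    m_hat = state["exp_avg"] / (1 - beta1 ** state["step"])
    v_hat = state["exp_avg_sq"] / (1 - beta2 ** state["step"])
    low_rank_update = m_hat / (v_hat.sqrt() + eps)

    return -lr * (state["P"] @ low_rank_update)   # project back to the full shape
```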

05
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
-
NeurIPS 2023
Patch Diffusion pioneers training image-diffusion models in bite-sized chunks: it crops random patches, tags them with their on-image coordinates, and varies patch sizes so the network still “sees” whole-image structure, cutting training time by 50% while letting the same U-Net sample full images exactly as before. Despite the lighter diet, it matches or beats full-image baselines (e.g., FID 1.77 on CelebA-64, 2.72 on ImageNet-256) and can train usable generators from as few as 5,000 pictures, democratizing diffusion training for researchers without giant GPUs or datasets.
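
A minimal sketch of the patch-plus-coordinates recipe: sample a crop at a random scale and append normalized per-pixel coordinate channels so the denoiser knows where the patch sits in the full image. The patch-size set and coordinate convention are assumptions for illustration.

```python
# Sketch of coordinate-conditioned random patch sampling (illustrative convention).
import random
import torch

def sample_training_patch(image, patch_sizes=(16, 32, 64)):
    """image: (C, H, W) float tensor. Returns a (C+2, p, p) patch with
    normalized (x, y) coordinate channels appended."""
    _, H, W = image.shape
    p = random.choice(patch_sizes)                 # vary the patch scale per sample
    top = random.randint(0, H - p)
    left = random.randint(0, W - p)
    patch = image[:, top:top + p, left:left + p]

    # Per-pixel coordinates of the crop within the full image, scaled to [-1, 1].
    ys = torch.linspace(-1, 1, H)[top:top + p]
    xs = torch.linspace(-1, 1, W)[left:left + p]
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([xx, yy], dim=0)          # (2, p, p)

    return torch.cat([patch, coords], dim=0)       # (C + 2, p, p)
```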

06
Learning to Grow Pretrained Models for Efficient Transformer Training
-
ICLR 2023 Spotlight
-
Implemented in IBM’s AI production system
-
Covered by MIT News
LiGO turns model scaling into a smart copy-and-paste: built on the rigorous theory of transformer weight equivariance, it learns a tiny linear “growth operator” that stretches the weights of a small, already-trained transformer into a larger, deeper and wider one in just ~100 gradient steps, instead of restarting training from scratch. That quick transplant chops training compute by 45–55% and roughly halves wall-clock time for BERT, GPT-2, and even ImageNet vision transformers, while matching or beating train-from-scratch baselines and stacking neatly with other efficiency tricks.
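
A minimal sketch of a learnable linear growth operator: the larger layer's weight is parameterized as a learned two-sided linear map of the frozen, smaller pretrained weight, warm-started as an identity-like expansion. This is a simplification of LiGO's factorized width/depth operators, for illustration only.

```python
# Sketch of a learnable linear "growth operator" that expands a small weight matrix.
import torch
import torch.nn as nn

class GrowthOperator(nn.Module):
    def __init__(self, d_small, d_large):
        super().__init__()
        # Expansion maps for the output and input dimensions, initialized so the
        # top-left block of the grown weight copies the small weights exactly.
        self.A = nn.Parameter(torch.eye(d_large, d_small))   # (d_large, d_small)
        self.B = nn.Parameter(torch.eye(d_large, d_small))   # (d_large, d_small)

    def forward(self, w_small):
        # w_small: (d_small, d_small) frozen weight of the trained small model.
        return self.A @ w_small @ self.B.T                   # (d_large, d_large)

# Usage: learn A and B for a short warm-up (on the order of ~100 steps) by training
# the *grown* model on the task loss with w_small frozen, then drop the operator
# and continue training the large model directly.
w_small = torch.randn(256, 256)                              # toy stand-in for a pretrained weight
grow = GrowthOperator(256, 512)
w_large = grow(w_small)                                      # (512, 512) initialization
```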
