
Efficient Inference Algorithms and Architecture Optimization

Building theory-informed, lightweight inference solutions and elastic model architectures that meet any device budget while preserving state-of-the-art accuracy.


01

Understanding and Mitigating Bottlenecks of State Space Models Through the Lens of Recency and Over-smoothing
  • ICLR 2025

Despite their hype as lightweight “Transformer-killers,” modern Structured State-Space Models (SSMs) such as Mamba turn out to be both short-sighted (they forget tokens the moment they scroll out of view) and over-smoothed (deep layers blur every token into the same mush). We prove this recency/over-smoothing trade-off mathematically and expose it in large-scale retrieval and adversarial tests. We then fix both problems with one simple tweak, polarization: two hidden-state channels are locked at 0 and 1, so the model keeps a crisp snapshot of the past while preventing runaway averaging, restoring long-range recall and letting deeper SSMs outperform their unpatched counterparts.
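
A minimal sketch of the polarization idea, assuming a toy diagonal gated SSM recurrence; the channel layout, gate shapes, and function name are illustrative, not the paper's implementation:

```python
import numpy as np

def polarized_ssm_scan(x, a, b):
    """x: (T, D) inputs; a, b: (T, D) per-step gates in [0, 1]."""
    a = a.copy()
    a[:, 0] = 1.0   # polarized channel: decay fixed at 1, keeps a crisp memory of the past
    a[:, 1] = 0.0   # polarized channel: decay fixed at 0, never accumulates (no over-smoothing)
    h = np.zeros(x.shape[1])
    outputs = []
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]   # diagonal SSM recurrence h_t = a_t * h_{t-1} + b_t * x_t
        outputs.append(h.copy())
    return np.stack(outputs)

T, D = 16, 8
rng = np.random.default_rng(0)
states = polarized_ssm_scan(rng.normal(size=(T, D)),
                            rng.uniform(size=(T, D)),
                            rng.uniform(size=(T, D)))
print(states.shape)  # (16, 8)
```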

Thrust2-1.jpg

02

LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS
  • NeurIPS 2024 Spotlight

LightGaussian trims point-based scenes the smart way: it ranks every 3D Gaussian by a theory-inspired “global significance,” prunes the least useful ones, distills bulky spherical-harmonic lighting into leaner coefficients, and vector-quantizes what remains. Together these steps shrink a typical unbounded scene from 780 MB to 45 MB (roughly a 15× reduction) while raising real-time rendering from 144 FPS to 237 FPS with virtually unchanged visual quality, so high-fidelity 3D splats finally fit on a single GPU or even mobile-grade hardware.
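
A toy sketch of the prune-then-quantize pipeline, using made-up significance scores and scikit-learn k-means as the vector quantizer; the field names and score formula are illustrative stand-ins, not LightGaussian's exact definitions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N = 10_000
opacity = rng.uniform(size=N)        # per-Gaussian opacity (toy values)
volume = rng.uniform(size=N)         # per-Gaussian spatial extent (toy values)
sh = rng.normal(size=(N, 48))        # spherical-harmonic color coefficients

# 1) Rank by a toy global-significance score and prune the least significant two thirds.
score = opacity * volume
keep = np.argsort(score)[-N // 3:]
sh_kept = sh[keep]

# 2) Vector-quantize the surviving SH coefficients into a 256-entry codebook,
#    so each Gaussian stores a small index instead of 48 floats.
vq = KMeans(n_clusters=256, n_init=4, random_state=0).fit(sh_kept)
codes = vq.predict(sh_kept)
print(sh_kept.shape, vq.cluster_centers_.shape, codes.shape)
```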

Thrust2-2.jpg

03

FLEXTRON: Many-in-One Flexible Large Language Model
  • ICML 2024 Oral

  • Productionized by NVIDIA

FLEXTRON turns a single, already-trained LLM into a “many-in-one” model: its nested, weight-sharing design lets it carve out dozens of smaller or larger sub-networks on the fly, matching anything from a phone-sized budget to a datacenter GPU, without any extra fine-tuning. A tiny router then picks the best sub-network per request (or even per token) to meet a user-set latency or memory target, while still beating comparably sized models and earlier elastic architectures, all for roughly 7% of the original pre-training cost.
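
A toy analogy of the nested, weight-sharing idea: smaller “sub-networks” are prefixes of one shared linear layer, and a trivial budget rule plays the role of the router. The class, thresholds, and widths are hypothetical, not FLEXTRON's actual architecture or router:

```python
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    """One shared weight matrix; narrower sub-networks use a prefix of its rows."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.full = nn.Linear(d_in, d_out)

    def forward(self, x, width=1.0):
        k = max(1, int(self.full.out_features * width))   # slice the shared weights
        return x @ self.full.weight[:k].T + self.full.bias[:k]

def route(latency_budget_ms):
    # Toy router: a tighter budget picks a narrower sub-network.
    return 0.25 if latency_budget_ms < 5 else 0.5 if latency_budget_ms < 20 else 1.0

layer = ElasticLinear(512, 2048)
x = torch.randn(4, 512)
print(layer(x, width=route(3)).shape)    # torch.Size([4, 512])   -> 25% width
print(layer(x, width=route(50)).shape)   # torch.Size([4, 2048])  -> full width
```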

Thrust2-3.jpg

04

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
  • NeurIPS 2023

  • Integrated into DeepSpeed by Microsoft, and Llama-Recipes by Meta

H2O builds on the observation that during generation most attention weight concentrates on a tiny set of “heavy-hitter” tokens, so it keeps only those plus the most recent tokens in the key-value cache; this simple eviction rule cuts cache memory by 5–20× while preserving model accuracy. With so much memory freed, a single GPU can handle far larger batches and prompts, boosting OPT-30B throughput by up to 29× and trimming inference latency by almost 2× versus FlexGen, DeepSpeed-Zero, and Hugging Face Accelerate.
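
A minimal sketch of a heavy-hitter-plus-recent eviction rule over a KV cache, following the description above; the scores, budgets, and function names are illustrative assumptions, not H2O's exact code:

```python
import numpy as np

def evict(attn_mass, n_heavy=4, n_recent=4):
    """attn_mass: (T,) accumulated attention each cached token has received.
    Returns the indices of tokens to keep in the KV cache."""
    T = attn_mass.shape[0]
    recent = list(range(max(0, T - n_recent), T))              # always keep the newest tokens
    ranked = np.argsort(attn_mass)[::-1].tolist()              # tokens by accumulated attention
    heavy = [i for i in ranked if i not in recent][:n_heavy]   # then the top heavy hitters
    return sorted(recent + heavy)

rng = np.random.default_rng(0)
scores = rng.gamma(shape=0.3, size=32)   # skewed: a few tokens attract most of the attention
print(evict(scores))                     # keeps the 4 newest tokens plus the 4 heaviest earlier ones
```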

Thrust2-4.jpg

05

Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective
  • ICLR 2021

  • Covered by NSF News

TE-NAS was the first to bridge deep-learning theory with neural architecture search. Instead of training thousands of candidates, it scores each operator in a super-net using two training-free, theory-grounded metrics: the condition number of its Neural Tangent Kernel (how easily it trains) and the number of linear regions it can carve out (how expressive it is). Armed with these instant diagnostics, TE-NAS discovers ImageNet-grade cells in just 0.5–4 GPU-hours, matching or topping state-of-the-art NAS methods that burn days of compute and hundreds of full model trainings.
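
A toy illustration of the two training-free scores on a tiny ReLU MLP: an empirical NTK condition number built from per-example gradients, and the count of distinct ReLU activation patterns as a crude proxy for linear regions. Sizes and details are simplified assumptions, not TE-NAS itself:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(32, 8)

# (1) Empirical NTK: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>; its condition
#     number serves as the "trainability" score.
grads = []
for i in range(x.shape[0]):
    net.zero_grad()
    net(x[i:i + 1]).sum().backward()
    grads.append(torch.cat([p.grad.flatten() for p in net.parameters()]))
G = torch.stack(grads)
ntk = G @ G.T
eig = torch.linalg.eigvalsh(ntk)                    # ascending eigenvalues
print("NTK condition number:", (eig[-1] / eig[0].clamp_min(1e-12)).item())

# (2) Count distinct ReLU activation patterns over the inputs, a crude proxy for
#     how many linear regions the network carves out (the "expressivity" score).
pre_act = net[0](x)
patterns = {tuple((row > 0).int().tolist()) for row in pre_act}
print("distinct activation patterns:", len(patterns))
```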

Thrust2-5.jpg

06

The Lottery Ticket Hypothesis for Pre-trained BERT Networks
  • NeurIPS 2020

  • Covered by MIT News

For the first time, we showed that a pre-trained BERT already contains “lottery-ticket” subnetworks: after trimming away 40–90% of its weights, these slim variants can be fine-tuned to match the full model on GLUE, SQuAD, and other benchmarks. Most strikingly, a single 70%-sparse mask found with BERT’s own masked-language-modeling objective transfers intact to every downstream task we tried, hinting at BERT-level versatility with a fraction of the memory and compute, and pointing to a strong “inductive bias” in pre-trained model weights.
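
A minimal sketch of the generic lottery-ticket recipe behind this result: find a global magnitude mask at a target sparsity, then apply it before fine-tuning. The toy encoder stands in for BERT, and the helper names are illustrative, not the paper's exact pipeline:

```python
import torch
import torch.nn as nn

def magnitude_masks(model, sparsity=0.7):
    """Return {name: 0/1 mask} keeping the largest-magnitude weights globally."""
    all_w = torch.cat([p.detach().abs().flatten()
                       for n, p in model.named_parameters() if "weight" in n])
    threshold = torch.quantile(all_w, sparsity)       # everything below this gets pruned
    return {n: (p.detach().abs() > threshold).float()
            for n, p in model.named_parameters() if "weight" in n}

def apply_masks(model, masks):
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p.mul_(masks[n])                       # zero out the pruned weights

toy_encoder = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))
masks = magnitude_masks(toy_encoder, sparsity=0.7)
apply_masks(toy_encoder, masks)                        # fine-tune with the mask held fixed
kept = float(sum(m.sum() for m in masks.values())) / sum(m.numel() for m in masks.values())
print(f"weights kept: {kept:.0%}")                     # roughly 30%
```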

Thrust2-6.jpg
