
Efficient Inference Algorithms and Architecture Optimization

Building theory-informed, lightweight inference solutions and elastic model architectures that meet any device budget while preserving state-of-the-art accuracy.


01

Understanding and Mitigating Bottlenecks of State Space Models Through the Lens of Recency and Over-smoothing
  • ICLR 2025

Despite their hype as lightweight “Transformer-killers,” modern Structured State-Space Models (SSMs) such as Mamba turn out to be both short-sighted (they forget tokens the moment they scroll out of view) and over-smoothed (deep layers blur every token into the same mush). We prove this recency/over-smoothing trade-off mathematically and expose it in large-scale retrieval and adversarial tests. We then fix both problems with one simple tweak, polarization: two hidden-state channels are locked at 0 and 1, so the model keeps a crisp snapshot of the past while preventing runaway averaging, restoring long-range recall and letting deeper SSMs outperform their unpatched counterparts.
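
A minimal sketch of the polarization idea, assuming a toy diagonal gated SSM recurrence; the channel layout, gate shapes, and function name are illustrative, not the paper's implementation:

```python
import numpy as np

def polarized_ssm_scan(x, a, b):
    """x: (T, D) inputs; a, b: (T, D) per-step gates in [0, 1]."""
    a = a.copy()
    a[:, 0] = 1.0   # polarized channel: decay fixed at 1, keeps a crisp memory of the past
    a[:, 1] = 0.0   # polarized channel: decay fixed at 0, never accumulates (no over-smoothing)
    h = np.zeros(x.shape[1])
    outputs = []
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]   # diagonal SSM recurrence h_t = a_t * h_{t-1} + b_t * x_t
        outputs.append(h.copy())
    return np.stack(outputs)

T, D = 16, 8
rng = np.random.default_rng(0)
states = polarized_ssm_scan(rng.normal(size=(T, D)),
                            rng.uniform(size=(T, D)),
                            rng.uniform(size=(T, D)))
print(states.shape)  # (16, 8)
```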

Thrust2-1.jpg

02

LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS
  • NeurIPS 2024 Spotlight

LightGaussian trims point-based scenes the smart way: it ranks every 3D Gaussian by a theory-inspired “global significance,” prunes the least useful ones, distills bulky spherical-harmonic lighting into leaner coefficients, and vector-quantizes what remains. Together these steps shrink a typical unbounded scene from 780 MB to 45 MB (roughly a 15× reduction) while raising real-time rendering from 144 FPS to 237 FPS with virtually unchanged visual quality, so high-fidelity 3D splats finally fit on a single GPU or even mobile-grade hardware.
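
A toy sketch of the prune-then-quantize pipeline, using made-up significance scores and scikit-learn k-means as the vector quantizer; the field names and score formula are illustrative stand-ins, not LightGaussian's exact definitions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N = 10_000
opacity = rng.uniform(size=N)        # per-Gaussian opacity (toy values)
volume = rng.uniform(size=N)         # per-Gaussian spatial extent (toy values)
sh = rng.normal(size=(N, 48))        # spherical-harmonic color coefficients

# 1) Rank by a toy global-significance score and prune the least significant two thirds.
score = opacity * volume
keep = np.argsort(score)[-N // 3:]
sh_kept = sh[keep]

# 2) Vector-quantize the surviving SH coefficients into a 256-entry codebook,
#    so each Gaussian stores a small index instead of 48 floats.
vq = KMeans(n_clusters=256, n_init=4, random_state=0).fit(sh_kept)
codes = vq.predict(sh_kept)
print(sh_kept.shape, vq.cluster_centers_.shape, codes.shape)
```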

Thrust2-2.jpg

03

FLEXTRON: Many-in-One Flexible Large Language Model
  • ICML 2024 Oral

  • Productionized by NVIDIA

FLEXTRON turns a single, already-trained LLM into a “many-in-one” model: its nested, weight-sharing design lets it carve out dozens of smaller or larger sub-networks on the fly, matching anything from a phone-sized budget to a datacenter GPU, without any extra fine-tuning. A tiny router then picks the best sub-network per request (or even per token) to meet a user-set latency or memory target, while still beating comparably sized models and earlier elastic architectures, all for roughly 7% of the original pre-training cost.
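
A toy analogy of the nested, weight-sharing idea: smaller “sub-networks” are prefixes of one shared linear layer, and a trivial budget rule plays the role of the router. The class, thresholds, and widths are hypothetical, not FLEXTRON's actual architecture or router:

```python
import torch
import torch.nn as nn

class ElasticLinear(nn.Module):
    """One shared weight matrix; narrower sub-networks use a prefix of its rows."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.full = nn.Linear(d_in, d_out)

    def forward(self, x, width=1.0):
        k = max(1, int(self.full.out_features * width))   # slice the shared weights
        return x @ self.full.weight[:k].T + self.full.bias[:k]

def route(latency_budget_ms):
    # Toy router: a tighter budget picks a narrower sub-network.
    return 0.25 if latency_budget_ms < 5 else 0.5 if latency_budget_ms < 20 else 1.0

layer = ElasticLinear(512, 2048)
x = torch.randn(4, 512)
print(layer(x, width=route(3)).shape)    # torch.Size([4, 512])   -> 25% width
print(layer(x, width=route(50)).shape)   # torch.Size([4, 2048])  -> full width
```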

Thrust2-3.jpg

04

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
  • NeurIPS 2023

  • Integrated into DeepSpeed by Microsoft, and Llama-Recipes by Meta

H2O builds on the observation that during generation most attention weight concentrates on a tiny set of “heavy-hitter” tokens, so it keeps only those plus the most recent tokens in the key-value cache; this simple eviction rule cuts cache memory by 5–20× while preserving model accuracy. With so much memory freed, a single GPU can handle far larger batches and prompts, boosting OPT-30B throughput by up to 29× and trimming inference latency by almost 2× versus FlexGen, DeepSpeed-Zero, and Hugging Face Accelerate.
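
A minimal sketch of a heavy-hitter-plus-recent eviction rule over a KV cache, following the description above; the scores, budgets, and function names are illustrative assumptions, not H2O's exact code:

```python
import numpy as np

def evict(attn_mass, n_heavy=4, n_recent=4):
    """attn_mass: (T,) accumulated attention each cached token has received.
    Returns the indices of tokens to keep in the KV cache."""
    T = attn_mass.shape[0]
    recent = list(range(max(0, T - n_recent), T))              # always keep the newest tokens
    ranked = np.argsort(attn_mass)[::-1].tolist()              # tokens by accumulated attention
    heavy = [i for i in ranked if i not in recent][:n_heavy]   # then the top heavy hitters
    return sorted(recent + heavy)

rng = np.random.default_rng(0)
scores = rng.gamma(shape=0.3, size=32)   # skewed: a few tokens attract most of the attention
print(evict(scores))                     # keeps the 4 newest tokens plus the 4 heaviest earlier ones
```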

Thrust2-4.jpg

05

Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective
  • ICLR 2021

  • Covered by NSF News

TE-NAS was the first to bridge deep-learning theory with neural architecture search. Instead of training thousands of candidates, it scores each operator in a super-net using two training-free, theory-grounded metrics: the condition number of its Neural Tangent Kernel (how easily it trains) and the number of linear regions it can carve out (how expressive it is). Armed with these instant diagnostics, TE-NAS discovers ImageNet-grade cells in just 0.5–4 GPU-hours, matching or topping state-of-the-art NAS methods that burn days of compute and hundreds of full model trainings.
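
A toy illustration of the two training-free scores on a tiny ReLU MLP: an empirical NTK condition number built from per-example gradients, and the count of distinct ReLU activation patterns as a crude proxy for linear regions. Sizes and details are simplified assumptions, not TE-NAS itself:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(32, 8)

# (1) Empirical NTK: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>; its condition
#     number serves as the "trainability" score.
grads = []
for i in range(x.shape[0]):
    net.zero_grad()
    net(x[i:i + 1]).sum().backward()
    grads.append(torch.cat([p.grad.flatten() for p in net.parameters()]))
G = torch.stack(grads)
ntk = G @ G.T
eig = torch.linalg.eigvalsh(ntk)                    # ascending eigenvalues
print("NTK condition number:", (eig[-1] / eig[0].clamp_min(1e-12)).item())

# (2) Count distinct ReLU activation patterns over the inputs, a crude proxy for
#     how many linear regions the network carves out (the "expressivity" score).
pre_act = net[0](x)
patterns = {tuple((row > 0).int().tolist()) for row in pre_act}
print("distinct activation patterns:", len(patterns))
```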

Thrust2-5.jpg

06

The Lottery Ticket Hypothesis for Pre-trained BERT Networks
  • NeurIPS 2020

  • Covered by MIT News

For the first time, we showed that a pre-trained BERT already contains “lottery-ticket” subnetworks: after trimming away 40–90% of its weights, these slim variants can be fine-tuned to match the full model on GLUE, SQuAD, and other benchmarks. Most strikingly, a single 70%-sparse mask found with BERT’s own masked-language-modeling objective transfers intact to every downstream task we tried, hinting at BERT-level versatility with a fraction of the memory and compute, and pointing to a strong “inductive bias” in pre-trained model weights.
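
A minimal sketch of the generic lottery-ticket recipe behind this result: find a global magnitude mask at a target sparsity, then apply it before fine-tuning. The toy encoder stands in for BERT, and the helper names are illustrative, not the paper's exact pipeline:

```python
import torch
import torch.nn as nn

def magnitude_masks(model, sparsity=0.7):
    """Return {name: 0/1 mask} keeping the largest-magnitude weights globally."""
    all_w = torch.cat([p.detach().abs().flatten()
                       for n, p in model.named_parameters() if "weight" in n])
    threshold = torch.quantile(all_w, sparsity)       # everything below this gets pruned
    return {n: (p.detach().abs() > threshold).float()
            for n, p in model.named_parameters() if "weight" in n}

def apply_masks(model, masks):
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p.mul_(masks[n])                       # zero out the pruned weights

toy_encoder = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))
masks = magnitude_masks(toy_encoder, sparsity=0.7)
apply_masks(toy_encoder, masks)                        # fine-tune with the mask held fixed
kept = float(sum(m.sum() for m in masks.values())) / sum(m.numel() for m in masks.values())
print(f"weights kept: {kept:.0%}")                     # roughly 30%
```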

Thrust2-6.jpg
