Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format

Man Shi, Arne Symons, Robin Geens, and Chao Fang | Marian Verhelst
Hardware-efficient AI and ML

Deploying LLMs on edge devices could revolutionize how we interact with AI in daily life. Our work on Anda represents a significant step towards enabling efficient LLM inference on resource-constrained platforms, potentially bringing powerful language capabilities to smartphones and IoT devices while maintaining both performance and energy efficiency.

Research goal: Large Language Models (LLMs) have shown remarkable capabilities but face deployment challenges due to their massive computational demands. Weight-only quantization (keeping FP16 activations with INT4 weights) has emerged as a popular solution to reduce model size. However, processing FP activations remains a major bottleneck in terms of energy consumption and computational complexity, highlighting the need for more efficient activation processing solutions.
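To make the weight-only (W4A16) setup concrete, the sketch below quantizes a weight tile to per-group INT4 values and multiplies it with FP16 activations in NumPy. The function names, group size, and scale format are our own illustrative choices, not a specific library's API. The weights shrink 4×, but the multiply-accumulates still operate on floating-point activations, which is exactly the cost highlighted above.

```python
import numpy as np

def quantize_weights_int4(w_fp16, group_size=128):
    """Symmetric per-group INT4 quantization of a weight matrix (illustrative only)."""
    w = w_fp16.astype(np.float32).reshape(-1, group_size)
    # Map each group's largest magnitude onto the INT4 range [-8, 7].
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def w4a16_matmul(x_fp16, q, scale, w_shape):
    """FP16 activations x INT4 weights: weights are dequantized on the fly,
    but the accumulation still runs on floating-point activations."""
    w = (q.astype(np.float32) * scale.astype(np.float32)).reshape(w_shape)
    return (x_fp16.astype(np.float32) @ w).astype(np.float16)

# Example: a 16x256 activation tile against a 256x64 weight tile.
x = np.random.randn(16, 256).astype(np.float16)
w = np.random.randn(256, 64).astype(np.float16)
q, s = quantize_weights_int4(w)
y = w4a16_matmul(x, q, s, w.shape)
```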

Gap in the SotA: Existing approaches to optimize FP activations face significant limitations. GPU implementations require costly conversions and FP computations, while dedicated FP-INT units suffer from high alignment and normalization overhead. Block Floating Point (BFP) solutions either need expensive retraining to maintain accuracy or use long mantissas that increase computation costs. No current solution effectively balances the critical triad of model accuracy, computational efficiency, and energy consumption.

Recent results: This work presents Anda, a comprehensive solution that achieves substantial efficiency gains while maintaining model accuracy. It introduces a variable-length grouped activation format with shared exponents and adjustable mantissa widths, coupled with a training-free adaptive precision search algorithm. The system includes hardware optimizations such as a bit-plane data layout, bit-serial processing units, and runtime compression. Evaluations across OPT, LLaMA, and LLaMA-2 models demonstrate a 2.4× speedup, 4.0× higher area efficiency, and 3.1× higher energy efficiency compared to a GPU-like baseline, while offering flexible accuracy-performance tradeoffs.
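As a rough software model of the idea, the minimal NumPy sketch below groups activations under one shared exponent per group and quantizes each element to a selectable mantissa width. All function names and parameter defaults are our own assumptions; the actual Anda encoding, its adaptive precision search, and the bit-plane/bit-serial hardware are described in the HPCA 2025 paper.

```python
import numpy as np

def encode_grouped(x, group_size=16, mantissa_bits=4):
    """Split a 1-D activation vector into groups, give each group one shared
    exponent, and quantize every element to a signed `mantissa_bits`-bit
    mantissa. A simplified model of a grouped shared-exponent format, not the
    actual Anda encoding or hardware layout."""
    x = x.astype(np.float32).reshape(-1, group_size)
    e_max = np.floor(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-38))
    shared_exp = e_max - (mantissa_bits - 2)      # power-of-two scale per group
    lim = 2 ** (mantissa_bits - 1)
    mant = np.clip(np.round(x / 2.0 ** shared_exp), -lim, lim - 1).astype(np.int8)
    return mant, shared_exp.astype(np.int16)

def decode_grouped(mant, shared_exp):
    """Expand the grouped format back to FP32 to inspect the quantization error."""
    return mant.astype(np.float32) * 2.0 ** shared_exp.astype(np.float32)

# Shorter mantissas cut storage and compute but cost accuracy; the paper's
# training-free adaptive search picks the precision so that accuracy is preserved.
x = np.random.randn(1024).astype(np.float32)
for bits in (3, 4, 6):
    mant, exp = encode_grouped(x, mantissa_bits=bits)
    err = np.abs(decode_grouped(mant, exp) - x.reshape(-1, 16)).mean()
    print(f"{bits}-bit mantissa: mean abs error = {err:.4f}")
```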

Fun fact: Anda is named after a lovely cat who brings good luck - just as we hope our work brings good fortune to efficient LLM deployment!

 

Get in touch
Man Shi
PhD student
Arne Symons
PhD student
Robin Geens
PhD student
Chao Fang
PhD student
Marian Verhelst
Academic staff

Publications about this research topic

Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang and Marian Verhelst. “Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format.” To appear in the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025).

Other research topics in Hardware-efficient AI and ML

Massive parallelism for combinatorial optimisation problems
Hardware-efficient AI and ML
Toon Bettens and Sofie De Weer | Wim Dehaene and Marian Verhelst
Carbon-aware Design Space Exploration for AI Accelerators
Hardware-efficient AI and ML
Jiacong Sun | Georges Gielen and Marian Verhelst
Decoupled Control Flow and Memory Orchestration in the Vortex GPGPU
Hardware-efficient AI and ML
Giuseppe Sarda | Marian Verhelst
Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators
Hardware-efficient AI and ML
Jun Yin | Marian Verhelst
A Scalable Heterogenous Multi-accelerator Platform for AI and ML
Hardware-efficient AI and ML
Ryan Antonio | Marian Verhelst
Uncertainty-Aware Design Space Exploration for AI Accelerators
Hardware-efficient AI and ML
Jiacong Sun | Georges Gielen and Marian Verhelst
Integer GEMM Accelerator for SNAX
Hardware-efficient AI and ML
Xiaoling Yi | Marian Verhelst
Improving GPGPU micro architecture for future AI workloads
Hardware-efficient AI and ML
Giuseppe Sarda | Marian Verhelst
SRAM based digital in memory compute macro in 16nm
Hardware-efficient AI and ML
Weijie Jiang | Wim Dehaene
Scalable large array nanopore readouts for proteomics and next-generation sequencing
Analog and power management circuits, Hardware-efficient AI and ML, Biomedical circuits and sensor interfaces
Sander Crols | Filip Tavernier and Marian Verhelst
Design space exploration of in-memory computing DNN accelerators
Hardware-efficient AI and ML
Pouya Houshmand and Jiacong Sun | Marian Verhelst
Multi-core architecture exploration for layer-fused deep learning acceleration
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
HW-algorithm co-design for Bayesian inference of probabilistic machine learning
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Shirui Zhao | Marian Verhelst
Design space exploration for machine learning acceleration
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators
Hardware-efficient AI and ML
Arne Symons | Marian Verhelst
Optimized deployment of AI algorithms on rapidly-changing heterogeneous multi-core compute platforms
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Josse Van Delm | Marian Verhelst
High-throughput high-efficiency SRAM for neural networks
Ultra-low power digital SoCs and memories, Hardware-efficient AI and ML
Wim Dehaene and Marian Verhelst
Heterogeneous Multi-core System-on-Chips for Ultra Low Power Machine Learning Application at the Edge
Hardware-efficient AI and ML
Pouya Houshmand, Giuseppe Sarda, and Ryan Antonio | Marian Verhelst

Want to work with us?

Get in touch or discover the ways we can collaborate.