Making large language models faster and more energy-efficient on edge devices such as smartphones, PCs, and in-vehicle systems is not a compute problem. It is a data movement problem. Generating each token requires the chip to read the entire context history from memory, yet edge devices offer memory bandwidth roughly two orders of magnitude lower than that of server hardware, making memory access the decisive bottleneck for edge inference.
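To make the gap concrete, here is a rough back-of-envelope model of memory-bound decoding throughput; the bandwidth and model-size numbers are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope: token rate when decoding is memory-bound.
# All numbers are illustrative assumptions, not figures from the talk.

def tokens_per_second(bandwidth_bytes_s, weight_bytes, kv_cache_bytes):
    """Each generated token must stream the model weights and the
    accumulated context (KV cache) from memory at least once, so
    throughput is capped at bandwidth / bytes moved per token."""
    return bandwidth_bytes_s / (weight_bytes + kv_cache_bytes)

weights = 3.5e9    # hypothetical 7B model at 4-bit precision (~3.5 GB)
kv_cache = 1.0e9   # hypothetical KV cache for a long context (~1 GB)

server = tokens_per_second(3e12, weights, kv_cache)  # HBM-class, ~3 TB/s
edge = tokens_per_second(30e9, weights, kv_cache)    # LPDDR-class, ~30 GB/s
print(f"server: {server:.0f} tok/s, edge: {edge:.1f} tok/s")
```

Under these assumptions, the ~100x bandwidth gap translates directly into a ~100x throughput gap, no matter how much compute the chip offers.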
Speculative decoding offers a promising direction: a lightweight draft model predicts multiple candidate tokens ahead, and the target model verifies them all in a single pass, spreading the memory access cost across more generated tokens. However, the target model must still reload the complete context history at every verification step, so as the context grows, the memory bottleneck persists. This talk presents an algorithm-hardware co-design whose central finding is that the intermediate signals the draft model produces before verification begins, a natural byproduct of the algorithm, happen to reveal which parts of the context truly matter: a signal that hardware has never previously exploited.
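As a concrete illustration of the amortization, the following is a minimal sketch of one speculative decoding step with greedy verification. `draft_model` and `target_model` are hypothetical stand-ins (each maps a token sequence to next-token predictions); production systems verify against the target's full probability distribution rather than a single greedy choice.

```python
def speculative_decode_step(context, draft_model, target_model, k=4):
    # 1. The cheap draft model proposes k candidate tokens autoregressively.
    candidates, draft_ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(draft_ctx)   # one inexpensive draft forward pass
        candidates.append(tok)
        draft_ctx.append(tok)

    # 2. The expensive target model scores all k positions in ONE pass, so
    #    a single reload of its weights and context is amortized over up
    #    to k + 1 emitted tokens instead of one.
    preds = target_model(context, candidates)  # k + 1 next-token predictions

    # 3. Accept the longest agreeing prefix; on the first mismatch the
    #    target's own prediction supplies a corrected token for free.
    accepted = []
    for i, tok in enumerate(candidates):
        if preds[i] != tok:
            accepted.append(preds[i])  # target's correction
            break
        accepted.append(tok)
    else:
        accepted.append(preds[k])      # all k accepted: one bonus token
    return accepted
```

Every accepted token is one fewer full pass over the weights and context history, which is exactly where the memory savings come from.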
The first half of the talk covers the fundamentals of large language model inference, the memory bottleneck on edge devices, and the basic principles of speculative decoding. The second half dives into the algorithm design and hardware microarchitecture, using analytical modeling to characterize efficiency trade-offs and explore how algorithm parameters and hardware design inform each other.
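For a flavor of the analytical modeling, one standard model from the speculative decoding literature gives the expected number of tokens emitted per verification pass in terms of the per-token acceptance probability a and draft length k; the parameter values below are illustrative, not taken from the talk.

```python
# Expected tokens per target verification pass, assuming each drafted
# token is accepted independently with probability a (a standard model
# from the speculative decoding literature; values are illustrative):
#   E[tokens] = (1 - a**(k + 1)) / (1 - a)

def expected_tokens_per_pass(a, k):
    return (1 - a ** (k + 1)) / (1 - a)

for k in (2, 4, 8):
    print(f"k={k}: {expected_tokens_per_pass(0.8, k):.2f} tokens/pass")
```

Longer drafts help, but with diminishing returns as a**(k + 1) vanishes; this is the kind of trade-off that couples the choice of algorithm parameters to the hardware's verification cost.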
8/5/2026 11:00 - 12:00
ESAT Aula L