Research Goal: The aim of this research project is to design efficient and scalable hardware systems for probabilistic machine learning models. Unlike "black box" deep learning methods, probabilistic models are gaining popularity due to their ability to integrate domain knowledge, deal with uncertainty, and produce interpretable results. However, inference on probabilistic models is computationally intensive and requires a large memory footprint. Designing dedicated hardware for probabilistic models allows for the optimization of computation and memory utilization, leading to faster and more energy-efficient processing. Additionally, dedicated hardware can enable the use of probabilistic models in resource-constrained environments, such as mobile devices and Internet of Things (IoT) devices.
Gap in SotA: The current state-of-the-art (SotA) in ML processors is primarily focused on accelerating deep learning workloads, with little emphasis on Bayesian or probabilistic inference acceleration. The major challenges in accelerating Bayesian inference are its need for sequential data processing and its frequent updates of large, irregular data structures, which make the computation difficult to map onto widely parallel hardware platforms. The result is a lack of energy-efficient compute platforms for compute-intensive probabilistic inference algorithms, preventing their application in edge devices. There is hence a clear need for flexible hardware solutions that can handle the dynamic and changing requirements of Bayesian inference algorithms in real-world applications.
Results: This research project started with the development of a basic hardware block for probabilistic inference: the Knuth-Yao sampler. Generating random variables is the fundamental operation in this field, as typical approximate Bayesian inference involves the generation of billions of probabilistic values; hardware samplers therefore become the bottleneck for both overall energy consumption and performance. The novel reconfigurable Knuth-Yao sampling architecture, which supports both a flexible range and dynamic precision, provides up to 13x energy efficiency and 11x area efficiency improvements over the traditional linear CDT-based samplers used in SotA accelerators, evaluated on workloads representative of ours.
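For intuition, the Knuth-Yao sampler walks a discrete distribution generating (DDG) tree, consuming one uniform random bit per level until a leaf (an output symbol) is reached; on average it needs close to the entropy of the distribution in random bits. The following is a minimal software model of the classic algorithm (the probability-bit matrix layout and the `to_bit_matrix` helper are illustrative, not the reconfigurable hardware design described above):

```python
import random

def to_bit_matrix(probs, k):
    """Expand each probability to its first k binary-fraction bits.
    prob_bits[i][j] is bit j of p_i (MSB first)."""
    return [[(int(p * (1 << k)) >> (k - 1 - j)) & 1 for j in range(k)]
            for p in probs]

def knuth_yao_sample(prob_bits, rng=random):
    """Draw one sample from the discrete distribution encoded in prob_bits
    by traversing the DDG tree, one random bit per column (tree level)."""
    n = len(prob_bits)        # number of outcomes
    k = len(prob_bits[0])     # bit precision
    d = 0                     # distance to the rightmost internal node
    for col in range(k):
        d = 2 * d + rng.getrandbits(1)
        for row in range(n - 1, -1, -1):
            d -= prob_bits[row][col]
            if d < 0:         # landed on a leaf labelled `row`
                return row
    # Precision exhausted (truncation error <= 2^-k): restart.
    return knuth_yao_sample(prob_bits, rng)
```

A quick usage check: sampling from probabilities [0.75, 0.25] should return outcome 0 about three quarters of the time.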
This sampler was subsequently integrated into a 16-core approximate inference accelerator, in which each core's ISA is enhanced with probabilistic-inference instructions. The added instructions include a sampling operation, LUT-based special-function approximations, and the ability of cores to access the register files of neighboring cores. The latter allows cores operating on different variables in parallel to quickly exchange information and achieve high inference throughput. The chip has been published at ESSERC 2024.
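As background on one of these ISA extensions: LUT-based special-function approximation replaces an expensive transcendental evaluation (e.g., exp, which appears throughout probabilistic inference) with a table lookup plus cheap interpolation. A minimal software sketch of the idea, assuming a uniformly spaced table with linear interpolation (the table size, input range, and interpolation scheme here are illustrative and not the chip's actual parameters):

```python
import math

def build_exp_lut(lo, hi, entries):
    """Precompute exp() at `entries` uniformly spaced points in [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    table = [math.exp(lo + i * step) for i in range(entries)]
    return table, lo, step

def lut_exp(x, table, lo, step):
    """Approximate exp(x): index into the table, then linearly
    interpolate between the two neighboring entries."""
    idx = (x - lo) / step
    i = min(max(int(idx), 0), len(table) - 2)  # clamp to valid segment
    frac = idx - i
    return table[i] + frac * (table[i + 1] - table[i])
```

With a 256-entry table over [0, 4], the relative error of the interpolated result stays well below 0.1% inside the covered range, which is the kind of accuracy/area trade-off a hardware LUT unit exposes.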