To meet the ever-present demand for smarter and more intelligent machines, increasing research efforts are focused on developing novel artificial intelligence (AI) models. However, despite their promising algorithmic properties, many novel models do not map well onto existing hardware architectures such as GPUs and neural network processors. A salient example of such a class of models is Probabilistic Circuits (PCs), used for neuro-symbolic AI, which exhibit challenging computational patterns based on sparse and irregular graphs. This project takes on this challenge by developing a hardware/software co-optimized computation stack, enabling energy-constrained edge applications.
To address the execution bottlenecks of PCs (and of similar irregular data-flow graphs in general), several contributions are made across the hardware/software stack:
• Application: The most suitable data representation is identified by developing analytical error and energy models of customized fixed- and floating-point formats. A novel representation based on the posit format is also investigated.
• Compilation: Optimized mapping algorithms are developed to parallelize the workloads on general-purpose multithreaded CPUs and dedicated hardware architectures by minimizing synchronization and communication overheads.
• Hardware: Two versions of a dedicated DAG Processing Unit (DPU) are developed, incorporating a dedicated spatial datapath, a targeted interconnection network, a precision-scalable arithmetic unit, and a custom memory hierarchy.
• Implementation: The hardware innovations are realized and validated through an optimized physical implementation of the first DPU version on a chip in 28nm CMOS technology.
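A common baseline for the compilation step above is levelized (breadth-first) scheduling of the DAG: nodes within a level have no mutual dependencies and can execute concurrently, with one synchronization barrier per level. The optimized mapping algorithms developed in this project go beyond this baseline, but the following sketch (with hypothetical node names) illustrates the underlying scheduling problem:

```python
from collections import defaultdict, deque

def levelize(nodes, edges):
    """Partition DAG nodes into levels via Kahn's algorithm.

    Nodes within one level are mutually independent, so they can be
    evaluated in parallel; a barrier is needed only between levels.
    """
    indeg = {v: 0 for v in nodes}
    succ = defaultdict(list)
    for u, v in edges:          # edge u -> v: v consumes the output of u
        succ[u].append(v)
        indeg[v] += 1
    level = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)  # graph inputs
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    levels = defaultdict(list)
    for v in order:
        levels[level[v]].append(v)
    return [levels[i] for i in sorted(levels)]

# Tiny example DAG: two inputs feeding a sum/product-style fan-in.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")]
# levelize(nodes, edges) -> [['a', 'b'], ['c', 'd'], ['e']]
```

The per-level barrier is exactly the synchronization overhead that the optimized mapping algorithms aim to minimize, e.g. by assigning dependent chains to the same thread or processing element.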
The cohesive hardware/software optimizations achieve higher throughput than CPUs and GPUs, while operating at an order of magnitude higher energy efficiency. The main findings can be summarized as follows:
• An 8b posit representation can be customized to reach the same accuracy as 32b floating point for PCs.
• Optimized mapping algorithms achieve a speedup of 2× for multithreaded CPU execution.
• The 28nm DPU prototype achieves speedups of 5× and 20× over CPU and GPU, respectively, while operating below 0.25W.
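To give a sense of how a posit packs dynamic range and precision into 8 bits, the sketch below decodes an 8-bit posit into a float, assuming the standard es = 2 configuration (the customized format developed in this project may use different parameters):

```python
def posit8_to_float(bits, es=2):
    """Decode an 8-bit posit (sign | regime | exponent | fraction).

    The regime is a run of identical bits whose length sets a scale
    factor useed^k with useed = 2^(2^es); remaining bits hold the
    exponent and fraction. This tapered layout gives posits high
    accuracy near 1.0 and a wide dynamic range at the extremes.
    """
    n = 8
    mask = (1 << n) - 1
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")          # NaR: Not a Real
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask        # two's complement of negative posits
    rest = f"{bits & (mask >> 1):0{n - 1}b}"   # bits after the sign
    run = len(rest) - len(rest.lstrip(rest[0]))
    k = run - 1 if rest[0] == "1" else -run    # regime value
    tail = rest[run + 1:]                      # skip the terminating bit
    exp_bits = tail[:es]
    exp = int(exp_bits, 2) << (es - len(exp_bits)) if exp_bits else 0
    frac_bits = tail[es:]
    frac = int(frac_bits, 2) / (1 << len(frac_bits)) if frac_bits else 0.0
    useed = 1 << (1 << es)                     # 2^(2^es) = 16 for es = 2
    return sign * (useed ** k) * (2 ** exp) * (1.0 + frac)

# Examples: 0x40 -> 1.0, 0x48 -> 2.0, 0xC0 -> -1.0, 0x7F -> 2^24 (maxpos)
```

Because PC inference multiplies many probabilities in [0, 1], this tapered precision profile is one plausible reason a well-tuned 8b posit can match 32b float accuracy.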
These results demonstrate that the project contributes important pieces enabling the efficient execution of PCs and similar workloads based on irregular data-flow graphs.