Context: State-of-the-art AI algorithms exhibit highly varying workloads that are challenging to process efficiently, even on existing high-performance multicore architectures. A scalable heterogeneous multi-accelerator architecture offers the efficiency and reconfigurability needed to handle these varying workload demands. However, managing compute resources (such as CPU cores and accelerators), data (layout and transfers), and scalability for an arbitrary AI algorithm is challenging. To achieve good performance, it is crucial to couple the accelerators tightly to memory, without giving up scalability or the flexibility to map a wide variety of workloads.
Research goal: This research aims to develop a standardized accelerator shell with hardware features that support (1) tight coupling of the accelerator and memory for maximum compute-memory efficiency, and (2) the ability to tie many accelerators together for scalability. Each accelerator supports specific kernels in AI algorithms and targets compute-bound rather than memory-bound throughput. A standardized shell allows us to easily connect multiple cores, making the architecture more reconfigurable for new algorithms that may appear. In this work, we propose the Snitch Accelerator Extension (SNAX) shell, which consists of a lightweight management core, a tightly coupled data memory with streaming ports towards an accelerator, a smart DMA that can transform data layouts during data transfers, and a task buffer that acts as a hardware loop, repeatedly running a set of tasks.
SNAX is currently combined with the first set of accelerators for neural network processing: a GeMM accelerator and an activation accelerator.