Artificial Intelligence (AI) is emerging as an indispensable tool in everyday life, powering web search summaries and specialized chatbot agents. Research has applied AI to fields as diverse as medicine, robotics, and finance; some experts have even begun to speak of superhuman intelligence. Yet the principal bottleneck limiting the realization of more sophisticated neural network models remains the available computing infrastructure, from ultra-constrained edge devices to large-scale data centers.
The research behind this thesis starts from the observation that today's computers are fundamentally limited in either efficiency or flexibility when addressing the demands of modern AI workloads. Neural network models exhibit operations that are heterogeneous in both type and shape, stressing computing systems traditionally divided into general-purpose and application-specific classes. General-purpose systems often lack the efficiency needed to meet power budgets, whereas application-specific designs fail to adapt to the variety of operations across different models, or even within a single neural network.
To bridge the gap between what today's computer architectures can deliver and what current and future AI models require, prior work has pursued two approaches. The bottom-up approach increases the flexibility of application-specific designs by adding programmability or by heterogeneously integrating different accelerators that balance each other's weaknesses; the top-down approach instead improves the efficiency of general-purpose architectures by embedding specialized units while retaining broad flexibility. This thesis contributes to both directions.
On the bottom-up side, the thesis presents the DIANA SoC, a microcontroller system integrating two heterogeneous accelerators designed for convolutional neural network execution. One of these is based on the Analog In-Memory Computing (AIMC) paradigm, a promising but rigid technique that offers orders-of-magnitude energy gains at the cost of programmability and computation accuracy.
The thesis analyzes the software and hardware integration of AIMC as an independent accelerator. From the hardware perspective, our approach achieves up to 10 and 30 TOp/s/W on the ResNet18-ImageNet and ResNet20-CIFAR10 benchmarks, respectively. From a software perspective, however, the work reveals that AIMC requires intricate optimizations and challenging algorithmic integration, which limit its viability, especially given how fast AI workloads evolve.
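To make this tradeoff concrete, the following minimal sketch models how AIMC execution is commonly approximated in software: weights are quantized to the discrete conductance levels of the analog array, the analog accumulation is perturbed by noise, and the column outputs pass through a low-resolution ADC. This is an illustrative model only, not the DIANA implementation; the bit widths, noise figure, and function names are assumptions.

    // Illustrative AIMC model (not the DIANA hardware): quantized weights,
    // noisy analog accumulation, and a low-resolution ADC per column.
    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Quantize x to a signed grid of 2^(bits-1)-1 levels over [-max_abs, max_abs].
    static float quantize(float x, int bits, float max_abs) {
        const float levels = static_cast<float>((1 << (bits - 1)) - 1);
        const float step = max_abs / levels;
        float q = std::round(x / step);
        q = std::fmax(std::fmin(q, levels), -levels);      // clip to the grid
        return q * step;
    }

    // One analog matrix-vector product: y = ADC(quantize(W) * x + noise).
    std::vector<float> aimc_mvm(const std::vector<std::vector<float>>& W,
                                const std::vector<float>& x,
                                int w_bits, int adc_bits, float noise_std,
                                std::mt19937& rng) {
        std::normal_distribution<float> noise(0.0f, noise_std);
        std::vector<float> y(W.size());
        for (size_t r = 0; r < W.size(); ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < x.size(); ++c)
                acc += quantize(W[r][c], w_bits, 1.0f) * x[c]; // discrete conductances
            acc += noise(rng);                                 // analog non-idealities
            y[r] = quantize(acc, adc_bits, static_cast<float>(x.size())); // ADC stage
        }
        return y;
    }

    int main() {
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> u(-1.0f, 1.0f);
        std::vector<std::vector<float>> W(4, std::vector<float>(16));
        std::vector<float> x(16);
        for (auto& row : W) for (auto& w : row) w = u(rng);
        for (auto& v : x) v = u(rng);
        auto y = aimc_mvm(W, x, /*w_bits=*/4, /*adc_bits=*/6, /*noise_std=*/0.05f, rng);
        for (float v : y) std::printf("%.3f\n", v);
        return 0;
    }

In models of this kind, accuracy degrades as ADC resolution drops and noise grows, which is why networks typically need calibration or retraining before they can be mapped onto the array.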
On the top-down front, the thesis introduces key extensions to the open-source Vortex GPGPU platform.
These extensions include an innovative tensor core for matrix multiplication acceleration. Our tensor core design achieves 3 times the performance of a baseline implementation by leveraging dedicated internal cache and buffer memories, native support for warp cooperation and specialization, and decoupled control logic.
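The Vortex tensor-core interface itself is not reproduced here; as an analogy for the warp-cooperative programming model that such units expose, the sketch below uses CUDA's standard WMMA API, in which an entire warp jointly loads matrix fragments and issues one cooperative multiply-accumulate per 16x16x16 tile.

    // Warp-cooperative tiled GEMM via CUDA's WMMA API (an analogy for the
    // programming model of the proposed tensor core, not the Vortex ISA).
    // Assumes M, N, K are multiples of 16 and the grid covers all tiles.
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_gemm(const half* A, const half* B, float* C,
                              int M, int N, int K) {
        // Each warp owns one 16x16 tile of C (blockDim.x a multiple of 32).
        int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
        int warpN = blockIdx.y * blockDim.y + threadIdx.y;

        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
        wmma::fill_fragment(acc, 0.0f);

        // March along K: the whole warp cooperatively loads each fragment
        // and performs one fused multiply-accumulate on the tile.
        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a, A + warpM * 16 * K + k, K);
            wmma::load_matrix_sync(b, B + warpN * 16 * K + k, K);
            wmma::mma_sync(acc, a, b, acc);
        }
        wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                                wmma::mem_row_major);
    }

In designs of this kind, dedicated cache and buffer memories keep fragment traffic close to the compute array, while decoupled control lets operand staging run ahead of the multiply-accumulate pipeline.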
Additionally, the thesis proposes microarchitectural enhancements that enable decoupled control flow and decoupled access/compute execution, yielding 8 times faster execution and a 10-fold reduction in dynamic instruction count for memory-intensive kernels.
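The thesis implements this decoupling in hardware; as a software analogy of the classic decoupled access/execute idea, the sketch below splits a reduction into an access stream that runs ahead issuing loads and an execute stream that only consumes operands from a bounded queue. The queue depth and all names are assumptions for illustration.

    // Software analogy of decoupled access/execute (the thesis proposes
    // this mechanism in hardware inside Vortex; this is not that design).
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Bounded FIFO standing in for the hardware load-data queue.
    template <typename T> class BoundedQueue {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
        size_t cap_;
    public:
        explicit BoundedQueue(size_t cap) : cap_(cap) {}
        void push(T v) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return q_.size() < cap_; });
            q_.push(v);
            cv_.notify_all();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            T v = q_.front();
            q_.pop();
            cv_.notify_all();
            return v;
        }
    };

    int main() {
        std::vector<float> data(1 << 20, 1.0f);
        BoundedQueue<float> ldq(64);  // load-data queue depth: an assumption

        // Access stream: runs ahead, issuing loads without waiting on compute.
        std::thread access([&] {
            for (float v : data) ldq.push(v);
        });

        // Execute stream: consumes operands as they arrive; no address
        // arithmetic or load instructions appear on this side.
        double acc = 0.0;
        for (size_t i = 0; i < data.size(); ++i) acc += ldq.pop();

        access.join();
        std::printf("sum = %.1f\n", acc);
        return 0;
    }

The reduction in dynamic instruction count reported above stems from this separation: address generation and load issue no longer occupy the compute path.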
In summary, this work explores both ends of the hardware design spectrum, from analog accelerators to general-purpose architectures enhanced with specialization, offering insights and practical solutions for building future-ready computing platforms tailored to the evolving needs of AI.
7/7/2025 9:30 - 11:30
ESAT Aula L