Efficient Processing of Deep Neural Networks


Book Description

This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics—such as energy efficiency, throughput, and latency—without sacrificing accuracy or increasing hardware costs are critical to the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design for improved energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as a formalization and organization of key concepts from contemporary work that provide insights that may spark new ideas.




Accelerator Architecture for Secure and Energy Efficient Machine Learning


Book Description

ML applications are driving the next computing revolution, and in this context both performance and security are crucial. We propose hardware/software co-design solutions that address both. First, we propose RNNFast, an accelerator for Recurrent Neural Networks (RNNs). RNNs are particularly well suited to machine learning problems in which context is important, such as language translation. RNNFast leverages an emerging class of non-volatile memory called domain-wall memory (DWM). We show that DWM is very well suited to RNN acceleration due to its very high density and low read/write energy. RNNFast is efficient and highly scalable, with a flexible mapping of logical neurons to RNN hardware blocks. The accelerator is designed to minimize data movement by closely interleaving DWM storage and computation. We compare our design with a state-of-the-art GPGPU and find 21.8X higher performance with 70X lower energy. Second, we bring ML security into ML accelerator design for greater efficiency and robustness. Deep Neural Networks (DNNs) are employed in an increasing number of applications, some of which are safety-critical. Unfortunately, DNNs are known to be vulnerable to so-called adversarial attacks. Previously proposed defenses generally have high overhead, and some require attack-specific re-training of the model or careful tuning to adapt to different attacks. We show that these approaches, while successful for a range of inputs, are insufficient against stronger, high-confidence adversarial attacks. To address this, we propose HASI and DNNShield, two hardware-accelerated defenses that adapt the strength of the response to the confidence of the adversarial input. Both techniques rely on approximation or random noise deliberately introduced into the model. HASI uses direct noise injection into the model at inference time. DNNShield relies on dynamic and random sparsification of the DNN model to achieve inference approximation efficiently and with fine-grained control over the approximation error. Both techniques compare the output distribution characteristics of noisy/sparsified inference against a baseline output to detect adversarial inputs. We show an adversarial detection rate of 86% when applied to VGG16 and 88% when applied to ResNet50, which exceeds the detection rate of state-of-the-art approaches at a much lower overhead. Finally, we demonstrate a software/hardware-accelerated FPGA prototype that reduces the performance impact of HASI and DNNShield relative to software-only CPU and GPU implementations.
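
As an illustration of the detection idea shared by HASI and DNNShield, the sketch below runs several stochastic inference passes and flags an input as adversarial when the stochastic predictions disagree with the deterministic baseline. The `model(x, noise_std=...)` interface, the agreement metric, and the thresholds are assumptions made for illustration, not the actual HASI/DNNShield implementation.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def detect_adversarial(model, x, n_runs=8, noise_std=0.05, threshold=0.5):
    """Flag x as adversarial if stochastic inference disagrees with the baseline.

    `model(x, noise_std=...)` is a hypothetical interface: noise_std=0.0 gives
    the deterministic baseline, while noise_std > 0.0 injects random noise into
    the model (HASI-style); random sparsification (DNNShield-style) could be
    plugged in behind the same interface.
    """
    baseline = softmax(model(x, noise_std=0.0))
    base_cls = int(np.argmax(baseline))

    agree = 0
    for _ in range(n_runs):
        noisy = softmax(model(x, noise_std=noise_std))
        agree += int(np.argmax(noisy) == base_cls)

    # Benign inputs tend to keep their label under mild noise/sparsification;
    # high-confidence adversarial inputs are more likely to flip class.
    return (agree / n_runs) < threshold
```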




Low-power Neural Network Accelerators


Book Description

This dissertation investigates design techniques involving custom Floating-Point (FP) computation for low-power neural network accelerators in resource-constrained embedded systems. It focuses on the sustainability of the future omnipresence of Artificial Intelligence (AI) through the development of efficient hardware engines, emphasizing the balance between energy-efficient computation, inference quality, application versatility, and cross-platform compatibility. The research presents a hardware design methodology for low-power inference of Spike-by-Spike (SbS) neural networks. Despite the reduced complexity and noise robustness of SbS networks, their deployment on constrained embedded devices is challenging due to high memory and computational costs. The dissertation proposes a novel Multiply-Accumulate (MAC) hardware module that optimizes the balance between computational accuracy and resource efficiency in FP operations. This module employs a hybrid approach, combining standard FP with custom 8-bit FP and 4-bit logarithmic numerical representations, enabling customization based on application-specific constraints and providing, for the first time, SbS acceleration in embedded systems. Additionally, the study introduces a hardware design for low-power inference in Convolutional Neural Networks (CNNs), targeting sensor-analytics applications. This work proposes a Hybrid-Float6 (HF6) quantization scheme and a dedicated hardware accelerator, and the proposed Quantization-Aware Training (QAT) method demonstrates improved inference quality despite the numerical quantization. The design ensures compatibility with standard ML frameworks such as TensorFlow Lite, highlighting its potential for practical deployment in real-world applications. Overall, this dissertation addresses the critical challenge of harmonizing computational accuracy with energy efficiency in AI hardware engines, with inference quality, application versatility, and cross-platform compatibility as its design philosophy.
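
As a rough illustration of what a custom low-precision FP representation involves, the following sketch rounds values to a parameterizable sign/exponent/mantissa format. The bit allocations shown (a 6-bit and an 8-bit variant) are assumptions for illustration only; the actual HF6 and custom 8-bit FP formats in the dissertation may differ in bias, rounding, and subnormal handling.

```python
import numpy as np

def quantize_minifloat(x, exp_bits=4, man_bits=3):
    """Round x to a custom low-precision float with the given field widths.

    Illustrative only: an IEEE-style bias and round-to-nearest are assumed,
    and small values fall on a fixed subnormal grid; the dissertation's
    HF6 / 8-bit formats may allocate bits or handle rounding differently.
    """
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias          # top exponent code reserved
    sign = np.sign(x)
    mag = np.abs(x)

    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.clip(exp, 1 - bias, max_exp)         # clamp exponent range
    step = 2.0 ** (exp - man_bits)                # spacing of representable values
    q = np.round(mag / step) * step               # round mantissa to man_bits
    q = np.minimum(q, (2 - 2.0 ** -man_bits) * 2.0 ** max_exp)  # saturate
    return sign * np.where(mag > 0, q, 0.0)

# Hypothetical settings: exp_bits=4, man_bits=1 gives a 6-bit float (1+4+1),
# while exp_bits=4, man_bits=3 gives a custom 8-bit float (1+4+3).
print(quantize_minifloat([0.1, -1.7, 3.14159], exp_bits=4, man_bits=3))
```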




Design of High-performance and Energy-efficient Accelerators for Convolutional Neural Networks


Book Description

Deep neural networks (DNNs) have gained significant traction in artificial intelligence (AI) applications over the past decade owing to a drastic increase in their accuracy. This huge leap in accuracy, however, translates into sizable models and high computational requirements, which resource-limited mobile platforms struggle to meet. Embedding AI inference into various real-world applications requires the design of high-performance, area-efficient, and energy-efficient accelerator architectures. In this work, we address the problem of inference accelerator design for dense and sparse convolutional neural networks (CNNs), a type of DNN which forms the backbone of modern vision-based AI systems. We first introduce a fully dense accelerator architecture referred to as the NeuroMAX accelerator. Most traditional dense CNN accelerators rely on single-core, linear processing elements (PEs), in conjunction with 1D dataflows, for accelerating the convolution operations in a CNN. This limits the maximum achievable ratio of peak throughput to PE count to unity, and most past works optimize their dataflows to attain close to 100% hardware utilization in order to reach this ratio. In the NeuroMAX accelerator, we design a high-throughput, multi-threaded, log-based PE core. The designed core provides a 200% increase in peak throughput per PE while incurring only a 6% increase in hardware area compared to a single, linear multiplier PE core with the same output bit precision. The NeuroMAX accelerator also uses a 2D weight-broadcast dataflow which exploits the multi-threaded nature of the PE cores to achieve high hardware utilization per layer for various dense CNN models. Sparse CNN models reduce the massive compute and memory bandwidth requirements inherent in dense CNNs without a significant loss in accuracy. Designing accelerators for sparse CNN models, however, is much more challenging than designing dense CNN accelerators. The micro-architecture, the design of sparse PEs, the load-balancing issues, and the system-level architectural issues of processing an entire sparse CNN model are some of the key technical challenges that must be addressed to design a high-performance and energy-efficient sparse CNN accelerator architecture. We break this problem down into two parts. In the first part, building on concepts from the dense NeuroMAX accelerator, we introduce SparsePE, a multi-threaded and flexible PE capable of handling both dense and sparse CNN computations. The SparsePE core uses a binary mask representation to actively skip ineffectual computations involving zeros and favor valid, non-zero computations, thereby drastically increasing the effective throughput and hardware utilization of the core compared to a dense PE core. In the second part, we generate a two-dimensional (2D) mesh architecture of SparsePE cores, which we refer to as the Phantom accelerator. We also propose a novel dataflow that supports processing all layers of a CNN, including unit- and non-unit-stride convolution (CONV) and fully-connected (FC) layers. In addition, the Phantom accelerator uses a two-level load-balancing strategy to minimize computational idling, thereby further improving the hardware utilization, throughput, and energy efficiency of the accelerator.
The performance of the dense and sparse accelerators is evaluated using a custom-built, cycle-accurate performance simulator and compared against recent works. Hardware logic utilization is also compared against prior works. Finally, we conclude by discussing additional techniques for accelerating CNNs and other avenues where the proposed work can be applied.
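
The zero-skipping idea behind the SparsePE core can be illustrated with a small functional sketch: a binary mask marks the non-zero weights and activations, and only the lanes where both masks are set contribute useful work. The function below is a hypothetical, simplified model of that behavior, not the actual SparsePE datapath or dataflow.

```python
import numpy as np

def masked_mac(weights, acts, w_mask, a_mask):
    """Accumulate only the products whose weight AND activation are non-zero.

    `w_mask`/`a_mask` are the binary masks a sparse PE would keep alongside the
    compressed non-zero values; for clarity, weights/acts are stored dense here.
    This is a functional sketch of zero-skipping, not the SparsePE hardware.
    """
    effective = np.logical_and(w_mask, a_mask)
    idx = np.nonzero(effective)[0]        # only these lanes do useful work
    acc = 0.0
    for i in idx:                         # a real PE would schedule these
        acc += weights[i] * acts[i]       # non-zero pairs onto its threads
    return acc, len(idx)

w = np.array([0.5, 0.0, -1.2, 0.0, 0.3])
a = np.array([1.0, 2.0,  0.0, 0.0, 4.0])
acc, useful = masked_mac(w, a, w != 0, a != 0)
print(acc, useful)   # 1.7, only 2 of 5 multiplications were effectual
```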




All Analog CNN Accelerator with RRAMs for Fast Inference


Book Description

As AI applications become more prevalent and powerful, the performance demands on deep neural networks keep growing, and the need for fast, energy-efficient circuits for computing deep neural networks is urgent. Most current research proposes dedicated hardware in which data is reused thousands of times. However, while re-using the same hardware to perform the same computation repeatedly saves area, it comes at the expense of execution time. This presents another critical obstacle, as the need for real-time, rapid AI requires a fundamentally faster approach to implementing neural networks. The focus of this thesis is to duplicate the key operation, the multiply-and-accumulate (MAC) computation unit, in hardware so that there is no hardware re-use, enabling the entire neural network to be physically fabricated on a single chip. As neural networks today often require hundreds of thousands to tens of millions of MAC computation units, this requires designing the smallest possible MAC unit so that all of the operations fit on chip. Here, we present an initial analysis of a convolutional neural network (CNN) accelerator that implements such a system, optimizing for inference speed. The accelerator duplicates all of the computation hardware, thus eliminating the need to fetch data back and forth while reusing the same hardware. We propose a novel design for memory cells using resistive random access memory (RRAM) and computation units that exploit the analog behavior of transistors. This circuit classifies one CIFAR-10 image in 6 μs (160k frames/s) with 2.4 μJ of energy per classification at an accuracy of 85%. It contains 7.5 million MAC units and achieves a density of 5 million MAC units/mm².
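
The following is a minimal, idealized behavioral model of the analog MAC idea: weights map to RRAM conductances, activations map to read voltages, and the column current gives the dot product. The device parameters and the differential-column mapping are assumptions for illustration, and non-idealities such as IR drop, device variation, and ADC quantization are ignored.

```python
import numpy as np

def rram_crossbar_mac(weights, acts, g_min=1e-6, g_max=1e-4, v_read=0.2):
    """Idealized behavioral model of one analog RRAM MAC column pair.

    Each weight maps to a differential pair of conductances (positive and
    negative column) between g_min and g_max; activations drive the read
    voltages; the bit-line current is the analog dot product (Kirchhoff's
    current law). All parameter values are hypothetical.
    """
    w = np.asarray(weights, dtype=float)
    x = np.asarray(acts, dtype=float)
    w_scale = np.max(np.abs(w)) or 1.0
    x_scale = np.max(np.abs(x)) or 1.0

    g_pos = g_min + (g_max - g_min) * np.clip(w, 0, None) / w_scale
    g_neg = g_min + (g_max - g_min) * np.clip(-w, 0, None) / w_scale
    v = v_read * x / x_scale                      # activations as read voltages

    i_out = v @ g_pos - v @ g_neg                 # differential column current
    # Map the analog current back to the numeric domain of the dot product.
    return i_out * w_scale * x_scale / ((g_max - g_min) * v_read)

print(rram_crossbar_mac([0.5, -1.0, 0.25], [1.0, 2.0, 4.0]))  # ~ -0.5 (= w . x)
```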







Towards Heterogeneous Multi-core Systems-on-Chip for Edge Machine Learning


Book Description

This book explores and motivates the need for building homogeneous and heterogeneous multi-core systems for machine learning in order to achieve flexibility and energy efficiency. Coverage focuses on a key aspect of the challenges of (extreme-)edge computing, i.e., the design of energy-efficient and flexible hardware architectures, together with hardware-software co-optimization strategies that enable early design space exploration of hardware architectures. The authors investigate possible design solutions for building single-core specialized hardware accelerators for machine learning and motivate the scaling to homogeneous and heterogeneous multi-core systems for flexibility and energy efficiency. The advantages of scaling to heterogeneous multi-core systems are demonstrated through the implementation of multiple test chips and architectural optimizations.




TinyML


Book Description

Deep learning networks are getting smaller. Much smaller. The Google Assistant team can detect words with a model just 14 kilobytes in size—small enough to run on a microcontroller. With this practical book you’ll enter the field of TinyML, where deep learning and embedded systems combine to make astounding things possible with tiny devices. Pete Warden and Daniel Situnayake explain how you can train models small enough to fit into any environment. Ideal for software and hardware developers who want to build embedded systems using machine learning, this guide walks you through creating a series of TinyML projects, step-by-step. No machine learning or microcontroller experience is necessary.

- Build a speech recognizer, a camera that detects people, and a magic wand that responds to gestures
- Work with Arduino and ultra-low-power microcontrollers
- Learn the essentials of ML and how to train your own models
- Train models to understand audio, image, and accelerometer data
- Explore TensorFlow Lite for Microcontrollers, Google’s toolkit for TinyML
- Debug applications and provide safeguards for privacy and security
- Optimize latency, energy usage, and model and binary size




Efficient Inference Acceleration


Book Description

Forward progress in computing technology is expected to involve high degrees of heterogeneity and specialization. Emerging applications integrating neural networks are becoming more common, and as a result, the development of specialized hardware for neural network acceleration is increasingly economical. As Moore's law wanes and applications utilizing neural networks benefit from the high-performance, low-power execution provided by widely available specialized hardware, algorithms using neural networks are poised to continue to outpace alternative approaches. This dissertation explores the design space of neural network inference accelerators, spanning from monolithic systolic arrays with off-chip DRAM for weight storage to tiled matrix-vector units with tightly coupled on-chip weight storage that supplies high-bandwidth weights without dependence on off-chip memory. It targets efficient microarchitectural techniques and neural network inference sequencing schemes, and identifies three key design points of interest. The first is a monolithic systolic-array-based accelerator in which pipeline depths are reduced in order to eliminate clocked-element overheads. These optimizations primarily target energy efficiency but also improve performance subject to bandwidth limitations. The accelerator includes the weight permutation considerations required to better support processing convolutional layers on wide arrays using scheduling policies that preserve the temporal locality of weight sub-matrices. The second accelerator uses codebook quantization for both weights and activations to reduce the power associated with both on-chip communication and synapse calculation. Codebook-based quantization and dequantization are tightly integrated into the accelerator datapath, enabling the bulk of on-chip communication to remain in the quantized format. Training experiments are presented to provide insight into training techniques for inference accelerators utilizing codebook quantization of both activations and weights. The third accelerator design considers communication power reduction within a tiled accelerator using temporally coded interconnects for both activations and weights. Tolerance for the latency of the temporal codes is achieved through scheduling schemes that facilitate reuse of temporally communicated values and through buffer capacities provisioned to support these schedules. Within the accelerator with temporally coded links, the residual cost of temporal coding thus appears as performance degradation rather than high power consumption.
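
To make the codebook quantization step concrete, the sketch below builds a small codebook with a toy 1-D k-means, encodes values to codebook indices (the compact format that would travel on-chip), and decodes them just before computation. The codebook size, the k-means construction, and the function names are illustrative assumptions, not the dissertation's actual quantization or training procedure.

```python
import numpy as np

def build_codebook(values, k=16, iters=20):
    """Toy 1-D k-means codebook (k entries => log2(k)-bit indices on-chip)."""
    codebook = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - codebook[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                codebook[c] = values[idx == c].mean()
    return codebook

def encode(values, codebook):
    """Replace each value by the index of its nearest codebook entry."""
    return np.argmin(np.abs(values[:, None] - codebook[None, :]), axis=1)

def decode(indices, codebook):
    """Dequantize indices just before (or inside) the MAC datapath."""
    return codebook[indices]

w = np.random.randn(1024).astype(np.float32)
cb = build_codebook(w, k=16)                 # 16 entries -> 4-bit indices
w_q = encode(w, cb)                          # what travels on-chip
print(np.mean((decode(w_q, cb) - w) ** 2))   # small reconstruction error
```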