Architecture Design for Highly Flexible and Energy-efficient Deep Neural Network Accelerators


Book Description

Deep neural networks (DNNs) are the backbone of modern artificial intelligence (AI). However, due to their high computational complexity and diverse shapes and sizes, dedicated accelerators that can achieve high performance and energy efficiency across a wide range of DNNs are critical for enabling AI in real-world applications. To address this, we present Eyeriss, a co-design of software and hardware architecture for DNN processing that is optimized for performance, energy efficiency and flexibility. Eyeriss features a novel Row-Stationary (RS) dataflow to minimize data movement when processing a DNN, which is the bottleneck of both performance and energy efficiency. The RS dataflow supports highly-parallel processing while fully exploiting data reuse in a multi-level memory hierarchy to optimize for the overall system energy efficiency given any DNN shape and size. It achieves 1.4x to 2.5x higher energy efficiency than other existing dataflows. To support the RS dataflow, we present two versions of the Eyeriss architecture. Eyeriss v1 targets large DNNs that have plenty of data reuse. It features a flexible mapping strategy for high performance and a multicast on-chip network (NoC) for high data reuse, and further exploits data sparsity to reduce processing element (PE) power by 45% and off-chip bandwidth by up to 1.9x. Fabricated in a 65nm CMOS, Eyeriss v1 consumes 278 mW at 34.7 fps for the CONV layers of AlexNet, which is 10× more efficient than a mobile GPU. Eyeriss v2 addresses support for the emerging compact DNNs that introduce higher variation in data reuse. It features a RS+ dataflow that improves PE utilization, and a flexible and scalable NoC that adapts to the bandwidth requirement while also exploiting available data reuse. Together, they provide over 10× higher throughput than Eyeriss v1 at 256 PEs. Eyeriss v2 also exploits sparsity and SIMD for an additional 6× increase in throughput.




Efficient Processing of Deep Neural Networks


Book Description

This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics—such as energy-efficiency, throughput, and latency—without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as formalization and organization of key concepts from contemporary work that provide insights that may spark new ideas.




Design Space Exploration and Architecture Design for Inference and Training Deep Neural Networks


Book Description

Deep Neural Networks (DNNs) are widely used in various application domains and achieve remarkable results. However, DNNs require a large number of computations for both the inference and training phases. Hardware accelerators are designed and implemented to compute DNN models efficiently. Many accelerators have been proposed for DNN inference, while only a limited set of DNN training accelerators has been proposed. Almost all of these accelerators are highly custom-designed and limited in the types of networks they can process. This dissertation focuses on designing novel architectures and tools for efficient training of deep neural networks, particularly for edge applications. We proposed several novel architectures and a design space exploration tool. Our proposed architecture can be used for efficient processing of DNNs, and the design space exploration model could help DNN architects explore the design space of DNN architecture design for both inference and training and help home in on the optimal architecture in different hardware constraints in applications. The first area of contribution in this dissertation is the design of Socrates-D-1, a digital multicore on-chip learning architecture for deep neural networks. This processing unit design demonstrates the capability to process the training phase of DNNs efficiently. A statically time-multiplexed routing mechanism and a co-designed mapping method are also introduced to improve overall throughput and energy efficiency. The experimental results show 6.8 to 22.3 times speedup and more than a thousand times energy efficiency over a GPGPU. The proposed architecture is also compared with several DNN training accelerators and achieves the best energy and area efficiencies. The second area of contribution in this dissertation is the design of Socrates-D-2, which is an enhanced version of Socrates-D-1. This architecture presents a novel neural processing unit design. A dual-ported eDRAM memory replaces the double eDRAM memory design used in Socrates-D-1. In addition, a new mapping method utilizing neural network pruning techniques is introduced and evaluated with several datasets. The co-designed mapping methods helped the architecture achieve both throughput and energy efficiency without loss of accuracy. Compared with Socrates-D-1, this new architecture shows an average of 1.2 times higher energy efficiency and 1.25 times better area efficiency. The third area of contribution in this dissertation is the development of TRIM, a design space exploration model for DNN accelerators. TRIM is an infrastructure model and can explore the design space of DNN accelerators for training and inference. It utilizes a very flexible hardware template, which can model a wide range of architectures. TRIM explores the design space of data partition and reuse strategies for each hardware architecture and estimates the optimal time and energy. Our experimental results show that TRIM can achieve more than eighty percent accuracy on time and energy estimations. To the best of our knowledge, TRIM is the first infrastructure to model and explore the design space of DNN accelerators for training and inference. The fourth area of contribution in this dissertation is a set of design space explorations using TRIM. Through several case studies, we explored the design space of DNN accelerators for training and inference. We compared different dataflows and showed the impact of dataflow on efficient processing DNNs. We showed how to use TRIM to optimize the dataflow. We explored the design space of spatial architectures and showed the results of varying different hardware choices. Based on the exploration results, several high throughput and energy-efficient DNN training accelerators were presented. The fifth area of contribution in this dissertation is the design of an FPGA-based training accelerator for edge devices. We designed a CPU-FPGA accelerator that can operate under 5W. TRIM is utilized for dataflow optimization and hardware parameter selection. The experimental results show that we could achieve a 1.93 times speedup and 1.43 times energy efficiency for end-to-end training over a CPU implementation.




Accelerator Architecture for Secure and Energy Efficient Machine Learning


Book Description

ML applications are driving the next computing revolution. In this context both performance and security are crucial. We propose hardware/software co-design solutions for addressing both. First, we propose RNNFast, an accelerator for Recurrent Neural Networks (RNNs). RNNs are particularly well suited for machine learning problems in which context is important, such as language translation. RNNFast leverages an emerging class of non-volatile memory called domain-wall memory (DWM). We show that DWM is very well suited for RNN acceleration due to its very high density and low read/write energy. RNNFast is very efficient and highly scalable, with a flexible mapping of logical neurons to RNN hardware blocks. The accelerator is designed to minimize data movement by closely interleaving DWM storage and computation. We compare our design with a state-of-the-art GPGPU and find 21.8X higher performance with 70X lower energy. Second, we brought ML security into ML accelerator design for more efficiency and robustness. Deep Neural Networks (DNNs) are employed in an increasing number of applications, some of which are safety-critical. Unfortunately, DNNs are known to be vulnerable to so-called adversarial attacks. In general, the proposed defenses have high overhead, some require attack-specific re-training of the model or careful tuning to adapt to different attacks. We show that these approaches, while successful for a range of inputs, are insufficient to address stronger, high-confidence adversarial attacks. To address this, we propose HASI and DNNShield, two hardware-accelerated defenses that adapt the strength of the response to the confidence of the adversarial input. Both techniques rely on approximation or random noise deliberately introduced into the model. HASI uses direct noise injection into the model at inference. DNNShield uses approximation that relies on dynamic and random sparsification of the DNN model to achieve inference approximation efficiently and with fine-grain control over the approximation error. Both techniques use the output distribution characteristics of noisy/sparsified inference compared to a baseline output to detect adversarial inputs. We show an adversarial detection rate of 86% when applied to VGG16 and 88% when applied to ResNet50, which exceeds the detection rate of the state-of-the-art approaches, with a much lower overhead. We demonstrate a software/hardware-accelerated FPGA prototype, which reduces the performance impact of HASI and DNNShield relative to software-only CPU and GPU implementations.




Design of High-performance and Energy-efficient Accelerators for Convolutional Neural Networks


Book Description

Deep neural networks (DNNs) have gained significant traction in artificial intelligence (AI) applications over the past decade owing to a drastic increase in their accuracy. This huge leap in accuracy, however, translates into a sizable model and high computational requirements, something which resource-limited mobile platforms struggle against. Embedding AI inference into various real-world applications requires the design of high-performance, area, and energy-efficient accelerator architectures. In this work, we address the problem of the inference accelerator design for dense and sparse convolutional neural networks (CNNs), a type of DNN which forms the backbone of modern vision-based AI systems. We first introduce a fully dense accelerator architecture referred to as the NeuroMAX accelerator. Most traditional dense CNN accelerators rely on single-core, linear processing elements (PEs), in conjunction with 1D dataflows, for accelerating the convolution operations in a CNN. This limits the maximum achievable ratio of peak throughput per PE count to unity. Most of the past works optimize their dataflows to attain close to 100% hardware utilization to reach this ratio. In the NeuroMAX accelerator, we design a high-throughput, multi-threaded, log-based PE core. The designed core provides a 200% increase in peak throughput per PE count while only incurring a 6% increase in the hardware area overhead compared to a single, linear multiplier PE core with the same output bit precision. NeuroMAX accelerator also uses a 2D weight broadcast dataflow which exploits the multi-threaded nature of the PE cores to achieve a high hardware utilization per layer for various dense CNN models. Sparse convolutional neural network models reduce the massive compute and memory bandwidth requirements inherently present in dense CNNs without a significant loss in accuracy. Designing sparse accelerators for the processing of sparse CNN models, however, is much more challenging compared to the design of dense CNN accelerators. The micro-architecture design, the design of sparse PEs, addressing the load-balancing issues, and the system-level architectural design issues for processing the entire sparse CNN model are some of the key technical challenges that need to be addressed in order to design a high-performance and energy-efficient sparse CNN accelerator architecture. We break this problem down into two parts. In the first part, using some of the concepts from the dense NeuroMAX accelerator, we introduce SparsePE, a multi-threaded, and flexible PE, capable of handling both the dense and sparse CNN model computations. The SparsePE core uses the binary mask representation to actively skip ineffective sparse computations involving zeros, and favors valid, non-zero computations, thereby, drastically increasing the effective throughput and the hardware utilization of the core as compared to a dense PE core. In the second part, we generate a two-dimensional (2D) mesh architecture of the SparsePE cores, which we refer to as the Phantom accelerator. We also propose a novel dataflow that supports processing of all layers of a CNN, including unit and non-unit stride convolutions (CONV), and fully-connected (FC) layers. In addition, the Phantom accelerator uses a two-level load balancing strategy to minimize the computational idling, thereby, further improving the hardware utilization, throughput, as well as the energy efficiency of the accelerator. The performance of the dense and the sparse accelerators is evaluated using a custom-built cycle accurate performance simulator and performance is compared against recent works. Logic utilization on hardware is also compared against the prior works. Finally, we conclude by mentioning some more techniques for accelerating CNNs and presenting some other avenues where the proposed work can be applied.




Data Orchestration in Deep Learning Accelerators


Book Description

This Synthesis Lecture focuses on techniques for efficient data orchestration within DNN accelerators. The End of Moore's Law, coupled with the increasing growth in deep learning and other AI applications has led to the emergence of custom Deep Neural Network (DNN) accelerators for energy-efficient inference on edge devices. Modern DNNs have millions of hyper parameters and involve billions of computations; this necessitates extensive data movement from memory to on-chip processing engines. It is well known that the cost of data movement today surpasses the cost of the actual computation; therefore, DNN accelerators require careful orchestration of data across on-chip compute, network, and memory elements to minimize the number of accesses to external DRAM. The book covers DNN dataflows, data reuse, buffer hierarchies, networks-on-chip, and automated design-space exploration. It concludes with data orchestration challenges with compressed and sparse DNNs and future trends. The target audience is students, engineers, and researchers interested in designing high-performance and low-energy accelerators for DNN inference.













Deep Learning for Computer Architects


Book Description

This is a primer written for computer architects in the new and rapidly evolving field of deep learning. It reviews how machine learning has evolved since its inception in the 1960s and tracks the key developments leading up to the emergence of the powerful deep learning techniques that emerged in the last decade. Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware. It also reviews representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, it also details the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs. The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. As high-performance hardware was so instrumental in the success of machine learning becoming a practical solution, this chapter recounts a variety of optimizations proposed recently to further improve future designs. Finally, it presents a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context.