Learning in Energy-Efficient Neuromorphic Computing: Algorithm and Architecture Co-Design


Book Description

Explains current co-design and co-optimization methodologies for building hardware neural networks and algorithms for machine learning applications. This book focuses on how to build energy-efficient hardware for neural networks with learning capabilities, and provides co-design and co-optimization methodologies for building hardware neural networks that can learn. Presenting a complete picture from high-level algorithm to low-level implementation details, Learning in Energy-Efficient Neuromorphic Computing: Algorithm and Architecture Co-Design also covers many fundamentals and essentials in neural networks (e.g., deep learning), as well as hardware implementation of neural networks. The book begins with an overview of neural networks. It then discusses algorithms for utilizing and training rate-based artificial neural networks. Next comes an introduction to various options for executing neural networks, ranging from general-purpose processors to specialized hardware, and from digital accelerators to analog accelerators. A design example on building an energy-efficient accelerator for adaptive dynamic programming with neural networks is also presented. An examination of fundamental concepts and popular learning algorithms for spiking neural networks follows, along with a look at the hardware for spiking neural networks. Then comes a chapter offering readers three design examples (two of which are based on conventional CMOS, and one on emerging nanotechnology) that implement the learning algorithm found in the previous chapter. The book concludes with an outlook on the future of neural network hardware.
Includes a cross-layer survey of hardware accelerators for neuromorphic algorithms. Covers the co-design of architecture and algorithms with emerging devices for much-improved computing efficiency. Focuses on the co-design of algorithms and hardware, which is especially critical for using emerging devices, such as traditional memristors or diffusive memristors, for neuromorphic computing.
Learning in Energy-Efficient Neuromorphic Computing: Algorithm and Architecture Co-Design is an ideal resource for researchers, scientists, software engineers, and hardware engineers dealing with ever-increasing requirements on power consumption and response time. It is also excellent for teaching and training undergraduate and graduate students about the latest generation of neural networks with powerful learning capabilities.




Efficient Processing of Deep Neural Networks


Book Description

This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics (such as energy efficiency, throughput, and latency) without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as a formalization and organization of key concepts from contemporary work that may spark new ideas.




Energy Efficiency and Robustness of Advanced Machine Learning Architectures


Book Description

Machine Learning (ML) algorithms have shown a high level of accuracy, and ML applications are widely used in many systems and platforms. However, developing efficient ML-based systems requires addressing three problems: energy efficiency, robustness, and the fact that existing techniques typically optimize for a single objective or a limited set of goals. This book tackles these challenges by exploiting the unique features of advanced ML models, and investigates cross-layer concepts and techniques that engage both hardware-level and software-level methods to build robust and energy-efficient architectures for these advanced ML networks. More specifically, this book improves the energy efficiency of complex models like CapsNets through a specialized flow of hardware-level designs and software-level optimizations, exploiting the application-driven knowledge of these systems and their error tolerance through approximations and quantization. This book also improves the robustness of ML models, in particular for SNNs executed on neuromorphic hardware, due to their inherent cost-effective features. This book integrates multiple optimization objectives into specialized frameworks for jointly optimizing the robustness and energy efficiency of these systems. This is an important resource for students and researchers of computer and electrical engineering who are interested in developing energy-efficient and robust ML.







Design of High-performance and Energy-efficient Accelerators for Convolutional Neural Networks


Book Description

Deep neural networks (DNNs) have gained significant traction in artificial intelligence (AI) applications over the past decade owing to a drastic increase in their accuracy. This huge leap in accuracy, however, translates into a sizable model and high computational requirements, which resource-limited mobile platforms struggle with. Embedding AI inference into various real-world applications requires the design of high-performance, area-efficient, and energy-efficient accelerator architectures. In this work, we address the problem of inference accelerator design for dense and sparse convolutional neural networks (CNNs), a type of DNN which forms the backbone of modern vision-based AI systems. We first introduce a fully dense accelerator architecture referred to as the NeuroMAX accelerator. Most traditional dense CNN accelerators rely on single-core, linear processing elements (PEs), in conjunction with 1D dataflows, for accelerating the convolution operations in a CNN. This limits the maximum achievable ratio of peak throughput per PE count to unity. Most past works optimize their dataflows to attain close to 100% hardware utilization to reach this ratio. In the NeuroMAX accelerator, we design a high-throughput, multi-threaded, log-based PE core. The designed core provides a 200% increase in peak throughput per PE count while only incurring a 6% increase in the hardware area overhead compared to a single, linear multiplier PE core with the same output bit precision. The NeuroMAX accelerator also uses a 2D weight broadcast dataflow which exploits the multi-threaded nature of the PE cores to achieve a high hardware utilization per layer for various dense CNN models. Sparse convolutional neural network models reduce the massive compute and memory bandwidth requirements inherently present in dense CNNs without a significant loss in accuracy.
Designing sparse accelerators for the processing of sparse CNN models, however, is much more challenging than designing dense CNN accelerators. The micro-architecture design, the design of sparse PEs, the load-balancing issues, and the system-level architectural design issues for processing the entire sparse CNN model are some of the key technical challenges that need to be addressed in order to design a high-performance and energy-efficient sparse CNN accelerator architecture. We break this problem down into two parts. In the first part, using some of the concepts from the dense NeuroMAX accelerator, we introduce SparsePE, a multi-threaded, flexible PE capable of handling both dense and sparse CNN model computations. The SparsePE core uses the binary mask representation to actively skip ineffective sparse computations involving zeros, and favors valid, non-zero computations, thereby drastically increasing the effective throughput and the hardware utilization of the core compared to a dense PE core. In the second part, we generate a two-dimensional (2D) mesh architecture of the SparsePE cores, which we refer to as the Phantom accelerator. We also propose a novel dataflow that supports processing of all layers of a CNN, including unit and non-unit stride convolutions (CONV), and fully-connected (FC) layers. In addition, the Phantom accelerator uses a two-level load balancing strategy to minimize computational idling, thereby further improving the hardware utilization, throughput, and energy efficiency of the accelerator. The performance of the dense and the sparse accelerators is evaluated using a custom-built cycle-accurate performance simulator and compared against recent works. Logic utilization on hardware is also compared against prior works. Finally, we conclude by mentioning some more techniques for accelerating CNNs and presenting some other avenues where the proposed work can be applied.
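
As a rough illustration of the binary mask idea mentioned above, the following Python sketch stores each vector as a bit mask plus a packed list of its non-zero values, and performs a multiply only where both masks are set. It is a software analogy written under stated assumptions, not the SparsePE design itself: the function names `to_mask_format` and `sparse_dot` and the example data are hypothetical.

```python
def to_mask_format(values):
    """Compress a dense vector into (binary mask, packed non-zero values)."""
    mask = [1 if v != 0 else 0 for v in values]
    packed = [v for v in values if v != 0]
    return mask, packed


def sparse_dot(weights, activations):
    """Dot product that multiplies only where both operands are non-zero.

    Returns (result, effective multiply count, dense multiply count).
    Hypothetical helper for illustration; not taken from the book.
    """
    w_mask, w_packed = to_mask_format(weights)
    a_mask, a_packed = to_mask_format(activations)
    acc = 0
    effective = 0
    w_idx = a_idx = 0  # read positions in the packed value lists
    for wm, am in zip(w_mask, a_mask):
        if wm and am:  # both operands non-zero: a valid computation
            acc += w_packed[w_idx] * a_packed[a_idx]
            effective += 1
        # Advance each packed pointer only past its own non-zero entries.
        w_idx += wm
        a_idx += am
    return acc, effective, len(weights)


weights = [0, 3, 0, -2, 0, 0, 5, 1]
activations = [4, 0, 0, 7, 2, 0, 1, 0]
result, effective, dense = sparse_dot(weights, activations)
print(result, effective, dense)
```

In this toy input only two of the eight positions have a non-zero weight and a non-zero activation, so six multiplies are skipped. A hardware PE would evaluate the mask AND in parallel rather than in a loop, which this sketch does not model.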




Vision on the Edge


Book Description







All-digital Time-domain CNN Engine for Energy Efficient Edge Computing


Book Description

Machine learning (ML) is finding applications in a wide variety of areas ranging from autonomous cars to genomics. Machine learning tasks such as image classification, speech recognition, and object detection are used in most modern computing systems. In particular, Convolutional Neural Networks (CNNs, a class of artificial neural networks) are extensively used for many such ML applications, due to their state-of-the-art classification accuracy at a much lower complexity than their fully connected counterparts. However, the CNN inference process requires intensive compute and memory resources, making it challenging to implement in energy-constrained edge devices. The major operation of a CNN is the Multiply and Accumulate (MAC) operation. These operations are traditionally performed by digital adders and multipliers, which dissipate a large amount of power. In this two-phase work, an energy-efficient time-domain approach is used to perform the MAC operation using the concept of a Memory Delay Line (MDL). Phase I of this work implements the LeNet-5 CNN to classify the MNIST dataset (handwritten digits) and is demonstrated on a commercial 40nm CMOS test chip. Phase II of this work aims to scale up this approach to multi-bit weights and implements the AlexNet CNN to classify images from the 1000-class ImageNet dataset.
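
To give a sense of why MAC operations dominate CNN inference cost, the short Python sketch below counts the MACs in a single convolutional layer. The helper names `conv_output_size` and `conv_macs` are hypothetical, and the example layer shape (a 32x32 single-channel input, six 5x5 filters, stride 1, no padding) is the first convolutional layer of LeNet-5 as commonly described; the configuration used in the work above may differ.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_macs(in_h, in_w, in_ch, out_ch, kernel, stride=1, padding=0):
    """Number of multiply-accumulate operations in one conv layer.

    Each output element needs in_ch * kernel * kernel MACs.
    """
    out_h = conv_output_size(in_h, kernel, stride, padding)
    out_w = conv_output_size(in_w, kernel, stride, padding)
    return out_h * out_w * out_ch * in_ch * kernel * kernel


# Assumed shape: LeNet-5 first conv layer as commonly described.
c1_macs = conv_macs(in_h=32, in_w=32, in_ch=1, out_ch=6, kernel=5)
print(c1_macs)
```

Even this small layer needs over a hundred thousand MACs per input image, which is why replacing power-hungry digital multipliers and adders is attractive for edge devices.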