Model Compression for Efficient Machine Learning Inference


Book Description

This dissertation presents model compression methods that make deep learning and machine learning frameworks practical for real-time applications. Starting from conventional compression techniques, such as quantization to reduce bit-widths, we extend to developing novel, compact frameworks through a lossless compression approach. We begin with an extreme network quantization algorithm that compresses a floating-point deep neural network into single-bit representations. Training is done in two rounds to preserve model performance: first on a weight-compressed real-valued network, and then on a bitwise version with the same topology. The pretrained weights of the first round initialize the weights of the bitwise network, in which we redefine the feedforward procedure with bitwise values and operations. Only the bitwise network is deployed for test-time inference, which not only makes it easier to fit on small devices but also speeds up inference through bitwise arithmetic operations. For this study, we aim at compressing a recurrent neural network architecture for single-channel source separation. Applying extreme quantization to this type of network poses additional challenges due to its complex recurrent relations, as quantization noise can accumulate over multiple time frames. We address this by proposing a more delicate solution that incrementally binarizes the model parameters in order to minimize the loss that a sudden introduction of quantization can cause. Because the proposed technique turns only a few randomly chosen parameters into their binary versions at a time, it gives the training procedure a chance to gently adapt to the partly quantized network. Full binarization is eventually achieved by incrementally increasing the fraction of binarized parameters over the iterations.

Binarization can be extended to data compression to provide the same benefits of extreme compression rates and expedited inference speeds using supported algorithms and hardware. As with binarizing model weights, we propose to compress the bit-widths of data down to binary form, with an emphasis on minimizing the loss of information. To this end, we introduce locality-sensitive hash (LSH) functions to reduce the storage overhead while preserving, in the binary codes, the semantic similarity between the high-dimensional data points in Euclidean space. However, given the random nature of LSH projection vectors, a long bit string is required to form discriminative hash codes that can guarantee high precision. In this dissertation, we propose to learn the locality-sensitive hash functions using boosting theory, so that the underlying structure of the data is encoded efficiently into the hash codes. Our adaptive boosting algorithm learns simple logistic regressors as the weak learners. The algorithm differs from AdaBoost in that the projections are trained to minimize the distance between the self-similarity matrix of the hash codes and that of the original data points, rather than the misclassification rate. We evaluate our discriminative hash codes on a source separation problem framed as a similarity search task. Once the hash functions are trained, their binary classification results transform each data point into a bit string, on which simple bitwise operations compute Hamming distances to find the nearest neighbors in the hashed dictionary.
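As a concrete illustration of that last step, the sketch below performs the Hamming-distance nearest-neighbor search over bit-packed hash codes using XOR and popcount. It is a minimal NumPy example: the random projections stand in for the learned boosted hash functions, and all names and sizes are assumptions made for illustration, not taken from the dissertation.

```python
import numpy as np

def pack_bits(codes):
    """Pack 0/1 hash codes of shape (n, n_bits) into uint8 words for compact storage."""
    return np.packbits(codes, axis=1)

# Popcount lookup table for one byte; XOR followed by popcount gives Hamming distance.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_search(query_packed, dictionary_packed, k=5):
    """Return indices of the k dictionary entries closest to the query in Hamming distance."""
    xor = np.bitwise_xor(dictionary_packed, query_packed)   # bits that differ
    distances = POPCOUNT[xor].sum(axis=1)                   # count them per entry
    return np.argsort(distances)[:k]

# Toy usage: random projections stand in for the learned (boosted) hash functions.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))        # hashed dictionary entries
query = data[42] + 0.05 * rng.standard_normal(64)
projections = rng.standard_normal((64, 32))   # 32 hash bits per data point

dict_codes = pack_bits((data @ projections > 0).astype(np.uint8))
query_code = pack_bits((query @ projections > 0).astype(np.uint8)[None, :])
print(hamming_search(query_code, dict_codes))  # entry 42 should rank near the top
```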
Quantization and other model compression methods can achieve good compression rates, but they are applied as post-training procedures that propagate noise and degrade generalization performance. Quantization-aware training helps minimize the accuracy drop by simulating low-precision inference with the same floating-point backpropagation, but there is a limit to how much this fine-tuning procedure can recover. Furthermore, quantized models demand dedicated hardware designs that support bit-level manipulation in memory and computation units in order to reap the benefits of model reduction. We address these generalization and hardware-compatibility issues of model compression methods by improving compact models until they outperform their larger counterparts, a form of lossless compression.

The first approach is personalization, in which small models are fine-tuned to their test-time specificity. Personalized compact models are trained with the original floating-point values and without structural modifications, and do not require any specialized hardware. We aim at use cases for end-user devices in realistic settings, where we often encounter only a few classes within a target domain that tend to recur in the specific environment. Hence, we postulate that a small personalized model suffices to handle this focused subset of the original universal problem. Our goal in this test-time adaptation is to develop a personalized speech enhancement model for edge devices that performs well for the relevant users' voices and surrounding acoustics (e.g., a family-owned smart assistant device). One major challenge for personalization is data shortage, driven by privacy infringement and data leakage concerns. We therefore perform personalized speech enhancement without using clean speech targets of the test speaker, relying instead on a knowledge distillation framework. We distill the denoising results of an overly large teacher model and use them as pseudo targets to train the small student model. Experimental results show that the personalized models outperform larger non-personalized baseline models, demonstrating that personalization achieves model compression with no loss of denoising performance.

Finally, we propose another lossless approach that uses evolutionary algorithms to optimize compact generative adversarial networks. We coordinate the adversarial characteristics with a coevolutionary strategy and evolve a population of models toward high fitness in terms of generative performance and training stability. Our framework exposes individuals not only to varied but also to fitter, stronger adversaries in each generation, in order to learn robust and compact models for efficient and faster inference. The experimental results demonstrate that the proposed coevolutionary strategy produces small generative models capable of outperforming larger counterparts trained under the regular adversarial framework.
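A minimal sketch of the knowledge distillation setup for personalization described above: a frozen, overly large teacher denoises the user's noisy recordings, and its outputs serve as pseudo targets for the small student, so no clean speech from the test speaker is ever needed. This is an illustration of the shape of the training loop only; the module definitions, loss choice, and sizes are hypothetical and not the dissertation's actual models.

```python
import torch
import torch.nn as nn

def distill_step(student, teacher, noisy_batch, optimizer):
    """One self-supervised personalization step: the teacher's denoised
    output acts as the pseudo clean target for the student."""
    with torch.no_grad():                      # teacher stays frozen
        pseudo_clean = teacher(noisy_batch)    # denoising estimate, no clean speech needed
    estimate = student(noisy_batch)
    loss = nn.functional.l1_loss(estimate, pseudo_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical tiny student and large teacher operating on spectrogram frames.
student = nn.Sequential(nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 257))
teacher = nn.Sequential(nn.Linear(257, 1024), nn.ReLU(), nn.Linear(1024, 257)).eval()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
noisy = torch.randn(8, 257)                    # a batch of noisy magnitude frames
print(distill_step(student, teacher, noisy, optimizer))
```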




Co-designing Model Compression Algorithms and Hardware Accelerators for Efficient Deep Learning


Book Description

Over the past decade, machine learning (ML) with deep neural networks (DNNs) has become extremely successful in a variety of application domains including computer vision, natural language processing, and game AI. DNNs are now a primary topic of academic research among computer scientists, and a key component of commercial technologies such as web search, recommendation systems, and self-driving vehicles. However, factors such as the growing complexity of DNN models, the diminished benefits of technology scaling, and the proliferation of resource-constrained edge devices are driving a demand for higher DNN performance and energy efficiency. Consequently, neural network training and inference have begun to shift from commodity general-purpose processors (e.g., CPUs and GPUs) to custom-built hardware accelerators (e.g., FPGAs and ASICs). In line with this trend, there has been extensive research on specialized algorithms and architectures for dedicated DNN processors. Furthermore, the rapid pace of innovation in the DNN algorithm space is mismatched with the time-consuming process of hardware implementation. This has generated increased interest in novel design methodologies and tools which can reduce the human effort and turn-around time of hardware design.

This thesis studies how low-precision quantization and structured matrices can improve the performance and energy efficiency of DNNs running on specialized accelerators. We co-design both the DNN compression algorithms and the accelerator architectures, enabling us to evaluate the impact of our ideas on real hardware. In the process, we examine the use of high-level synthesis tools in reducing the hardware design effort. This thesis represents a cross-domain research effort toward efficient deep learning. First, we propose specialized architectures for accelerating binarized neural networks on FPGA. Second, we study novel high-level synthesis techniques to reduce the manual effort in FPGA accelerator design. Third, we show a fundamental link between group convolutions and circulant matrices, two previously disparate lines of research in DNN compression. Using this insight, we propose HadaNet, an alternative to circulant compression which achieves identical accuracy with asymptotically fewer multiplications. Fourth, we present outlier channel splitting, a technique to improve DNN weight quantization by removing outliers from the weight distribution without arduous retraining. Finally, we show preliminary results on overwrite quantization, a technique which addresses outliers in DNN activation quantization using extremely lightweight architectural extensions to a spatial accelerator template.
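The core identity behind outlier channel splitting can be illustrated in a few lines of NumPy: duplicating the channel that holds an outlier weight and halving both copies leaves the layer's output unchanged while shrinking the extreme value that a quantizer must cover. This is a rough, assumed reconstruction for intuition, not the thesis's implementation; it omits how the duplicated input channel is supplied by the preceding layer and how the split interacts with the quantizer.

```python
import numpy as np

def outlier_channel_split(W, x, num_splits):
    """Duplicate the input channels holding the largest-magnitude weights and
    halve the corresponding columns, so W_new @ x_new == W @ x while the
    largest weight magnitude is halved."""
    W, x = W.copy(), x.copy()
    for _ in range(num_splits):
        # Find the input channel (column) containing the largest |weight|.
        j = np.unravel_index(np.argmax(np.abs(W)), W.shape)[1]
        halved = W[:, j] / 2.0
        W[:, j] = halved
        W = np.concatenate([W, halved[:, None]], axis=1)   # duplicated, halved column
        x = np.concatenate([x, x[j:j + 1]])                 # duplicated input channel
    return W, x

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
W[2, 3] = 10.0                       # plant an outlier weight
x = rng.standard_normal(6)

W_split, x_split = outlier_channel_split(W, x, num_splits=1)
print(np.max(np.abs(W)), np.max(np.abs(W_split)))   # outlier magnitude halved
print(np.allclose(W @ x, W_split @ x_split))         # layer output unchanged
```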




Efficient Machine Learning Acceleration at the Edge


Book Description

My thesis is a result of a confluence of several trends that have emerged in recent years. First, the rapid proliferation of deep learning across the application and hardware landscapes is creating an immense demand for computing power. Second, the waning of Moore's Law is paving the way for domain-specific acceleration as a means of delivering performance improvements. Third, deep learning's inherent error tolerance is reviving long-forgotten approximate computing paradigms. Fourth, latency, energy, and privacy considerations are increasingly pushing deep learning towards edge inference, with its stringent deployment constraints. All of the above have created a unique, once-in-a-generation opportunity for accelerated widespread adoption of new classes of hardware and algorithms, provided they can deliver fast, efficient, and accurate deep learning inference within a tight area and energy envelope.

One approach towards efficient machine learning acceleration that I have explored attempts to push neural network model size to its absolute minimum. 3PXNet (Pruned, Permuted, Packed XNOR Networks) combines two widely used model compression techniques, binarization and sparsity, to deliver usable models as small as a few kilobytes. It uses an innovative combination of weight permutation and packing to create structured sparsity that can be implemented efficiently in both software and hardware. 3PXNet has been deployed as an open-source library targeting microcontroller-class devices with various software optimizations, further improving runtime and storage requirements.

The second line of work I have pursued is the application of stochastic computing (SC). SC is an approximate, stream-based computing paradigm enabling extremely area-efficient implementations of basic arithmetic operations such as multiplication and addition. It has been enjoying a renaissance over the past few years due to its unique synergy with deep learning. On the one hand, SC makes it possible to implement an extremely dense multiply-accumulate (MAC) computational fabric well suited to computing large linear algebra kernels, which are the bread and butter of deep neural networks. On the other hand, those neural networks exhibit immense approximation tolerance, making SC a viable implementation candidate. However, several issues need to be solved to make SC acceleration of neural networks feasible. The area efficiency comes at the cost of long stream-processing latency. The conversion cost between fixed-point and stochastic representations can cancel out the gains from computation efficiency if not managed correctly. These issues raise the question of how to design an accelerator architecture that best exploits SC's benefits while minimizing its shortcomings. To address this, I proposed the ACOUSTIC (Accelerating Convolutional Neural Networks through Or-Unipolar Skipped Stochastic Computing) architecture and its extension, GEO (Generation and Execution Optimized Stochastic Computing Accelerator for Neural Networks). ACOUSTIC is an architecture that tries to maximize SC's compute density to amortize conversion costs and memory accesses, delivering system-level reductions in inference energy and latency. It has been taped out and demonstrated in silicon using a 14nm fabrication process. GEO addresses some of the shortcomings of ACOUSTIC. Through the introduction of a near-memory computation fabric, GEO enables a more flexible selection of dataflows, and a novel progressive buffering scheme unique to SC lowers the reliance on high memory bandwidth.

Overall, my work tries to approach accelerator design from a systems perspective, setting it apart from most recent SC publications, which target point improvements in the computation itself. As an extension to the above line of work, I have explored the combination of SC and sparsity to apply it to new classes of applications and enable further benefits. I have proposed the first SC accelerator that supports weight sparsity, SASCHA (Sparsity-Aware Stochastic Computing Hardware Architecture for Neural Network Acceleration), which can improve performance on pruned neural networks while maintaining throughput when processing dense ones. SASCHA solves a series of unique, non-trivial challenges of combining SC with sparsity. On the other hand, I have also designed an architecture for accelerating event-based camera object tracking, SCIMITAR. Event-based cameras are relatively new imaging devices which only transmit information about pixels that have changed in brightness, resulting in very high input sparsity. SCIMITAR combines SC with computing-in-memory (CIM), and, through a series of architectural optimizations, is able to take advantage of this new data format to deliver low-latency object detection for tracking applications.
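To see why stream length drives both the latency and the accuracy of SC arithmetic, here is a minimal NumPy sketch of unipolar stochastic multiplication, in which a single AND gate per bit pair replaces a fixed-point multiplier. It illustrates only the basic paradigm under assumed parameters and has no connection to the ACOUSTIC, GEO, SASCHA, or SCIMITAR microarchitectures.

```python
import numpy as np

def to_stream(p, length, rng):
    """Unipolar SC encoding: a value p in [0, 1] becomes a random bitstream
    whose probability of a 1 equals p."""
    return (rng.random(length) < p).astype(np.uint8)

def sc_multiply(a, b, length=4096, rng=None):
    """Multiply two values in [0, 1] with one AND per bit pair of two
    independent streams; the product is recovered as the stream's mean."""
    rng = rng or np.random.default_rng()
    sa = to_stream(a, length, rng)
    sb = to_stream(b, length, rng)
    return np.mean(sa & sb)

rng = np.random.default_rng(0)
print(sc_multiply(0.5, 0.8, rng=rng))              # close to 0.4, with stochastic error
print(sc_multiply(0.5, 0.8, length=64, rng=rng))   # shorter stream -> lower latency, noisier result
```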




Efficient Processing of Deep Neural Networks


Book Description

This book provides a structured treatment of the key principles and techniques for enabling efficient processing of deep neural networks (DNNs). DNNs are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Therefore, techniques that enable efficient processing of deep neural networks to improve key metrics (such as energy efficiency, throughput, and latency) without sacrificing accuracy or increasing hardware costs are critical to enabling the wide deployment of DNNs in AI systems. The book includes background on DNN processing; a description and taxonomy of hardware architectural approaches for designing DNN accelerators; key metrics for evaluating and comparing different designs; features of DNN processing that are amenable to hardware/algorithm co-design to improve energy efficiency and throughput; and opportunities for applying new technologies. Readers will find a structured introduction to the field as well as formalization and organization of key concepts from contemporary work that provide insights that may spark new ideas.




Data-constrained Model Compression


Book Description

In recent years, strong progress has been made in compressing compute-heavy machine learning models to enable them to execute in real time on edge devices. Typically, model compression techniques require retraining a model on the original dataset of interest. This is problematic if the original dataset is unavailable due to privacy or legal concerns, or if the model to be compressed was obtained from a third party. We explore the challenges associated with compressing a model in three different data-constrained scenarios. In the first scenario, labels are unavailable. We approach this problem through knowledge distillation, training a smaller model using predictions made by a larger model on unlabeled data. In the second scenario, both data and labels are unavailable. We approach this problem by separately compressing every layer of a pretrained model to obtain a compressed approximation of the original model. Our method is computationally efficient, achieving strong compression rates while maintaining accuracy. In the third scenario, we explore the problem of dynamic, real-time compression after model deployment. We demonstrate a training technique in which we condition a model to achieve high accuracy across a variety of compression levels, allowing for efficient, real-time model selection along the efficiency-accuracy trade-off curve. We present these works to elucidate the challenges associated with data-constrained model compression, and to provide solutions for compressing models in these challenging scenarios.
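As one concrete, hypothetical instance of the second (data-free) scenario, the sketch below compresses each layer of a pretrained model independently via truncated SVD. Low-rank factorization here stands in for whatever per-layer approximation the thesis actually uses; the layer shapes and rank are illustrative assumptions.

```python
import numpy as np

def compress_layer(W, rank):
    """Replace one dense weight matrix with a rank-limited factorization.
    Only the pretrained weights are needed: no data, no labels."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # shape (out, rank)
    B = Vt[:rank, :]                  # shape (rank, in)
    return A, B                       # W is approximated by A @ B

def compress_model(weights, rank):
    """Compress every layer of a pretrained model independently."""
    return [compress_layer(W, rank) for W in weights]

rng = np.random.default_rng(0)
pretrained = [rng.standard_normal((256, 512)), rng.standard_normal((128, 256))]
compressed = compress_model(pretrained, rank=32)

for W, (A, B) in zip(pretrained, compressed):
    err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
    print(f"params {W.size} -> {A.size + B.size}, relative error {err:.3f}")
```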




Efficient Deep Learning


Book Description

Modern machine learning often relies on deep neural networks that are prohibitively expensive in terms of their memory and computational footprint. This, in turn, significantly limits the range of potential applications in which we face non-negligible resource constraints, e.g., real-time data processing, embedded devices, and robotics. In this thesis, we develop theoretically grounded algorithms to reduce the size and inference cost of modern, large-scale neural networks. By taking a theoretical approach from first principles, we aim to understand and analytically describe the performance-size trade-offs of deep networks, i.e., their generalization properties. We then leverage these insights to devise practical algorithms for obtaining more efficient neural networks via pruning or compression. Beyond theoretical aspects and the inference-time efficiency of neural networks, we study how compression can yield novel insights into the design and training of neural networks. We investigate the practical aspects of the generalization properties of pruned neural networks beyond simple metrics such as test accuracy. Finally, we show how, in certain applications, pruning neural networks can improve training and hence generalization performance.
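For reference, the simplest pruning baseline against which such theoretically grounded methods are typically compared can be sketched in a few lines. This is plain magnitude pruning in NumPy, an assumed illustration rather than the thesis's algorithm, which comes with the guarantees described above.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the smallest-magnitude weights until the requested
    fraction of parameters has been removed."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    threshold = np.partition(np.abs(W), k - 1, axis=None)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
W_pruned = magnitude_prune(W, sparsity=0.9)
print(np.mean(W_pruned == 0))   # roughly 0.9 of the weights are removed
```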




Model Compression and AutoML for Efficient Click-through Rate Prediction


Book Description

Novel machine learning architectures can adeptly learn to predict user response for recommender systems. However, these architectures are often effective at the cost of large computational and memory footprints. This limits their ability to run on edge devices with limited hardware, such as smartphones, which are a popular deployment target for recommender systems. We address this issue in this thesis by studying how compression of recommender system models can significantly reduce model computation cost and edge-device runtime while preserving prediction accuracy. Furthermore, we present a new compression-based AutoML method for feature set generation in architectures that incorporate explicit feature interactions. This serves as a tool to build efficient recommender system models and is applicable to many state-of-the-art model designs. Applying this AutoML method shows initial gains in model performance.
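To make "explicit feature interactions" concrete, here is a small NumPy sketch of a factorization-machine-style second-order interaction term over per-field embeddings. This generic formulation is an assumption for illustration, not the specific click-through rate architectures studied in the thesis, but it shows why shrinking the feature set shrinks both the interaction computation and the embedding tables.

```python
import numpy as np

def second_order_interactions(embeddings):
    """Factorization-machine-style explicit pairwise interactions: the sum over
    all field pairs of the dot product of their embeddings, computed in
    O(fields * dim) with the standard square-of-sum trick."""
    sum_of_emb = embeddings.sum(axis=0)           # (dim,)
    sum_of_sq = (embeddings ** 2).sum(axis=0)     # (dim,)
    return 0.5 * np.sum(sum_of_emb ** 2 - sum_of_sq)

rng = np.random.default_rng(0)
fields, dim = 20, 16                               # e.g., user, item, and context features
emb = rng.standard_normal((fields, dim))
print(second_order_interactions(emb))

# Keeping fewer fields or smaller embeddings (as a feature-set search might choose)
# reduces both this computation and the embedding storage.
print(second_order_interactions(emb[:8, :8]))
```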




Information Theory, Inference and Learning Algorithms


Book Description

Information theory and inference, taught together in this exciting textbook, lie at the heart of many important areas of modern technology - communication, signal processing, data mining, machine learning, pattern recognition, computational neuroscience, bioinformatics and cryptography. The book introduces theory in tandem with applications. Information theory is taught alongside practical communication systems such as arithmetic coding for data compression and sparse-graph codes for error-correction. Inference techniques, including message-passing algorithms, Monte Carlo methods and variational approximations, are developed alongside applications to clustering, convolutional codes, independent component analysis, and neural networks. Uniquely, the book covers state-of-the-art error-correcting codes, including low-density parity-check codes, turbo codes, and digital fountain codes - the twenty-first-century standards for satellite communications, disk drives, and data broadcast. Richly illustrated, filled with worked examples and over 400 exercises, some with detailed solutions, the book is ideal for self-learning, and for undergraduate or graduate courses. It also provides an unparalleled entry point for professionals in areas as diverse as computational biology, financial engineering and machine learning.




Compression, Generation, and Inference Via Supervised Learning


Book Description

Artificial intelligence and machine learning methods have seen tremendous advances in the past decade, thanks to deep neural networks. Supervised learning methods enable neural networks to effectively approximate low-level functions of human intelligence, such as identifying an object within an image. However, many complex functions of human intelligence are difficult to solve with supervised learning directly: humans can build concise representations of the world (compression), generate works of art based on creative imagination (generation), and infer how others will act from personal experiences (inference). In this dissertation, we focus on machine learning approaches that reduce these complex functions of human intelligence to simpler ones that can be readily solved with supervised learning, thus enabling us to leverage the developments in deep learning. This dissertation comprises three parts: compression, generation, and inference. The first part discusses how we can apply supervised learning to unsupervised representation learning. We develop algorithms that can learn informative representations from large unlabeled datasets while protecting certain sensitive attributes. The second part extends these ideas to learning high-dimensional probabilistic models of unlabeled data. Combined with the insights from the first part, we introduce a generative model suitable for conditional generation under limited supervision. In the third and final part, we present two applications of supervised learning in probabilistic inference methods: (a) optimizing for efficient Bayesian inference algorithms and (b) inferring agents' intent in complex, multi-agent environments. These contributions enable machines to overcome existing limitations of supervised learning in real-world compression, generation, and inference problems.




Deep Learning Applications, Volume 2


Book Description

This book presents selected papers from the 18th IEEE International Conference on Machine Learning and Applications (IEEE ICMLA 2019). It focuses on deep learning networks and their application in domains such as healthcare, security and threat detection, fault diagnosis and accident analysis, and robotic control in industrial environments, and highlights novel ways of using deep neural networks to solve real-world problems. Also offering insights into deep learning architectures and algorithms, it is an essential reference guide for academic researchers, professionals, software engineers in industry, and innovative product developers.