Book Description
As machine learning rapidly progresses, convolutional neural networks (CNNs) have emerged as a successful, although computationally intensive, approach, in part because of their ability to recognize spatial features. The main computation in a CNN is the multiply-and-accumulate (MAC) operation, in which two matrices are multiplied element-wise and the products are summed, corresponding to the Frobenius inner product of the two matrices. Because the MAC operation dominates the computation, making it more efficient significantly improves the efficiency of the whole network, so the design of the MAC engine is crucial. This thesis explores a near-memory time-domain MAC engine for convolutional neural networks. Time-domain computing is chosen for efficiency because it allows a compact representation of multi-bit inputs on a single wire. This reduces the gate count and switching capacitance (Cdyn) of the arithmetic circuit compared to an all-digital implementation. The input features are encoded in time by modulating the pulse width of the feature signal. A delay-line digital-to-time converter (DTC) generates these encoded input features. Local static random-access memory (SRAM) stores the weights, which are then used to gate the input feature pulses. The gated product is passed to a proposed digitally controlled gated ring oscillator (DCGRO) time-to-digital converter (TDC). The DCGRO TDC functions as a time accumulator: partial pulses are stored within the DCGRO, and quantized pulses are tracked in the counter. Because of the digital control, the DCGRO can switch between two operating frequencies, allowing quantization of two pulses in parallel. To speed up the accumulation, partial sums are accumulated and then summed together in the digital domain. To support signed accumulation, two time accumulators are used, and products are routed to one or the other depending on the sign of the weight read from memory.
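The signal flow described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit in the thesis: time is measured in integer multiples of a hypothetical DTC delay step, weights are assumed to be ternary (-1, 0, +1) so that a weight can simply gate or steer a pulse, and the names `dtc_encode`, `TimeAccumulator`, and `time_domain_mac` are invented for illustration. The dual-frequency operation of the DCGRO and the digital summation of partial sums are not modeled.

```python
import numpy as np

def dtc_encode(features, unit_delay=1):
    """Model of a delay-line DTC: map each unsigned feature code to a
    pulse width, in integer time units (unit_delay is an assumed step)."""
    return np.asarray(features, dtype=np.int64) * unit_delay

class TimeAccumulator:
    """Model of a gated-ring-oscillator TDC used as a time accumulator.

    The counter holds the quantized part of the accumulated time, and the
    residue stands in for the partial pulse kept in the oscillator state.
    """
    def __init__(self, period=1):
        self.period = period  # time units per counted cycle (assumed)
        self.count = 0
        self.residue = 0

    def add_pulse(self, width):
        total = self.residue + int(width)
        self.count += total // self.period
        self.residue = total % self.period

def time_domain_mac(features, weights, period=1):
    """Signed MAC with two accumulators, steered by the weight sign."""
    pulses = dtc_encode(features).ravel()
    pos, neg = TimeAccumulator(period), TimeAccumulator(period)
    for width, w in zip(pulses, np.asarray(weights).ravel()):
        if w > 0:
            pos.add_pulse(width)   # positive weight: first accumulator
        elif w < 0:
            neg.add_pulse(width)   # negative weight: second accumulator
        # w == 0: the pulse is gated off and nothing is accumulated
    return pos.count - neg.count

features = np.array([[3, 0, 7], [1, 5, 2], [4, 4, 6]])
weights = np.array([[1, -1, 0], [0, 1, -1], [-1, 1, 1]])

# Reference: Frobenius inner product (element-wise multiply, then sum).
reference = int(np.sum(features * weights))
print(time_domain_mac(features, weights), reference)
```

With `period=1` the model reproduces the Frobenius inner product exactly. A larger `period` shows the effect of a coarser TDC step: each accumulator only counts whole periods, and the remainder stays in its residue.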
The proposed design is implemented in a 28 nm process. For 5-bit input precision, the proposed design achieves an energy efficiency of 4.6 TOPS/W and a throughput of 819 GOPS at 900 mV. For 8-bit input precision, the energy efficiency is estimated to be 854 GOPS/W, and the throughput is estimated to be 102 GOPS.
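The reported figures can be related to each other with simple arithmetic. The sketch below assumes that both metrics count operations the same way and were measured at the same operating point. The abstract does not state either, so the derived power and energy-per-operation values are illustrative estimates rather than results reported by the thesis. The function names are hypothetical.

```python
def implied_power_watts(throughput_ops_per_s, efficiency_ops_per_joule):
    """Power implied by a throughput and an efficiency figure.

    1 OPS/W equals 1 operation per joule, so power = throughput / efficiency.
    """
    return throughput_ops_per_s / efficiency_ops_per_joule

def energy_per_op_joules(efficiency_ops_per_joule):
    """Energy per operation is the reciprocal of the efficiency."""
    return 1.0 / efficiency_ops_per_joule

# Figures quoted in the text above.
cases = {
    "5-bit": {"throughput": 819e9, "efficiency": 4.6e12},
    "8-bit": {"throughput": 102e9, "efficiency": 854e9},
}

for name, c in cases.items():
    power_mw = implied_power_watts(c["throughput"], c["efficiency"]) * 1e3
    energy_pj = energy_per_op_joules(c["efficiency"]) * 1e12
    print(f"{name}: ~{power_mw:.0f} mW implied, ~{energy_pj:.2f} pJ/op")
```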