Four papers, "Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM", "Cambricon-M: a Fibonacci-coded Charge-domain SRAM-based CIM Accelerator for DNN Inference", "Cambricon-C: Efficient 4-bit Matrix Unit via Primitivization", and "TMiner: A Vertex-Based Task Scheduling Architecture for Graph Pattern Mining", authored by researchers from the State Key Laboratory of Processors, Institute of Computing Technology, CAS (hereinafter "the laboratory"), were accepted by MICRO 2024 (the IEEE/ACM International Symposium on Microarchitecture, CCF-A), one of the top-tier international conferences on computer architecture.
The first author of the paper "Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM" is Zhongkai Yu, a master's student at the laboratory. Cambricon-LLM is the first accelerator architecture to support deploying large language models (LLMs) with 70 billion parameters on edge devices. Deploying advanced LLMs on edge devices such as smartphones and robots improves user data privacy and reduces dependence on network connectivity. However, such workloads feature small batch sizes (typically 1) and low computational intensity, posing the dual challenges of memory capacity and bandwidth demand on limited edge resources. To address these issues, Cambricon-LLM adopts a chiplet-based hybrid architecture that integrates an NPU with a dedicated NAND flash chip. First, Cambricon-LLM exploits the high storage density of NAND flash to store the model weights, and augments the flash chip with in-die computation and error-correction capabilities to relieve bandwidth pressure while preserving inference accuracy. Second, Cambricon-LLM employs a hardware-aware slicing technique that lets the NPU and the flash chip cooperate on matrix computations, making balanced use of the hardware resources. Overall, Cambricon-LLM is 22 to 45 times faster than existing flash-offloading techniques, demonstrating the potential of deploying powerful LLMs on edge devices.
Fig. 1. Cambricon-LLM architecture
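The cooperative computation described above can be illustrated with a toy sketch. This is not the paper's implementation: the row-wise split and the `npu_fraction` parameter are hypothetical stand-ins for Cambricon-LLM's hardware-aware slicing, which partitions a matrix-vector product so that each side computes a disjoint slice of the output.

```python
def matvec(W, x):
    """Plain matrix-vector product over Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sliced_matvec(W, x, npu_fraction=0.75):
    """Split the rows of W between two compute sides (hypothetical ratio).

    In the Cambricon-LLM setting, one slice would be computed by the NPU
    and the other inside the flash die; here both run on the host.
    """
    split = int(len(W) * npu_fraction)
    y_npu = matvec(W[:split], x)    # rows assigned to the NPU
    y_flash = matvec(W[split:], x)  # rows assigned to in-flash compute
    return y_npu + y_flash          # concatenating slices recovers W @ x

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
assert sliced_matvec(W, x) == matvec(W, x)  # split is lossless
```

Because the output rows are independent, any row partition yields the exact result; the real design picks the split to balance the NPU's compute throughput against the flash die's internal bandwidth.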
The first author of the paper "Cambricon-M: a Fibonacci-coded Charge-domain SRAM-based CIM Accelerator for DNN Inference" is Hongrui Guo, a PhD student at the laboratory. Cambricon-M is a Fibonacci-coded, charge-domain SRAM computing-in-memory (CIM) architecture. CIM architectures reduce data movement between computing units and memory and therefore have broad application prospects; among them, charge-domain SRAM CIM architectures have become a research hotspot in recent years. Our study finds that the input voltages of the analog-to-digital converters (ADCs), i.e., the output voltages of the SRAM arrays, span a wide dynamic range, so high-resolution ADCs are required to convert the analog voltages into high-bitwidth digital data without accuracy loss. However, these high-resolution ADCs have become an energy bottleneck (accounting for up to 64% of energy), limiting the development of charge-domain SRAM CIM architectures. Cambricon-M uses Fibonacci coding to ensure that the bits adjacent to each '1' in the inputs are both '0', reducing the density of '1's in the operands and narrowing the range of the SRAM array output voltages (i.e., the ADC input voltages). This enables lower-resolution ADCs in charge-domain SRAM CIM architectures, reducing both energy consumption and accuracy loss. In addition, Cambricon-M applies several bit-level sparsity techniques to offset the extra energy and area overhead introduced by the higher bitwidth of Fibonacci coding. Experimental results show that Cambricon-M reduces ADC energy by 68.7% and improves energy efficiency by 3.48x and 1.62x over existing digital-domain and charge-domain accelerators, respectively.
Fig. 2. Cambricon-M architecture
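The "no two adjacent 1s" property comes from Zeckendorf's theorem: every positive integer is the sum of non-consecutive Fibonacci numbers. A minimal sketch of such an encoder (the circuit-level encoding in Cambricon-M is of course more involved) is:

```python
def fibonacci_encode(n):
    """Return the Zeckendorf (Fibonacci-coded) bits of n, MSB first.

    Greedily subtracting the largest Fibonacci number <= n guarantees
    that no two adjacent bits are both 1, which is the property that
    narrows the SRAM array's output-voltage range.
    """
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs):
        if f <= n:
            bits.append(1)
            n -= f
        else:
            bits.append(0)
    while len(bits) > 1 and bits[0] == 0:  # strip leading zeros
        bits.pop(0)
    return bits

code = fibonacci_encode(45)  # 45 = 34 + 8 + 3
assert all(a + b < 2 for a, b in zip(code, code[1:]))  # no adjacent 1s
```

Note the trade-off the paper addresses: Fibonacci-coded operands need more bits than binary (the base grows by the golden ratio, about 1.618, rather than 2), which is why Cambricon-M pairs the coding with bit-level sparsity techniques.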
The first author of the paper "Cambricon-C: Efficient 4-bit Matrix Unit via Primitivization" is Yi Chen, a master's student at the laboratory. Cambricon-C is an accelerator design for low-precision matrix multiplication. To cope with ever-growing model sizes, deep neural networks (DNNs) are often quantized to low-precision data before deployment. However, as the data bitwidth shrinks, the power-efficiency returns of a traditional matrix unit built from multiply-accumulate (MAC) units diminish, because of repeated computation on identical operand pairs and high-bitwidth numerical additions. To overcome this problem, Cambricon-C proposes primitivized matrix multiplication (PMM). PMM reduces the multiply-accumulate operation to a unary operation, counting, which avoids the repeated multiplications and reduces the strength of the additions. Furthermore, Cambricon-C exploits quarter-square multiplication to optimize the hardware design and obtain practical benefits. The Cambricon-C systolic array saves approximately 1.95x power compared with a MAC-based systolic array, and the complete Cambricon-C accelerator saves 1.13-1.25x power compared with a traditional TPU-style accelerator across various DNNs.
Fig. 3. A primitivized matrix multiplication hardware implementation with 256 counters: 1. output-stationary systolic array; 2-5. each (activation, weight) pair triggers its counter to increment; 6. the final result is computed from the counter values and a precomputed lookup table.
The first author of the paper "TMiner: A Vertex-Based Task Scheduling Architecture for Graph Pattern Mining" is Zerun Li, a Ph.D. student at the laboratory. TMiner is a graph pattern mining (GPM) accelerator based on near-memory computing. GPM searches a graph for subgraphs isomorphic to a given pattern and is widely used in fields such as bioinformatics, network security, and social network analysis. Existing GPM systems suffer from redundant memory accesses at multiple levels: 1) multiple computing units access the same data simultaneously, causing redundant DRAM or last-level cache accesses and shrinking the effective cache capacity; 2) the same computing unit repeatedly accesses the same data at different times, causing redundant DRAM accesses; 3) a computing task splits the access to a complete set into accesses to multiple subsets, causing redundant DRAM accesses and destroying the continuity of memory access. At design time, TMiner defines tasks at a fine granularity based on their memory-access behavior; combined with its task-partitioning strategy, this ensures that the memory accesses of different computing units are disjoint, eliminating redundant accesses between computing units. At compile time, the accesses to multiple subsets are merged into a single access to their superset, eliminating redundant accesses within a computing task. At runtime, TMiner dynamically merges tasks with the same data requirements, eliminating redundant accesses within a computing unit. Evaluations on social-network, citation-network, and other datasets show that TMiner reduces redundant memory accesses by 90% and achieves an average performance improvement of 3.5x.
Fig. 4. TMiner architecture. (a) Overall design, (b) PE architecture, (c) Controller design, (d) Task queue design, (e) Computing unit design.
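The runtime merging idea can be illustrated with a simplified sketch (an assumed model, not TMiner's actual scheduler): tasks that need the same vertex's adjacency list are grouped so the list is fetched from memory once and reused, instead of once per task.

```python
from collections import defaultdict

def schedule(tasks):
    """Group tasks by the vertex whose adjacency list they need."""
    groups = defaultdict(list)
    for task_id, vertex in tasks:
        groups[vertex].append(task_id)
    return dict(groups)

def run(tasks, adjacency):
    """Execute grouped tasks; count how many adjacency-list fetches occur."""
    fetches = 0
    results = {}
    for vertex, task_ids in schedule(tasks).items():
        neighbors = adjacency[vertex]  # one memory fetch serves the whole group
        fetches += 1
        for t in task_ids:
            results[t] = neighbors
    return results, fetches

adj = {0: [1, 2], 1: [0], 2: [0, 3]}
tasks = [("t0", 0), ("t1", 2), ("t2", 0)]
_, fetches = run(tasks, adj)
assert fetches == 2  # three tasks, but only two distinct adjacency lists
```

Without merging, the three tasks would trigger three fetches; grouping by data requirement removes the duplicate, which is the within-unit redundancy the runtime stage eliminates.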
In addition, the paper "SRender: Boosting Neural Radiance Field Efficiency via Sensitivity-Aware Dynamic Precision Rendering", co-authored by researchers from the laboratory and Shanghai Jiao Tong University, was also accepted by MICRO. SRender is the first work to migrate the adaptive rendering techniques of traditional image rendering to neural rendering. SRender proposes a variable-precision quantization scheme based on data sensitivity, together with a data-reordering mechanism combining coarse and fine granularity to address bank conflicts, achieving real-time rendering at 56 FPS.
The MICRO conference covers research advances in microarchitecture, compilers, chips, and systems. Since its first meeting in 1968, MICRO has become a top venue in computer architecture and one of the most important academic conferences in the field worldwide, playing a vital role in promoting research and development in this area. MICRO 2024 received a total of 497 submissions, 113 of which were accepted, for an acceptance rate of 22.7%.