Multi-node Acceleration for Large-scale GCNs
Gongjian Sun, Mingyu Yan, Duo Wang, Han Li, Wenming Li, Xiaochun Ye, Dongrui Fan (Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China), and Yuan Xie (University of California, Santa Barbara, California, USA).
The paper "Multi-node Acceleration for Large-scale GCNs", submitted by ICT's High-Throughput Computer Research Center, was accepted by IEEE Transactions on Computers for its special issue on Hardware Acceleration of Machine Learning. IEEE Transactions on Computers, a CCF-A tier journal, is a top-tier international journal in computer science.
Graph Convolutional Neural Networks (GCNs) have emerged as a premier paradigm for graph learning by generalizing information encoding to graph topologies that represent complex relationships. GCNs have since been widely applied, and critical fields such as knowledge inference, visual reasoning, traffic prediction, and electronic design automation (EDA) now run as GCN workloads in many data centers. Because the hybrid execution pattern of GCNs is challenging for modern architectures, many novel high-performance and energy-efficient GCN accelerators have been proposed. Yet as graphs grow explosively, single-node accelerators, limited by memory capacity and computation power, can neither hold the graph data itself nor complete the execution of GCNs within a reasonable time. Large-scale GCNs therefore call for a multi-node acceleration system (MultiAccSys).
As an initial part of its Graph Neural Network (GNN) processor cluster design, ICT's High-Throughput Computer Research Center (HTCRC) researched a multi-node acceleration solution for large-scale GCNs. The team first characterized the execution pattern of large-scale GCNs on a MultiAccSys and made two observations: (1) irregular coarse-grained communication patterns cause massive redundant transmissions and off-chip memory accesses; and (2) execution of GCNs in a MultiAccSys is mainly bandwidth-bound and latency-tolerant. Based on these observations, the team proposed MultiGCN, the first MultiAccSys for large-scale GCNs that trades network latency for network bandwidth, as shown in Figure 1. Leveraging the tolerance to network latency, MultiGCN uses a topology-aware multicast mechanism with a one-put-per-multicast message-passing model to reduce transmissions and alleviate network bandwidth requirements, as shown in Figure 2. It also introduces a scatter-based round execution mechanism that cooperates with the multicast mechanism and reduces redundant off-chip memory accesses, as shown in Figure 3.
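The effect of removing redundant transmissions can be illustrated with a small counting exercise. The sketch below is not the MultiGCN implementation: it assumes a toy graph whose vertices are assigned to nodes by a hypothetical `owner` mapping, and it only compares two message counts. The first sends a source vertex's feature vector once per remote edge; the second sends it once per remote node that needs it, which is the kind of saving a multicast scheme targets. Network topology, routing, and the round execution mechanism are not modeled, and the function name `count_messages` is invented for this example.

```python
from collections import defaultdict


def count_messages(edges, owner):
    """Count remote feature-vector transfers under two schemes.

    edges: list of (src, dst) vertex pairs (hypothetical toy graph).
    owner: dict mapping each vertex to the node that stores it (assumed).
    Returns (per_edge, per_node):
      per_edge - one transfer for every edge that crosses nodes
      per_node - one transfer per (source vertex, remote node) pair
    """
    per_edge = 0
    # For each source vertex, the set of remote nodes that need its features.
    remote_nodes = defaultdict(set)
    for src, dst in edges:
        if owner[src] != owner[dst]:
            per_edge += 1
            remote_nodes[src].add(owner[dst])
    per_node = sum(len(nodes) for nodes in remote_nodes.values())
    return per_edge, per_node


# Toy example: vertex 0 lives on node 0 and has three neighbors on node 1.
owner = {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 4)]
per_edge, per_node = count_messages(edges, owner)
print(per_edge, per_node)
```

In this toy case, vertex 0 has three neighbors on the same remote node, so sending its features once to that node replaces three per-edge transfers. The gap between the two counts grows with the number of neighbors a vertex has on each remote node.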
Compared to the baseline MultiAccSys, MultiGCN achieves a 4× to 12× speedup while using only 28% to 68% of the energy, and reduces transmissions by 32% and off-chip memory accesses by 73% on average. It achieves a 2.5× to 8× speedup over the state-of-the-art multi-GPU solution and, unlike single-node GCN accelerators, scales to large-scale graphs.
Figure 1 MultiGCN is the first MultiAccSys for large-scale GCNs that trades network latency for network bandwidth
Figure 2 Proposed topology-aware multicast mechanism
Figure 3 Introduction of a scatter-based round execution mechanism
The High-Throughput Computer Research Center (HTCRC) focuses on novel computer architecture for high-throughput data processing in the era of artificial intelligence and the Internet of Everything. HTCRC has built a strong record of accomplishments in high-throughput processors, high-throughput interconnect technologies, superconducting computers, and algorithms and applications for high-throughput computing, among others. HTCRC continues to publish papers in top conferences and journals in computer architecture and systems, including MICRO, HPCA, and IEEE TC. HTCRC's forward-looking research has been influential and has found practical applications in China and abroad.