Researchers Weile Jia, Zhuoqiang Guo and Guangming Tan of the State Key Lab of Processors at the Institute of Computing Technology, Chinese Academy of Sciences (ICT-CAS), have made progress on a large-scale direct sparse solver, achieving a peak performance of 64 PFLOPS (5% of the machine's theoretical peak) on the domestically developed new Sunway supercomputer. The optimized solver can scale to the entire exascale supercomputer and reaches unprecedented computational efficiency compared with current state-of-the-art methods. Computational speed increased by three orders of magnitude, making it possible for the first time to simulate complex metallic heterostructures of 2.5 million atoms. An important step in the field of domain-specific sparse matrix solvers, this work provides insight for the codesign of post-exascale supercomputers. The collaborative paper, "2.5 million-atom ab initio electronic-structure simulation of complex metallic heterostructures with DGDFT," was among the finalists for the 2022 Gordon Bell Prize at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22).
Although supercomputers now reach peak performance at the EFLOPS level (10^18 floating-point operations per second), and the performance of dense matrix floating-point calculations such as LINPACK has increased correspondingly, memory bandwidth per floating-point operation (the byte/FLOP ratio) has not kept pace. The resulting gap between computing performance and memory access bandwidth presents challenges for sparse matrix calculations. For example, the computing efficiency of the HPCG sparse matrix benchmark on supercomputers worldwide is generally below 3% of peak, with the highest result to date, 16 PFLOPS, achieved by the Fugaku supercomputer in Kobe, Japan. External conditions and technological issues generally limit the HPCG efficiency of domestically developed supercomputers to less than 1%. The key challenges are indirect memory access, limited memory bandwidth, and low cache hit rates.
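The link between the byte/FLOP gap and low sparse-matrix efficiency can be illustrated with a simple roofline-style bound. The sketch below is only an illustration: the node's peak rate and bandwidth are hypothetical values rather than figures for Sunway or Fugaku, and the arithmetic intensities are rough assumptions (a sparse matrix-vector product in the common CSR format performs 2 operations per stored nonzero while moving at least 12 bytes).

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline bound: the lower of the compute limit and the memory limit."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)

# Hypothetical node, chosen only for illustration (not a real machine).
PEAK = 10e12        # 10 TFLOPS peak compute rate
BANDWIDTH = 500e9   # 500 GB/s memory bandwidth

# CSR sparse matrix-vector product: each stored nonzero costs 2 flops
# (multiply + add) and moves at least 12 bytes
# (8-byte value + 4-byte column index), ignoring vector traffic.
SPMV_INTENSITY = 2 / 12

# A well-blocked dense matrix product reuses each byte many times;
# 50 flops/byte is an assumed value.
GEMM_INTENSITY = 50.0

spmv_eff = attainable_flops(PEAK, BANDWIDTH, SPMV_INTENSITY) / PEAK
gemm_eff = attainable_flops(PEAK, BANDWIDTH, GEMM_INTENSITY) / PEAK

print(f"Sparse (SpMV) upper bound: {spmv_eff:.1%} of peak")
print(f"Dense (GEMM) upper bound:  {gemm_eff:.1%} of peak")
```

Under these assumed numbers the sparse kernel is limited by memory bandwidth to under 1% of peak, while the dense kernel can in principle run at full speed, which is the same qualitative pattern as the LINPACK and HPCG results described above.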
The high-performance large-scale direct sparse matrix solver takes advantage of the block-sparse format of the Hamiltonian matrices. Using this data structure, the team implemented an efficient algorithm on the new Sunway supercomputer that enhances program locality, reduces memory access, converts indirect memory access to direct memory access, and improves computational parallelism. These changes effectively alleviate the problems caused by the new Sunway's low memory bandwidth and high memory access latency. Large-scale network communication was also optimized to match the characteristics of the network architecture. The resulting large-scale sparse matrix solver achieved 5% of peak performance (64 PFLOPS) on nearly 100,000 nodes of the new Sunway supercomputer. The simulation of complex metallic heterostructures with 2.5 million atoms was realized by combining the first-principles calculation software DGDFT with the direct sparse matrix solver, as detailed in the team's paper.
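How a block-sparse layout turns indirect access into direct access can be seen in a small matrix-vector product. The following is a minimal pure-Python sketch and not the team's implementation: the function names, the block layout, and the 4x4 example matrix are all hypothetical, and a matrix-vector product stands in for the far more involved direct solver. The element-wise CSR version looks up a column index for every nonzero, while the block version looks up one index per dense block and then reads a contiguous slice of the vector.

```python
def csr_matvec(indptr, indices, data, x):
    """CSR product: one indirect lookup x[indices[k]] per stored nonzero."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def block_matvec(block_rows, b, x):
    """Block-sparse product: one column index per b-by-b dense block.

    block_rows[i] is a list of (block_col, block) pairs, where block is a
    b-by-b list of lists. Inside a block, x is read as a contiguous slice.
    """
    y = [0.0] * (len(block_rows) * b)
    for bi, blocks in enumerate(block_rows):
        for bj, block in blocks:
            xs = x[bj * b:(bj + 1) * b]  # direct, contiguous access
            for r in range(b):
                y[bi * b + r] += sum(block[r][c] * xs[c] for c in range(b))
    return y


# Hypothetical 4x4 matrix made of three nonzero 2x2 blocks:
# two on the block diagonal and one off the diagonal.
b = 2
block_rows = [
    [(0, [[4.0, 1.0], [1.0, 3.0]]), (1, [[0.5, 0.0], [0.0, 0.5]])],
    [(1, [[2.0, 1.0], [1.0, 2.0]])],
]

# The same matrix stored element by element in CSR form.
indptr = [0, 3, 6, 8, 10]
indices = [0, 1, 2, 0, 1, 3, 2, 3, 2, 3]
data = [4.0, 1.0, 0.5, 1.0, 3.0, 0.5, 2.0, 1.0, 1.0, 2.0]

x = [1.0, 2.0, 3.0, 4.0]
y_csr = csr_matvec(indptr, indices, data, x)
y_block = block_matvec(block_rows, b, x)
print(y_csr, y_block)
```

In this toy case the CSR form needs ten column indices and the block form needs three. With larger blocks the saving grows, and the inner loops become small dense kernels that are friendlier to caches and vector units.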
This research provides a new approach to designing domain-specific large-scale parallel solvers from the application side for domestic supercomputers. The corresponding authors of this Gordon Bell Prize finalist paper are Prof. An Hong, Associate Professor Jia Weile, and Prof. Yang Jinlong. Professor Hu Wei, PhD students Guo Zhuoqiang and Jiang Qingcai, and ICT Associate Professor Qin Xinming are the co-first authors. This research was jointly funded by the National Natural Science Foundation, the National Key R&D Program, and other projects.