Book Description
While modern GPU graph analytics libraries provide usable programming models and good single-node performance, the memory capacity and computational power of a single GPU are still too limited for analyzing large graphs. Scaling graph analytics is challenging, however, because of the characteristics of graph applications: irregular computation, low computation-to-communication ratios, and limited communication bandwidth on multi-GPU platforms. Maintaining programmability while addressing these challenges is yet another difficulty.

In this work, I target the scalability of graph analytics to multiple GPUs, beginning with multiple GPUs within a single node. Compared to GPU clusters, single-node multi-GPU platforms are easier to manage and program, yet can still serve as good development environments for multi-GPU graph processing. My work targets several aspects of multi-GPU graph analytics: the inputs that graph application programmers provide to the multi-GPU framework; how the graph should be distributed across GPUs; the interaction between local computation and remote communication; what and when to communicate; how to combine received and local data; and when the application should stop. I answer these questions by extending the single-GPU Gunrock graph analytics framework to multiple GPUs, showing that most graph applications scale well in my system. I also show that direction-optimizing breadth-first search (DOBFS) is the most difficult scaling challenge because of its extremely low computation-to-communication ratio.

To address the DOBFS scaling challenge, I demonstrate a DOBFS implementation with an efficient graph representation, local computation, and remote communication, based on the idea of separating high- and low-degree vertices. I particularly target communication costs, using global reduction with bitmasks for high-degree vertices and point-to-point communication for low-degree vertices.
This greatly reduces the overall communication cost and yields good DOBFS scaling on large-scale graphs on more than a hundred GPUs in the Sierra early access system (the test bed for the Sierra supercomputer).

Next, I revisit the design choices I made for the single-node multi-GPU framework in light of recent hardware and software developments, such as better peer GPU access and unified virtual memory. I analyze nine newly developed complex graph applications for the DARPA HIVE program, implemented in the Gunrock framework, and show that they exhibit a wide range of potential scalability. More importantly, the questions of when and how to communicate are more diverse than in the single-node framework. From this analysis, I conclude that future multi-GPU frameworks, whether single- or multi-node, need to be more flexible: instead of communicating only at iteration boundaries, they should support a more flexible, general communication model. I also propose other research directions for future graph processing, including asynchronous computation and communication, specialized graph representations, and heterogeneous processing.
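The hybrid communication scheme for DOBFS described above can be illustrated with a small sketch. This is not the dissertation's actual CUDA/MPI implementation; it is a minimal Python model, with illustrative names throughout, that contrasts the two message formats: a fixed-size bitmask over the high-degree vertices that can be OR-reduced globally, versus sparse per-destination vertex lists for the low-degree vertices.

```python
# Illustrative sketch only (not the actual Gunrock/DOBFS code).
# High-degree vertices: each GPU packs its frontier into a bitmask, and a
# bitwise-OR "allreduce" combines all masks at constant message size.
# Low-degree vertices: updates travel point-to-point as sparse id lists.

def bitmask(vertices, num_high):
    """Pack a set of high-degree vertex ids into an integer bitmask."""
    mask = 0
    for v in vertices:
        assert v < num_high, "only high-degree ids fit in the mask"
        mask |= 1 << v
    return mask

def allreduce_or(masks):
    """Global reduction: OR all per-GPU bitmasks into one frontier mask."""
    result = 0
    for m in masks:
        result |= m
    return result

def unpack(mask, num_high):
    """Recover the vertex ids whose bits are set in a bitmask."""
    return [v for v in range(num_high) if (mask >> v) & 1]

def route_low_degree(updates, num_gpus):
    """Point-to-point routing: each low-degree id goes only to its owner
    GPU (here, a hypothetical owner = id % num_gpus partitioning)."""
    outboxes = [[] for _ in range(num_gpus)]
    for v in updates:
        outboxes[v % num_gpus].append(v)
    return outboxes

# Example: 3 "GPUs", 8 high-degree vertices (ids 0..7).
per_gpu_high = [{0, 3}, {3, 5}, {7}]               # per-GPU frontiers
masks = [bitmask(f, 8) for f in per_gpu_high]
global_frontier = unpack(allreduce_or(masks), 8)   # combined frontier
```

The key point of the scheme is in the message sizes: the bitmask reduction costs a fixed number of bits per high-degree vertex regardless of how many GPUs touch it, while low-degree traffic stays proportional to the (small) number of updates actually destined for each GPU.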