Large-scale Graph Analysis: System, Algorithm and Optimization


Book Description

This book introduces readers to a workload-aware methodology for large-scale graph algorithm optimization in graph-computing systems, and proposes several optimization techniques that can enable these systems to handle advanced graph algorithms efficiently. More concretely, it proposes a workload-aware cost model to guide the development of high-performance algorithms. On the basis of the cost model, the book subsequently presents a system-level optimization resulting in a partition-aware graph-computing engine, PAGE. In addition, it presents three efficient and scalable advanced graph algorithms – the subgraph enumeration, cohesive subgraph detection, and graph extraction algorithms. This book offers a valuable reference guide for junior researchers, covering the latest advances in large-scale graph analysis; and for senior researchers, sharing state-of-the-art solutions based on advanced graph algorithms. In addition, all readers will find a workload-aware methodology for designing efficient large-scale graph algorithms.




Distributed Graph Analytics


Book Description

This book brings together two important trends: graph algorithms and high-performance computing. Efficient and scalable execution of graph processing applications in data or network analysis requires innovations at multiple levels: algorithms, associated data structures, their implementation and tuning to a particular hardware. Further, programming languages and the associated compilers play a crucial role when it comes to automating efficient code generation for various architectures. This book discusses the essentials of all these aspects. The book is divided into three parts: programming, languages, and their compilation. The first part examines the manual parallelization of graph algorithms, revealing various parallelization patterns encountered, especially when dealing with graphs. The second part uses these patterns to provide language constructs that allow a graph algorithm to be specified. Programmers can work with these language constructs without worrying about their implementation, which is the focus of the third part. Implementation is handled by a compiler, which can specialize code generation for a backend device. The book also includes suggestive results on different platforms, which illustrate and justify the theory and practice covered. Together, the three parts provide the essential ingredients for creating a high-performance graph application. The book ends with a section on future directions, which offers several pointers to promising topics for future research. This book is intended for new researchers as well as graduate and advanced undergraduate students. Most of the chapters can be read independently by those familiar with the basics of parallel programming and graph algorithms. However, to make the material more accessible, the book includes a brief background on elementary graph algorithms, parallel computing and GPUs. Moreover it presents a case study using Falcon, a domain-specific language for graph algorithms, to illustrate the concepts.




Improving Distributed Graph Processing by Load Balancing and Redundancy Reduction


Book Description

The amount of data generated every day is growing exponentially in the big data era. A significant portion of this data is stored as graphs in various domains, such as online retail and social networks. Analyzing large-scale graphs provides important insights that are highly utilized in many areas, such as recommendation systems, banking systems, and medical diagnosis. To accommodate analysis on large-scale graphs, developers from industry and academia design the distributed graph processing systems. However, processing graphs in a distributed manner suffers performance inefficiencies caused by workload imbalance and redundant computations. For instance, while data centers are trending towards a large amount of heterogeneous processing machines, graph partitioners still operate under the assumption of uniform computing resources. This leads to load imbalance that degrades the overall performance. Even with a balanced data distribution, the irregularity of graph applications can result in different amounts of dynamic operations on each machine in the cluster. Such imbalanced work distribution slows down the execution speed. Besides these, redundancy also impacts the performance of distributed graph analysis. To utilize the available parallelism of computing clusters, distributed graph systems deploy the repeated-relaxing computation model (e.g., Bellman-Ford algorithm variants) rather than in a sequential but work-optimal order. Studies performed in this dissertation show that redundant computations pervasively exist and significantly impact the performance efficiency of distributed graph processing. This dissertation explores novel techniques to reduce the workload imbalance and redundant computations of analyzing large-scale graphs in a distributed setup. It evaluates proposed techniques on both pre-processing and execution modules to enable fair data distribution, lightweight workload balancing, and redundancy optimization for future distributed graph processing systems. The first contribution of this dissertation is the Heterogeneity-aware Partitioning (HAP) that aims to balance load distribution of distributed graph processing in heterogeneous clusters. HAP proposes a number of methodologies to estimate various machines’ computational power on graph analytics. It also extends several state-of-the-art partitioning algorithms for heterogeneity-aware data distribution. Utilizing the estimation of machines’ graph processing capability to guide extended partitioning algorithms can reduce load imbalance when processing a large-scale graph in heterogeneous clusters. This results in significant performance improvement. Another contribution of the dissertation is the Hula, which optimizes the workload balance of distributed graph analytics on the fly. Hula offers a hybrid graph partitioning algorithm to split a large-scale graph in a locality-friendly manner and generate metadata for lightweight dynamic workload balancing. To track machines’ work intensity, Hula inserts hardware timers to count the time spent on the important operations (e.g., computational operations and atomic operations). This information can guide Hula’s workload scheduler to arrange work migration. With the support of metadata generated by the hybrid partitioner, Hula’s migration scheme only requires a minimal amount of data to transfer work between machines in the cluster. Hula’s workload balancing design achieves a lightweight imbalance reduction on the fly. Finally, this dissertation focuses on improving the computational efficiency of distributed graph processing. To do so, it reveals the root cause and the amount of redundant computations in distributed graph processing. SLFE is proposed as a system solution to reduce these redundant operations. SLFE develops a lightweight pre-processing technique to obtain the maximum propagation order of each vertex in a given graph. This information is defined as Redundancy Reduction Guidance (RRG) and is utilized by SLFE’s Redundancy Reduction (RR)-aware computing model to prune redundant operations on the fly. Moreover, SLFE provides RRaware APIs to maintain high promgrammablity. These techniques allow the redundancy optimizations of distributed graph processing to become transparent to users




PPDQ-BG


Book Description

In recent years, there has been an explosive growth of the linked data of a global information space that often requires expensive computations to perform big graph analysis and query processing. Graph data represent irregular and unstructured relationships that usually result in a lack of locality so that it is often difficult to extract relevant information from big graphs. Although there have been advances in graph processing in centralized, as well as distributed environments, there was a lack of an efficient method of handling large scale graphs for the task of finding relevant relations from big graphs or partitioning a big graph into several meaningful inter-connected graph partitions. In this thesis, we propose a scalable framework, “Parallel Partition and Distributed Query Processing for Big Graphs” (PPDQ-BG) that aims to achieve a parallel partition of large scale RDF graph and distributed query processing on the partitioned data. In this thesis, we propose a PPDQ-BG framework to make the previous centralized approach, “A Big Graph Analytics Framework for Knowledge Discovery”, a distributed model by proposing new partitioning algorithms. The proposed framework also has parallel computation of relevant information by determining neighborhood relationships among predicates of large RDF graphs in a parallel manner by computing the similarity of predicates in large scale graphs. It has the design of parallel algorithms for partitioning of graphs in a distributed dataflow framework by proposing new clustering algorithms, Similarity based Fuzzy C-Means Partitioning and Hierarchical Predicate Based Clustering. The framework also has implementation of an interactive tool using Neo4j graph databases for executing a distributed query process from partitioned graphs, experimental evaluations including correlation coefficient matrix evaluations to validate the proposed framework and comparison of proposed partitioning methods with existing partitioning algorithms using multiple datasets including medical ontology datasets, DBPedia, YAGO, and Bio2RDF datasets, experimental results of distributed query processing for efficient data retrieval for various complex queries against large scale datasets including DBPedia, YAGO, and Bio2RDF datasets.




Graph Partitioning


Book Description

Graph partitioning is a theoretical subject with applications in many areas, principally: numerical analysis, programs mapping onto parallel architectures, image segmentation, VLSI design. During the last 40 years, the literature has strongly increased and big improvements have been made. This book brings together the knowledge accumulated during many years to extract both theoretical foundations of graph partitioning and its main applications.




Algorithm Engineering


Book Description

Algorithm Engineering is a methodology for algorithmic research that combines theory with implementation and experimentation in order to obtain better algorithms with high practical impact. Traditionally, the study of algorithms was dominated by mathematical (worst-case) analysis. In Algorithm Engineering, algorithms are also implemented and experiments conducted in a systematic way, sometimes resembling the experimentation processes known from fields such as biology, chemistry, or physics. This helps in counteracting an otherwise growing gap between theory and practice.




Compiler and System for Resilient Distributed Heterogeneous Graph Analytics


Book Description

Graph analytics systems are used in a wide variety of applications including health care, electronic circuit design, machine learning, and cybersecurity. Graph analytics systems must handle very large graphs such as the Facebook friends graph, which has more than a billion nodes and 200 billion edges. Since machines have limited main memory, distributed-memory clusters with sufficient memory and computation power are required for processing of these graphs. In distributed graph analytics, the graph is partitioned among the machines in a cluster, and communication between partitions is implemented using a substrate like MPI. However, programming distributed-memory systems are not easy and the recent trend towards the processor heterogeneity has added to this complexity. To simplify the programming of graph applications on such platforms, this dissertation first presents a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. An important runtime parameter to the compiler-generated distributed code is the partitioning policy. We present an experimental study of partitioning strategies for distributed work-efficient graph analytics applications on different CPU architecture clusters at large scale (up to 256 machines). Based on the study we present a simple rule of thumb to select among myriad policies. Another challenge of distributed graph analytics that we address in this dissertation is to deal with machine fail-stop failures, which is an important concern especially for long-running graph analytics applications on large clusters. We present a novel communication and synchronization substrate called Phoenix that leverages the algorithmic properties of graph analytics applications to recover from faults with zero overheads during fault-free execution and show that Phoenix is 24x faster than previous state-of-the-art systems. In this dissertation, we also look at the new opportunities for graph analytics on massive datasets brought by a new kind of byte-addressable memory technology with higher density and lower cost than DRAM such as intel Optane DC Persistent Memory. This enables the design of affordable systems that support up to 6TB of randomly accessible memory. In this dissertation, we present key runtime and algorithmic principles to consider when performing graph analytics on massive datasets on Optane DC Persistent Memory as well as highlight ideas that apply to graph analytics on all large-memory platforms. Finally, we show that our distributed graph analytics infrastructure can be used for a new domain of applications, in particular, embedding algorithms such as Word2Vec. Word2Vec trains the vector representations of words (also known as word embeddings) on large text corpus and resulting vector embeddings have been shown to capture semantic and syntactic relationships among words. Other examples include Node2Vec, Code2Vec, Sequence2Vec, etc (collectively known as Any2Vec) with a wide variety of uses. We formulate the training of such applications as a graph problem and present GraphAny2Vec, a distributed Any2Vec training framework that leverages the state-of-the-art distributed heterogeneous graph analytics infrastructure developed in this dissertation to scale Any2Vec training to large distributed clusters. GraphAny2Vec also demonstrates a novel way of combining model gradients during training, which allows it to scale without losing accuracy




Distributed Graph Analytics


Book Description

This book brings together two important trends: graph algorithms and high-performance computing. Efficient and scalable execution of graph processing applications in data or network analysis requires innovations at multiple levels: algorithms, associated data structures, their implementation and tuning to a particular hardware. Further, programming languages and the associated compilers play a crucial role when it comes to automating efficient code generation for various architectures. This book discusses the essentials of all these aspects. The book is divided into three parts: programming, languages, and their compilation. The first part examines the manual parallelization of graph algorithms, revealing various parallelization patterns encountered, especially when dealing with graphs. The second part uses these patterns to provide language constructs that allow a graph algorithm to be specified. Programmers can work with these language constructs without worrying about their implementation, which is the focus of the third part. Implementation is handled by a compiler, which can specialize code generation for a backend device. The book also includes suggestive results on different platforms, which illustrate and justify the theory and practice covered. Together, the three parts provide the essential ingredients for creating a high-performance graph application. The book ends with a section on future directions, which offers several pointers to promising topics for future research. This book is intended for new researchers as well as graduate and advanced undergraduate students. Most of the chapters can be read independently by those familiar with the basics of parallel programming and graph algorithms. However, to make the material more accessible, the book includes a brief background on elementary graph algorithms, parallel computing and GPUs. Moreover it presents a case study using Falcon, a domain-specific language for graph algorithms, to illustrate the concept s.




Handbook of Big Data Technologies


Book Description

This handbook offers comprehensive coverage of recent advancements in Big Data technologies and related paradigms. Chapters are authored by international leading experts in the field, and have been reviewed and revised for maximum reader value. The volume consists of twenty-five chapters organized into four main parts. Part one covers the fundamental concepts of Big Data technologies including data curation mechanisms, data models, storage models, programming models and programming platforms. It also dives into the details of implementing Big SQL query engines and big stream processing systems. Part Two focuses on the semantic aspects of Big Data management including data integration and exploratory ad hoc analysis in addition to structured querying and pattern matching techniques. Part Three presents a comprehensive overview of large scale graph processing. It covers the most recent research in large scale graph processing platforms, introducing several scalable graph querying and mining mechanisms in domains such as social networks. Part Four details novel applications that have been made possible by the rapid emergence of Big Data technologies such as Internet-of-Things (IOT), Cognitive Computing and SCADA Systems. All parts of the book discuss open research problems, including potential opportunities, that have arisen from the rapid progress of Big Data technologies and the associated increasing requirements of application domains. Designed for researchers, IT professionals and graduate students, this book is a timely contribution to the growing Big Data field. Big Data has been recognized as one of leading emerging technologies that will have a major contribution and impact on the various fields of science and varies aspect of the human society over the coming decades. Therefore, the content in this book will be an essential tool to help readers understand the development and future of the field.