Book Description
In the era of big data, one of the most significant research areas is cluster computing for large-scale data processing. Many cluster computing frameworks and cluster resource management schemes have recently been developed to satisfy the increasing demand for large-volume data processing. Among them, Apache Hadoop has become the de facto platform, widely adopted in both industry and academia thanks to its scalability, simplicity, and fault tolerance. The original Hadoop platform was designed to closely resemble MapReduce, a programming paradigm for cluster computing proposed by Google. Recently, the Hadoop platform has evolved into its second generation, Hadoop YARN, which serves as a unified cluster resource management layer supporting the multiplexing of different cluster computing frameworks. A fundamental issue in this field is how to efficiently manage and schedule the execution of a large number of data processing jobs given the limited computing resources of a cluster. This dissertation therefore focuses on improving system efficiency and performance for cluster computing platforms, namely Hadoop MapReduce and Hadoop YARN, through the following new scheduling algorithms and resource management schemes. First, we developed a Hadoop scheduler (LsPS) that aims to improve average job response times by leveraging the job size patterns of different users, both to tune resource sharing between users and to choose a suitable scheduling policy for each user. We further presented a self-adjusting slot configuration scheme, named TuMM, for Hadoop MapReduce that improves the makespan of batch jobs. TuMM abandons the static, manual slot configuration of the existing Hadoop MapReduce framework. Instead, using a feedback control mechanism, TuMM dynamically tunes the number of map and reduce slots on each cluster node based on monitored workload information to align the execution of the map and reduce phases.
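To make the TuMM idea concrete, the following is a minimal illustrative sketch, not the dissertation's implementation: each control interval, a fixed per-node slot budget is split between map and reduce slots in proportion to the monitored remaining work of each phase, so that the two phases tend to finish together. The function name, the proportional rule, and the `min_slots` floor are all assumptions for illustration.

```python
def tune_slots(total_slots, pending_map_work, pending_reduce_work,
               min_slots=1):
    """Split total_slots between map and reduce slots in proportion to
    the monitored remaining work of each phase, keeping at least
    min_slots for each phase so neither one starves."""
    total_work = pending_map_work + pending_reduce_work
    if total_work == 0:
        # No monitored backlog: fall back to an even split.
        half = total_slots // 2
        return half, total_slots - half
    map_slots = round(total_slots * pending_map_work / total_work)
    # Clamp so both phases always keep at least min_slots.
    map_slots = max(min_slots, min(total_slots - min_slots, map_slots))
    return map_slots, total_slots - map_slots

# A feedback loop would re-run this on each workload snapshot; early in
# a batch most work is in the map phase, later it shifts to reduce.
for m_work, r_work in [(80, 20), (40, 60), (5, 95)]:
    map_slots, reduce_slots = tune_slots(8, m_work, r_work)
    print(map_slots, reduce_slots)
```

The proportional rule is only one possible controller; the point is that the slot split is recomputed from live measurements instead of being fixed in a static configuration file.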
The second main contribution of this dissertation lies in the development of a new scheduler and resource management scheme for the next-generation Hadoop, i.e., Hadoop YARN. We designed a YARN scheduler, named HaSTE, which can effectively reduce the makespan of MapReduce jobs on the YARN platform by leveraging information about requested resources, resource capacities, and dependencies between tasks. Moreover, we proposed an opportunistic scheduling scheme that reassigns reserved but idle resources to other waiting tasks. The major goal of this scheme is to improve system resource utilization without incurring severe resource contention due to over-provisioning. We implemented all of our resource management schemes in Hadoop MapReduce and Hadoop YARN, and evaluated the effectiveness of these new schedulers and schemes on different cluster systems, including our local clusters and large clusters in the cloud, such as Amazon EC2. Representative benchmarks are used for sensitivity analysis and performance evaluation. Experimental results demonstrate that our new Hadoop/YARN schedulers and resource management schemes successfully improve performance in terms of job response times, job makespan, and system utilization on both the Hadoop MapReduce and Hadoop YARN platforms.
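The opportunistic reassignment of reserved-but-idle resources can be sketched as follows. This is a hypothetical illustration, not the scheme from the dissertation: capacity reserved for a pending task sits idle until its full demand can be met, so a portion of it is lent to waiting tasks whose demands fit, with a cap that bounds over-provisioning when the reservation's owner resumes. The data structures, the smallest-demand-first order, and the `cap` parameter are assumptions.

```python
def opportunistic_assign(idle_reserved, waiting_tasks, cap):
    """Greedily lend idle reserved capacity to waiting tasks.

    idle_reserved: units of reserved-but-idle capacity on a node.
    waiting_tasks: list of (task_id, demand) pairs.
    cap: fraction of the idle reservation allowed to be lent out,
         limiting contention when the reserving task becomes runnable.
    Returns the task_ids granted resources, smallest demands first.
    """
    budget = idle_reserved * cap
    granted = []
    for task_id, demand in sorted(waiting_tasks, key=lambda t: t[1]):
        if demand <= budget:
            budget -= demand
            granted.append(task_id)
    return granted

print(opportunistic_assign(10, [("t1", 3), ("t2", 6), ("t3", 2)], 0.8))
```

Serving the smallest demands first maximizes the number of waiting tasks that make progress on the borrowed capacity; the cap is the knob that trades extra utilization against the risk of contention described above.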