Resource Management in Cluster Computing Platforms for Large Scale Data Processing


Book Description

In the era of big data, one of the most significant research areas is cluster computing for large-scale data processing. Many cluster computing frameworks and cluster resource management schemes have recently been developed to satisfy the increasing demand for large-volume data processing. Among them, Apache Hadoop has become the de facto platform, widely adopted in both industry and academia thanks to its prominent features such as scalability, simplicity, and fault tolerance. The original Hadoop platform was designed to closely resemble the MapReduce framework, a programming paradigm for cluster computing proposed by Google. More recently, the Hadoop platform has evolved into its second generation, Hadoop YARN, which serves as a unified cluster resource management layer that supports multiplexing of different cluster computing frameworks. A fundamental issue in this field is how to efficiently manage and schedule the execution of a large number of data processing jobs given the limited computing resources of a cluster. This dissertation therefore focuses on improving system efficiency and performance for cluster computing platforms, i.e., Hadoop MapReduce and Hadoop YARN, by designing the following new scheduling algorithms and resource management schemes.

First, we developed a Hadoop scheduler, LsPS, which aims to improve average job response times by leveraging the job size patterns of different users, both to tune resource sharing between users and to choose a good scheduling policy for each user. We further presented a self-adjusting slot configuration scheme, named TuMM, for Hadoop MapReduce to improve the makespan of batch jobs. TuMM abandons the static, manual slot configurations of the existing Hadoop MapReduce framework; instead, using a feedback control mechanism, it dynamically tunes the number of map and reduce slots on each cluster node based on monitored workload information in order to align the execution of the map and reduce phases.

The second main contribution of this dissertation lies in the development of a new scheduler and resource management scheme for the next-generation Hadoop, i.e., Hadoop YARN. We designed a YARN scheduler, named HaSTE, which effectively reduces the makespan of MapReduce jobs on the YARN platform by leveraging information about requested resources, resource capacities, and dependencies between tasks. Moreover, we proposed an opportunistic scheduling scheme that reassigns reserved but idle resources to other waiting tasks; its major goal is to improve system resource utilization without incurring severe resource contention due to over-provisioning. We implemented all of our resource management schemes in Hadoop MapReduce and Hadoop YARN and evaluated the effectiveness of these new schedulers and schemes on different cluster systems, including our local clusters and large clusters in the cloud, such as Amazon EC2. Representative benchmarks were used for sensitivity analysis and performance evaluation. Experimental results demonstrate that our new Hadoop/YARN schedulers and resource management schemes successfully improve performance in terms of job response time, job makespan, and system utilization on both the Hadoop MapReduce and Hadoop YARN platforms.
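The slot-alignment idea behind TuMM can be pictured with a minimal sketch. The Python snippet below is purely illustrative and is not the dissertation's TuMM implementation: it assumes a hypothetical per-node controller (SlotTuner) that splits a fixed number of slots between the map and reduce phases in proportion to the monitored remaining work, which is the kind of feedback-driven tuning the description refers to.

    # Illustrative sketch only; not the actual TuMM code. The class and the
    # workload metrics (pending_map_work, pending_reduce_work) are hypothetical.
    class SlotTuner:
        def __init__(self, total_slots):
            self.total_slots = total_slots  # fixed slot capacity of one node

        def tune(self, pending_map_work, pending_reduce_work):
            """Split the node's slots between map and reduce in proportion to
            the monitored remaining work, keeping the two phases aligned."""
            total_work = pending_map_work + pending_reduce_work
            if total_work == 0:
                map_slots = self.total_slots // 2  # nothing queued: even split
            else:
                map_slots = round(self.total_slots * pending_map_work / total_work)
            # Keep at least one slot per phase while that phase still has work.
            if pending_map_work > 0:
                map_slots = max(map_slots, 1)
            if pending_reduce_work > 0:
                map_slots = min(map_slots, self.total_slots - 1)
            return map_slots, self.total_slots - map_slots

    # Example: a node with 8 slots and three times as much map work remaining
    # as reduce work would be configured with 6 map slots and 2 reduce slots.
    tuner = SlotTuner(total_slots=8)
    print(tuner.tune(pending_map_work=300, pending_reduce_work=100))  # (6, 2)

In a real MapReduce 1.x cluster, the monitored workload could come from per-node task statistics reported at each heartbeat, and the controller would re-run periodically; the sketch only shows the proportional-split step.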




Resource Management for Big Data Platforms


Book Description

Serving as a flagship driver towards advanced research in the area of Big Data platforms and applications, this book provides a platform for the dissemination of advanced topics in theory, research efforts and analysis, and implementation-oriented methods, techniques, and performance evaluation. Its 23 chapters present several important formulations of architecture design, optimization techniques, advanced analytics methods, and biological, medical, and social media applications. These chapters discuss the research of members from the ICT COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet). This volume is ideal as a reference for students, researchers, and industry practitioners working in, or interested in joining, interdisciplinary work on intelligent decision systems using emerging distributed computing paradigms. It will also allow newcomers to grasp the key concerns and their potential solutions.







Adaptive Resource Management and Scheduling for Cloud Computing


Book Description

This book constitutes the thoroughly refereed post-conference proceedings of the Second International Workshop on Adaptive Resource Management and Scheduling for Cloud Computing, ARMS-CC 2015, held in conjunction with the ACM Symposium on Principles of Distributed Computing, PODC 2015, in Donostia-San Sebastián, Spain, in July 2015. The 12 revised full papers, including 1 invited paper, were carefully reviewed and selected from 24 submissions. The papers address several important aspects of the problems tackled by ARMS-CC: self-* and autonomous cloud systems, cloud quality management and service-level agreements (SLAs), scalable computing, mobile cloud computing, cloud computing techniques for big data, high-performance cloud computing, resource management in big data platforms, scheduling algorithms for big data processing, cloud composition, federation, bridging, and bursting, cloud resource virtualization and composition, load balancing and co-allocation, and fault tolerance, reliability, and availability of cloud systems.




Transactions on Large-Scale Data- and Knowledge-Centered Systems XLIV


Book Description

The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core and hot topics in computer science. Since the 1990s, the Internet has become the main driving force behind application development in all domains. An increase in the demand for resource sharing (e.g., computing resources, services, metadata, data sources) across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability. This, the 44th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, contains six fully revised and extended papers selected from the 35th conference on Data Management – Principles, Technologies and Applications, BDA 2019. The topics covered include big data, graph data streams, workflow execution in the cloud, privacy in crowdsourcing, secure distributed computing, machine learning, and data mining for recommendation systems.




An Architecture for Fast and General Data Processing on Large Clusters


Book Description

The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to clusters. Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large and valuable data streams. However, the processing capabilities of single machines have not kept up with the size of data. As a result, organizations increasingly need to scale out their computations over clusters. At the same time, the speed and sophistication required of data processing have grown. In addition to simple queries, complex algorithms like machine learning and graph analysis are becoming common. And in addition to batch processing, streaming analysis of real-time data is required to let organizations take timely action. Future computing platforms will need to not only scale out traditional workloads, but support these new applications too.

This book, a revised version of the 2014 ACM Doctoral Dissertation Award-winning dissertation, proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping MapReduce's scalability and fault tolerance. And whereas most deployed systems only support simple one-pass computations (e.g., SQL queries), ours also extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing.

We achieve these results through a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs). We show that this is enough to capture a wide range of workloads. We implement RDDs in the open source Spark system, which we evaluate using synthetic and real workloads. Spark matches or exceeds the performance of specialized systems in many domains, while offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine the generality of RDDs from both a theoretical modeling perspective and a systems perspective. This version of the dissertation makes corrections throughout the text and adds a new section on the evolution of Apache Spark in industry since 2014. In addition, editing, formatting, and links for the references have been added.
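The data-sharing primitive described above, the RDD, is exposed directly in Spark's public API. The short PySpark sketch below is a generic usage example (not code from the book) showing how persisting an RDD in memory lets a multi-pass computation reuse the same filtered data instead of re-reading the input; the file name logs.txt is a placeholder, and a local Spark installation is assumed.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-multipass-sketch")

    # Load a text file and keep only lines containing "ERROR".
    errors = sc.textFile("logs.txt").filter(lambda line: "ERROR" in line)

    # Persist the filtered RDD in memory so the two passes below reuse it
    # rather than re-reading and re-filtering the input file.
    errors.persist()

    total = errors.count()                                   # first pass
    top_words = (errors.flatMap(lambda line: line.split())   # second pass
                       .map(lambda w: (w, 1))
                       .reduceByKey(lambda a, b: a + b)
                       .take(10))

    print(total, top_words)
    sc.stop()

This pattern is what makes iterative algorithms such as machine learning practical on the architecture: each pass operates on an in-memory RDD rather than on freshly materialized intermediate files.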




Data Science and Big Data Analytics in Smart Environments


Book Description

Most applications generate large datasets: social networking and social influence programs, smart city applications, smart home environments, cloud applications, public web sites, scientific experiments and simulations, data warehouses, monitoring platforms, and e-government services. Data grows rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. Large-scale interconnected systems aim to aggregate and efficiently exploit the power of widely distributed resources. In this context, major solutions for scalability, mobility, reliability, fault tolerance, and security are required to achieve high performance and to create a smart environment. The impact on data processing, transfer, and storage is the need to re-evaluate approaches and solutions so that they better meet user needs. A variety of solutions exist for specific applications and platforms, so a thorough and systematic analysis of existing solutions for data science, data analytics, and the methods and algorithms used in Big Data processing and storage environments is essential when designing and implementing a smart environment. Fundamental issues pertaining to smart environments (smart cities, ambient assisted living, smart houses, greenhouses, cyber-physical systems, etc.) are reviewed. Most of the current efforts still do not adequately address the heterogeneity of different distributed systems, the interoperability between them, and their resilience. This book primarily encompasses practical approaches that promote research in all aspects of data processing and data analytics in different types of systems: cluster computing, grid computing, peer-to-peer, and cloud/edge/fog computing, all of which involve elements of heterogeneity and a large variety of tools and software to manage them. The main role of resource management techniques in this domain is to create suitable frameworks for developing and deploying high-performance applications in smart environments. The book focuses on algorithms, architectures, management models, high-performance computing techniques, and large-scale distributed systems.




Cloud Computing in Remote Sensing


Book Description

This book shows how to provide users with quick and easy data acquisition, processing, storage, and product generation services. It describes the entire life cycle of remote sensing data and builds a complete framework for a high-performance remote sensing data processing system. It also develops a series of remote sensing data management and processing standards. Features: remote sensing cloud computing; remote sensing data integration across distributed data centers; cloud storage-based remote sensing data sharing services; high-performance remote sensing data processing; and distributed remote sensing product analysis.




Transactions on Large-Scale Data- and Knowledge-Centered Systems XX


Book Description

The LNCS journal Transactions on Large-Scale Data- and Knowledge-Centered Systems focuses on data management, knowledge discovery, and knowledge processing, which are core and hot topics in computer science. Since the 1990s, the Internet has become the main driving force behind application development in all domains. An increase in the demand for resource sharing across different sites connected through networks has led to an evolution of data- and knowledge-management systems from centralized systems to decentralized systems enabling large-scale distributed applications providing high scalability. Current decentralized systems still focus on data and knowledge as their main resource. Feasibility of these systems relies basically on P2P (peer-to-peer) techniques and the support of agent systems with scaling and decentralized control. Synergy between grids, P2P systems, and agent technologies is the key to data- and knowledge-centered systems in large-scale environments. This, the 20th issue of Transactions on Large-Scale Data- and Knowledge-Centered Systems, presents a representative and useful selection of articles covering a wide range of important topics in the domain of advanced techniques for big data management. Big data has become a popular term, used to describe the exponential growth and availability of data. The recent radical expansion and integration of computation, networking, digital devices, and data storage has provided a robust platform for the explosion in big data, as well as being the means by which big data are generated, processed, shared, and analyzed. In general, data are only useful if meaning and value can be extracted from them. Big data discovery enables data scientists and other analysts to uncover patterns and correlations through analysis of large volumes of data of diverse types. Insights gleaned from big data discovery can provide businesses with significant competitive advantages, leading to more successful marketing campaigns, decreased customer churn, and reduced loss from fraud. In practice, the growing demand for large-scale data processing and data analysis applications has spurred the development of novel solutions from both industry and academia.




Data Processing and Modeling with Hadoop


Book Description

Understand data in a simple way using a data lake.

KEY FEATURES
● In-depth practical demonstration of Hadoop/YARN concepts with numerous examples.
● Graphical illustrations and visual explanations of Hadoop commands and parameters.
● Details of dimensional modeling and Data Vault modeling.
● Details of how to create and define the structure of a data lake.

DESCRIPTION
The book 'Data Processing and Modeling with Hadoop' explains in a straightforward and clear manner how a distributed system works and its benefits in the big data era. After reading the book, you will be able to plan and organize projects involving a massive amount of data. The book describes the standards and technologies that aid in data management and compares them to other technology business standards. The reader receives practical guidance on how to segregate and separate data into zones (an illustrative zone-layout sketch follows this description), as well as how to develop a model that can aid in data evolution. It discusses security and the measures used to reduce security risks. Self-service analytics, the Data Lake, Data Vault 2.0, and Data Mesh are discussed in the book. After reading this book, the reader will have a thorough understanding of how to structure a data lake, as well as the ability to plan, organize, and carry out the implementation of a data-driven business with full governance and security.

WHAT YOU WILL LEARN
● Learn the basics of the components of the Hadoop ecosystem.
● Understand the structure, files, and zones of a data lake.
● Learn to implement the security aspects of the Hadoop ecosystem.
● Learn to work with Data Vault 2.0 modeling.
● Learn to develop a strategy for defining good governance.
● Learn new tools for working with data and Big Data.

WHO THIS BOOK IS FOR
This book caters to big data developers, technical specialists, consultants, and students who want to build good proficiency in big data. Knowing basic SQL concepts, modeling, and development would be helpful, although not mandatory.

TABLE OF CONTENTS
1. Understanding the Current Moment
2. Defining the Zones
3. The Importance of Modeling
4. Massive Parallel Processing
5. Doing ETL/ELT
6. A Little Governance
7. Talking About Security
8. What Are the Next Steps?
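The zone-based organization mentioned in the description can be pictured with a very small sketch. The Python snippet below is illustrative only and does not come from the book: the zone names (raw, trusted, refined) are common data lake conventions rather than the book's exact terminology, and the local filesystem stands in for HDFS or object storage.

    from pathlib import Path
    import shutil

    # Illustrative zone layout; a real data lake would use HDFS or object-store
    # paths with per-zone access controls rather than plain local directories.
    LAKE_ROOT = Path("datalake")
    ZONES = ["raw", "trusted", "refined"]

    def init_lake():
        """Create one directory per zone."""
        for zone in ZONES:
            (LAKE_ROOT / zone).mkdir(parents=True, exist_ok=True)

    def ingest(source_file, dataset):
        """Land an incoming file untouched in the raw zone, grouped by dataset."""
        target_dir = LAKE_ROOT / "raw" / dataset
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(source_file, target_dir)

    init_lake()
    # ingest("sales_2024.csv", dataset="sales")  # example call; the file name is a placeholder

Downstream jobs would then read from the raw zone, apply cleansing and modeling (for example, Data Vault 2.0 structures), and write the results into the trusted and refined zones.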