Latency Tradeoffs in Distributed Storage Access


Book Description

The performance of storage systems is central to handling the huge volumes of data generated by a variety of sources, including scientific experiments, social media, crowdsourcing, and an increasing variety of cyber-physical systems. Emerging high-speed storage technologies make it possible to ingest and access such large volumes of data efficiently. However, the combination of new applications that largely generate unstructured and semi-structured streams of data with these high-speed storage technologies poses a number of new challenges, including handling such data at low latency and ensuring that the network providing access to the data does not become the bottleneck. The traditional relational model is not well suited to efficiently storing and retrieving unstructured and semi-structured data. An alternative mechanism, popularly known as the Key-Value Store (KVS), has been investigated over the last decade to handle such data. A KVS needs only a 'key' to uniquely identify a data record, which may be of variable length and may or may not have further structure in the form of predefined fields. Most existing KVSs were designed for hard-disk-based storage (before SSDs gained popularity), where avoiding random accesses is crucial for good performance. Unfortunately, as modern solid-state drives become the norm for data center storage, these HDD-oriented KV structures result in high read, write, and space amplification, which is detrimental to both the SSD's performance and its endurance. Note also that regardless of how storage systems are deployed, access to large amounts of storage by many nodes must necessarily go over the network. Emerging storage technologies such as Flash, 3D XPoint, and phase change memory (PCM), coupled with highly efficient access protocols such as NVMe, can ingest and read data at rates that challenge even leading-edge networking technologies such as 100 Gb/s Ethernet. At the same time, some of the higher-end storage technologies (e.g., Intel Optane storage based on 3D XPoint, PCM) coupled with lean protocols like NVMe can provide storage access latencies in the 10-20 µs range, which means that the additional latency due to network congestion can become significant. The purpose of this thesis is to address some of the aforementioned issues. We propose a new hash-based, SSD-friendly key-value store (KVS) architecture called FlashKey, designed specifically for SSDs to provide low access latencies, low read and write amplification, and the ability to easily trade off latency for sequential accesses such as range queries. Through detailed experimental evaluation of FlashKey against the two most popular KVSs, RocksDB and LevelDB, we demonstrate that even as an initial implementation we achieve substantially better write amplification, average latency, and tail latency at similar or better space amplification. Next, we deal with network congestion by dynamically replicating heavily used data items. The tradeoff here is between latency and the replication or migration overhead.
It is important to reverse the replication or migration as the congestion fades, since our observations show that placing data and the applications that access it together in a consolidated fashion significantly reduces propagation delay and increases network energy-saving opportunities; this matters because data center networks today are built from high-speed, power-hungry infrastructure. Finally, we design a tradeoff between network consolidation and congestion, trading latency for power savings. During quiet hours, we consolidate traffic onto fewer links and use different sleep modes for the unused links to save power. As traffic increases, we reactively spread traffic back out to avoid congestion from the oncoming surge. There are numerous studies in network energy management that use similar approaches; however, most of them manage energy at a coarser time granularity (e.g., 24 hours or beyond). In contrast, our mechanism tries to exploit all the small-to-medium gaps in traffic and invoke network energy management without causing a significant increase in latency.
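The abstract does not spell out FlashKey's internals, but the general shape of a hash-based, SSD-friendly KVS can be illustrated: an in-memory hash index maps each key to the offset of its latest record in an append-only log, so writes are sequential (low write amplification) and a read costs one index lookup plus one device read. The sketch below is a minimal, Bitcask-style illustration of that idea, not FlashKey's actual design; the class and file names are made up.

```python
import os
import struct

class HashLogKVS:
    """Minimal hash-indexed, append-only key-value store (illustrative sketch).

    Writes are appended sequentially (flash-friendly); an in-memory dict maps
    each key to the file offset of its latest record. Superseded records become
    garbage that a real store would reclaim via compaction.
    """

    def __init__(self, path):
        self.index = {}                      # key (bytes) -> offset of latest record
        self.log = open(path, "ab+")         # append-only data log

    def put(self, key: bytes, value: bytes) -> None:
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        # Record layout: [4-byte key length][4-byte value length][key][value]
        self.log.write(struct.pack("<II", len(key), len(value)) + key + value)
        self.log.flush()
        self.index[key] = offset

    def get(self, key: bytes):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)
        klen, vlen = struct.unpack("<II", self.log.read(8))
        self.log.seek(offset + 8 + klen)
        return self.log.read(vlen)

# Example usage
kv = HashLogKVS("demo.log")
kv.put(b"user:42", b"alice")
print(kv.get(b"user:42"))                    # b'alice'
```

Range scans over such a hash index require an auxiliary sorted structure or a scan of the log, which is exactly the sequential-access tradeoff the abstract mentions for range queries.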
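The consolidation-versus-congestion tradeoff can likewise be sketched as a simple threshold policy: keep only as many links awake as current demand requires under a utilization cap, and wake more links reactively as traffic grows. This is a toy model under assumed parameters (100 Gb/s links, a 70% utilization target), not the mechanism evaluated in the thesis.

```python
import math

def links_to_keep_awake(total_demand_gbps, link_capacity_gbps=100.0, target_util=0.7):
    """Return how many links must stay awake so no awake link exceeds target_util.

    Fewer awake links save power during quiet hours; waking more links as
    demand rises avoids congestion from an oncoming traffic surge.
    """
    usable_per_link = link_capacity_gbps * target_util
    return max(1, math.ceil(total_demand_gbps / usable_per_link))

# Quiet hour: 40 Gb/s of demand fits on a single 100 Gb/s link; the rest can sleep.
print(links_to_keep_awake(40))    # 1
# Surge: 500 Gb/s reactively wakes 8 links to keep utilization under 70%.
print(links_to_keep_awake(500))   # 8
```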




Robo-Line Storage


Book Description

Rapid advances in high performance computing are making possible more complete and accurate computer-based modeling of complex physical phenomena, such as weather front interactions, the dynamics of chemical reactions, numerical aerodynamic analysis of airframes, and ocean-land-atmosphere interactions. Many of these 'grand challenge' applications are as demanding of the underlying storage system, in terms of capacity and bandwidth requirements, as they are of the computational power of the processor. A global view of the Earth's ocean chlorophyll and land vegetation requires over 2 terabytes of raw satellite image data. In this paper, we describe our planned research program in high capacity, high bandwidth storage systems. The project has four overall goals. First, we will examine new methods for high capacity storage systems, made possible by low cost, small form factor magnetic and optical tape systems. Second, access to the storage system will be low latency and high bandwidth. To achieve this, we must interleave data transfer at all levels of the storage system, including devices, controllers, servers, and communications links; latency will be further reduced by extensive caching throughout the storage hierarchy. Third, we will provide effective management of the storage hierarchy, extending the techniques already developed for the Log-Structured File System. Finally, we will construct a prototype high capacity file server suitable for use on the National Research and Education Network (NREN). Such research must be a cornerstone of any coherent program in high performance computing and communications. Katz, Randy H.; Anderson, Thomas E.; Ousterhout, John K.; Patterson, David A. NAG2-591.







Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning


Book Description

Our society is experiencing rapid growth in the amount of data being produced, driven by widely used mobile devices, sensors, and computers. Recent estimates show that 2.5 exabytes of data are generated worldwide every day. Analyzing this data could enable more intelligent business decisions, faster scientific discoveries, and better services for society. Traditional single-machine data processing techniques, such as relational database management systems, quickly show their limitations when handling such large amounts of data. To satisfy the ever-growing demand for large-scale data analysis, various public and commercial distributed data analysis systems have been built, such as High Performance Computing and Cloud Computing systems. These distributed data processing systems, with their excellent concurrency, scalability, and fault tolerance, are gaining increasing attention in research institutions and industry, and people already enjoy the benefits of collecting and analyzing large amounts of data on mature, widely deployed platforms. Unfortunately, distributed data processing systems have their own performance problems. More specifically, at the device layer, the system suffers from long seek latencies in hard disks, which reduce I/O throughput under random-access I/O patterns. At the framework layer, the system experiences the straggler problem in parallel jobs, where the slowest task alone prolongs job execution time even though all other tasks finish much earlier. At the algorithm layer, the system has difficulty deciding the intermediate cache size, where the following phase's speed-up benefit can be outweighed by the overhead of writing and reading a large intermediate cache file. This thesis solves these problems, and hence improves distributed system performance, by exploiting data placement and partitioning. Specifically, we propose the following solutions to the three problems above. First, we propose a hybrid storage system of hard disk drives and solid state drives in an HPC environment, in which the input data layout is reorganized to hide the long seek latency of hard disks. Second, we propose logical data partitioning strategies for input data, so that the distributed system can benefit from fine-grained tasks' ability to mitigate stragglers without paying a prohibitive overhead. Lastly, when intermediate data can be saved to speed up the following job's execution, we propose an online analyzer that decides how much data to place in the cache. We have designed and implemented prototypes for each approach and evaluated them with representative workloads and datasets on the widely used distributed platforms PVFS and Hadoop. Our evaluation results are close to optimal and match the theoretically expected performance improvements: at the device layer we achieve low-latency storage at affordable cost, at the framework layer we achieve minimal phase execution time in the presence of stragglers, and at the algorithm layer we achieve near-optimal job execution time for MapReduce FIM algorithms. Furthermore, our prototypes have low system overhead, which is a necessity for wide adoption in practice.
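For the third approach, the core decision the online analyzer must make can be framed as a cost/benefit comparison: materialize an intermediate cache only if the recomputation time it saves for later phases exceeds the time spent writing and then reading the cache file. The sketch below is a back-of-the-envelope version of that rule with illustrative parameters, not the thesis's actual analyzer.

```python
def should_cache(cache_bytes, write_bw_bps, read_bw_bps, recompute_seconds, reuse_count=1):
    """Return True if materializing an intermediate cache is expected to pay off.

    Cost: one sequential write of the cache file plus one read per reuse.
    Benefit: the recomputation time each reusing phase would otherwise spend.
    """
    write_cost = cache_bytes / write_bw_bps
    read_cost = cache_bytes / read_bw_bps
    total_cost = write_cost + reuse_count * read_cost
    total_benefit = reuse_count * recompute_seconds
    return total_benefit > total_cost

# A 20 GB intermediate file on a 200 MB/s disk, reused twice, saving 300 s per reuse:
# cost = 100 s (write) + 2 * 100 s (reads) = 300 s; benefit = 600 s, so caching wins.
print(should_cache(20e9, 200e6, 200e6, recompute_seconds=300, reuse_count=2))  # True
```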




Storage Network Performance Analysis


Book Description

- Features vendor-neutral coverage applicable to any storage network
- Includes a special case-study section citing real-world applications and examples
- The first vendor-neutral volume to cover storage network performance tuning and optimization
- Exacting performance monitoring and analysis maximizes the efficiency and cost-effectiveness of existing storage networks
- Meets the needs of network administrators, storage engineers, and IT professionals faced with shrinking budgets and growing data storage demands




Algorithms and Architectures for Parallel Processing


Book Description

This three-volume set LNCS 12452, 12453, and 12454 constitutes the proceedings of the 20th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2020, held in New York City, NY, USA, in October 2020. The 142 full papers and 5 short papers included in these proceedings volumes were carefully reviewed and selected from 495 submissions. ICA3PP covers the many dimensions of parallel algorithms and architectures, encompassing fundamental theoretical approaches, practical experimental projects, and commercial components and systems. As applications of computing systems have permeated every aspect of daily life, the power of computing systems has become increasingly critical. The conference provides a forum for academics and practitioners from around the world to exchange ideas for improving the efficiency, performance, reliability, security, and interoperability of computing systems and applications. ICA3PP 2020 focuses on two broad areas of parallel and distributed computing, i.e., architectures, algorithms and networks, and systems and applications.




Storage Systems


Book Description

Storage Systems: Organization, Performance, Coding, Reliability and Their Data Processing was motivated by the 1988 Redundant Array of Inexpensive/Independent Disks (RAID) proposal to replace large form factor mainframe disks with an array of commodity disks. Disk loads are balanced by striping data into strips, one strip per disk, and storage reliability is enhanced via replication or erasure coding, which at best dedicates k strips per stripe to tolerate k disk failures. Flash memories have resulted in a paradigm shift, with Solid State Drives (SSDs) replacing Hard Disk Drives (HDDs) for high performance applications. RAID and Flash have resulted in the emergence of new storage companies, namely EMC, NetApp, SanDisk, and Pure Storage, and a multibillion-dollar storage market. Key new conferences and publications are reviewed in this book. The goal of the book is to expose students, researchers, and IT professionals to the more important developments in storage systems, while covering the evolution of storage technologies, traditional and novel databases, and novel sources of data. We describe several prototypes: FAWN at CMU, RAMCloud at Stanford, and Lightstore at MIT; Oracle's Exadata, AWS' Aurora, Alibaba's PolarDB, and the Fungible Data Center; and the author's paper designs for cloud storage, namely heterogeneous disk arrays and hierarchical RAID. The book:
- Surveys storage technologies and lists sources of data: measurements, text, audio, images, and video
- Familiarizes the reader with paradigms to improve performance: caching, prefetching, log-structured file systems, and log-structured merge trees (LSMs)
- Describes RAID organizations and analyzes their performance and reliability
- Conserves storage via data compression, deduplication, and compaction, and secures data via encryption
- Specifies the implications of storage technologies for performance and power consumption
- Exemplifies database parallelism for big data, analytics, and deep learning via multicore CPUs, GPUs, FPGAs, and ASICs, e.g., Google's Tensor Processing Units
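The striping-plus-redundancy scheme described above is easy to make concrete in the single-parity (RAID-5-like) case: data is split into strips, one per data disk, and a parity strip equal to the XOR of the data strips lets the array rebuild any one lost strip. The sketch below illustrates only the k = 1 case, not the general k-failure erasure codes the book covers.

```python
def make_stripe(data: bytes, num_data_disks: int):
    """Split data into equal strips, one per data disk, plus one XOR parity strip."""
    strip_len = -(-len(data) // num_data_disks)          # ceiling division
    strips = [data[i * strip_len:(i + 1) * strip_len].ljust(strip_len, b"\x00")
              for i in range(num_data_disks)]
    parity = bytearray(strip_len)
    for strip in strips:
        for i, b in enumerate(strip):
            parity[i] ^= b
    return strips, bytes(parity)

def recover_strip(strips, parity, lost_index):
    """Rebuild one lost strip by XORing the parity with the surviving strips."""
    rebuilt = bytearray(parity)
    for j, strip in enumerate(strips):
        if j == lost_index:
            continue
        for i, b in enumerate(strip):
            rebuilt[i] ^= b
    return bytes(rebuilt)

# 4 data disks plus 1 parity disk; losing disk 2 is recoverable from the other 4.
strips, parity = make_stripe(b"the quick brown fox jumps over the lazy dog", 4)
assert recover_strip(strips, parity, 2) == strips[2]
```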




RobuSTore


Book Description

Emerging large-scale scientific applications involve massive, distributed, shared data collections (petabytes) and require robust, high performance for read-dominated workloads. Achieving robust performance (low variability) in storage systems is difficult. We propose RobuSTore, a novel storage technique that combines erasure codes and speculative access to reduce performance variability and increase performance. RobuSTore uses erasure codes to add flexible redundancy and then spreads the encoded data across a large number of disks. Speculative access to the redundant data from multiple disks enables application requests to be satisfied with only the early-arriving blocks, reducing performance dependence on the behavior of individual disks. We present the design and an evaluation of RobuSTore, which show improved robustness, reducing the standard deviation of access latencies by as much as 5-fold versus traditional RAID. In addition, RobuSTore improves access bandwidth by as much as 15-fold. A typical 1 GB read from 64 disks has an average latency of 2 seconds with a standard deviation of only 0.5 seconds, or 25%. RobuSTore secures these benefits at the cost of a 2-3x storage capacity overhead and roughly 1.5x network and disk I/O overhead.
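The speculative-access idea rests on the property of an (n, k) erasure code that any k of the n encoded blocks suffice to reconstruct the data, so a reader can issue all n block requests and return as soon as the fastest k complete, insulating latency from individual slow disks. The sketch below shows only that completion rule; the disk reads are stubbed with random delays, the decode step is omitted, and all names and parameters are illustrative rather than RobuSTore's implementation.

```python
import concurrent.futures
import random
import time

def read_block(disk_id):
    """Stub for a disk read with variable latency (occasionally a very slow disk)."""
    time.sleep(random.uniform(0.01, 0.05) + (0.5 if random.random() < 0.1 else 0.0))
    return disk_id, b"encoded-block"

def speculative_read(n=16, k=10):
    """Issue reads to all n disks and return once the earliest k blocks arrive."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(read_block, d) for d in range(n)]
    blocks = []
    for fut in concurrent.futures.as_completed(futures):
        blocks.append(fut.result())
        if len(blocks) >= k:
            break                          # enough blocks to erasure-decode
    pool.shutdown(wait=False)              # straggler reads finish in the background
    return blocks                          # a real system would now decode these k blocks

print(len(speculative_read()))             # 10
```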




Parallel and Distributed Processing and Applications


Book Description

This book constitutes the refereed proceedings of the Third International Symposium on Parallel and Distributed Processing and Applications, ISPA 2005, held in Nanjing, China, in November 2005. The 90 revised full papers and 19 revised short papers, presented together with 3 keynote speeches and 2 tutorials, were carefully reviewed and selected from 645 submissions. The papers are organized in topical sections on cluster systems and applications; performance evaluation and measurements; distributed algorithms and systems; fault tolerance and reliability; high-performance computing and architecture; parallel algorithms and systems; network routing and communication algorithms; security algorithms and systems; grid applications and systems; database applications and data mining; distributed processing and architecture; sensor networks and protocols; peer-to-peer algorithms and systems; internet computing and Web technologies; network protocols and switching; and ad hoc and wireless networks.