User-Level I/O Accelerations for High-Performance Deep Learning Applications


Book Description

With the popularity of microprocessors and scale-out system architectures, many large-scale high-performance computing (HPC) systems are built from a collection of compute servers, each with an identical set of resources such as CPU, memory, and storage. A variety of applications leverage the tremendous computation capacity of these large-scale HPC systems; scientific applications and deep learning (DL) training are two of the most popular workloads. However, with the rapid growth of computation power, it has become increasingly important to close the gap between computation and I/O performance for these applications and workloads. In recent years, many research efforts have explored user-level file systems on HPC systems for various workloads, owing to the flexibility of implementing and maintaining them in user space. In particular, scientific applications, which exhibit two typical I/O patterns (checkpoint/restart and multi-dimensional I/O), have been able to use different specialized user-level file systems within a single job. However, this approach can introduce non-trivial overheads, which must be carefully examined in order to mitigate the resulting performance degradation.

In addition, the existing methods of using user-level file systems are not sufficient to meet the fundamental I/O needs of large-scale DL training on HPC systems. Firstly, in DL training, random samples are organized into batches to update model parameters over iterations. This prevents the model from being biased by noise in the input sequence, enables faster convergence, and reduces memory consumption during training. It also results in massive random reads for data shuffling across the entire dataset on the storage system, a random-read I/O pattern significantly different from that of traditional scientific workloads. Moreover, leadership HPC systems are often equipped with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices, and DL applications running on these systems face a resource underutilization problem: the low latency and high bandwidth of NVM devices can be severely underutilized under heavy CPU and memory workloads. In this environment, the flash or NVMe storage devices are capable of low-latency, high-bandwidth I/O services, but the complex software stack significantly hampers these capabilities by processing I/O in the kernel. Also, due to DL training accuracy and performance concerns, the storage capacity and the performance of on-node storage devices on the nodes allocated to a training job are not sufficient to store an entire dataset or to match the training speed, respectively.

This dissertation focuses on applying user-level file systems on HPC systems. Our overarching goal is to accelerate I/O support on HPC systems through specialized user-level file systems for popular workloads. Specifically, we want to bring in lightweight user-level file systems as efficient intermediaries to reduce the performance overhead and ease the use of multiple FUSE file systems in a single job, to orchestrate data movement between storage tiers and DL applications, and to improve storage resource utilization for a pool of NVMe SSDs in DL training. Based on these design goals, we investigate the issues and challenges of applying existing user-level file systems to these popular workloads, then propose three strategies to meet our goals.
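To make that access pattern concrete, the following minimal Python sketch (our own illustration, not code from the dissertation) shows how per-epoch shuffling of a batched input pipeline turns dataset access into scattered random reads:

```python
import random

def shuffled_batches(num_samples, batch_size, seed=None):
    """Yield batches of sample indices in a fresh random order,
    as a DL input pipeline does once per epoch."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)          # global shuffle over the dataset
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]   # each batch hits scattered samples

# Every epoch reshuffles, so consecutive batches read samples that live far
# apart on storage: a massive random-read pattern, unlike the largely
# sequential checkpoint/restart I/O of scientific applications.
for epoch in range(2):
    for batch in shuffled_batches(num_samples=10_000, batch_size=32, seed=epoch):
        pass  # in practice: read each sample in `batch` from the storage system
```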
Firstly, we study the problem of the excessive cost of crossing the user-kernel boundary when using multiple traditional user-level file systems, and we design Direct-FUSE to support multiple FUSE file systems, as well as other custom user-level file systems, entirely in user space, without the need to cross the user/kernel boundary into the FUSE kernel module. All layers of Direct-FUSE reside in user space, and applications can directly use pre-defined unified file system calls to interact with different user-defined file systems. Our performance results show that Direct-FUSE can outperform some native FUSE file systems and does not add significant overhead over the backend file systems.

Secondly, we examine the I/O patterns of deep neural networks and study the performance overheads of loading samples in some popular DL applications. We then introduce an entropy-aware I/O framework called DeepIO for large-scale deep learning on HPC systems. DeepIO coordinates the use of memory, communication, and I/O resources for efficient training over large datasets, and features an I/O pipeline with several novel optimizations: RDMA (Remote Direct Memory Access)-assisted in-situ shuffling, input pipelining, and entropy-aware opportunistic ordering. It outperforms state-of-the-art persistent-memory-based distributed file systems for efficient sample loading during DL training.

Thirdly, beyond examining the I/O patterns of deep neural networks, we also reveal a critical need to load many small samples randomly, and identify storage resource underutilization as an obstacle to successful training. Based on these findings, we design a specialized Deep Learning File System (DLFS) with an in-memory tree-based sample directory for metadata management and user-level storage disaggregation through the SPDK protocol. Our experimental results show that DLFS can dramatically improve training throughput for deep neural networks compared with the kernel-based local Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization.

In conclusion, the first branch concentrates on enriching the functionality and enhancing the performance of the Direct-FUSE framework, while the second and third branches focus on wisely storing and prefetching datasets with the coordination of hierarchical storage tiers and fast interconnects, respectively. By exploring these three branches, we can further accelerate specialized user-level file systems for popular workloads on HPC systems.
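The Direct-FUSE idea of serving pre-defined unified file system calls from multiple user-level backends can be pictured as a dispatch layer. The sketch below is purely illustrative: the prefix-based mount table and the toy MemFS backend are our assumptions, not the dissertation's actual interface.

```python
class MemFS:
    """Toy user-level backend: an in-memory key/value 'file system'."""
    def __init__(self, files):
        self.files = files
    def open(self, path, flags=0):
        if path not in self.files:
            raise FileNotFoundError(path)
        return path  # use the path itself as the file handle
    def read(self, handle, size, offset=0):
        return self.files[handle][offset:offset + size]

class DirectFUSELike:
    """Route unified file calls to user-level backends by mount prefix,
    so no call ever crosses into the kernel FUSE module."""
    def __init__(self):
        self.mounts = {}  # prefix -> backend instance
    def mount(self, prefix, backend):
        self.mounts[prefix] = backend
    def _resolve(self, path):
        # Longest-prefix match selects the backend serving this path.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self.mounts[prefix], path[len(prefix):]
        raise FileNotFoundError(path)
    def open(self, path, flags=0):
        backend, rel = self._resolve(path)
        return (backend, backend.open(rel, flags))
    def read(self, handle, size, offset=0):
        backend, fh = handle
        return backend.read(fh, size, offset)

fs = DirectFUSELike()
fs.mount("/ckpt/", MemFS({"model.bin": b"weights..."}))
fs.mount("/data/", MemFS({"sample0": b"pixels..."}))
h = fs.open("/ckpt/model.bin")
print(fs.read(h, 7))  # b'weights'
```

Because both backends and the dispatch layer live in the same address space, an open or read costs a function call rather than a round trip through the kernel, which is the overhead Direct-FUSE is designed to avoid.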




High-Performance Big Data Computing


Book Description

An in-depth overview of an emerging field that brings together high-performance computing, big data processing, and deep learning. Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together high-performance computing (HPC), big data processing, and deep learning, aims to meet the challenges posed by large-scale data processing. This book offers an in-depth overview of high-performance big data computing and the associated technical issues, approaches, and solutions. The book covers basic concepts and necessary background knowledge, including data processing frameworks, storage systems, and hardware capabilities; offers a detailed discussion of technical issues in accelerating big data computing in terms of computation, communication, memory and storage, codesign, workload characterization and benchmarking, and system deployment and management; and surveys benchmarks and workloads for evaluating big data middleware systems. It presents a detailed discussion of big data computing systems and applications with high-performance networking, computing, and storage technologies, including state-of-the-art designs for data processing and storage systems. Finally, the book considers some advanced research topics in high-performance big data computing, including designing high-performance deep learning over big data (DLoBD) stacks and HPC cloud technologies.




High Performance Computing


Book Description

This volume constitutes the papers of several workshops which were held in conjunction with the 38th International Conference on High Performance Computing, ISC High Performance 2023, held in Hamburg, Germany, during May 21–25, 2023. The 49 revised full papers presented in this book were carefully reviewed and selected from 70 submissions. ISC High Performance 2023 presents the following workshops:
- 2nd International Workshop on Malleability Techniques Applications in High-Performance Computing (HPCMALL)
- 18th Workshop on Virtualization in High-Performance Cloud Computing (VHPC 23)
- HPC I/O in the Data Center (HPC IODC)
- Workshop on Converged Computing of Cloud, HPC, and Edge (WOCC'23)
- 7th International Workshop on In Situ Visualization (WOIV'23)
- Workshop on Monitoring and Operational Data Analytics (MODA23)
- 2nd Workshop on Communication, I/O, and Storage at Scale on Next-Generation Platforms: Scalable Infrastructures
- First International Workshop on RISC-V for HPC
- Second Combined Workshop on Interactive and Urgent Supercomputing (CWIUS)
- HPC on Heterogeneous Hardware (H3)




Applied Machine Learning and High-Performance Computing on AWS


Book Description

Build, train, and deploy large machine learning models at scale in various domains such as computational fluid dynamics, genomics, autonomous vehicles, and numerical optimization using Amazon SageMaker.

Key Features
- Understand the need for high-performance computing (HPC)
- Build, train, and deploy large ML models with billions of parameters using Amazon SageMaker
- Learn best practices and architectures for implementing ML at scale using HPC

Machine learning (ML) and high-performance computing (HPC) on AWS run compute-intensive workloads across industries and emerging applications. Their use cases span verticals such as computational fluid dynamics (CFD), genomics, and autonomous vehicles. This book provides end-to-end guidance, starting with HPC concepts for storage and networking. It then progresses to working examples of how to process large datasets using SageMaker Studio and EMR. Next, you'll learn how to build, train, and deploy large models using distributed training. Later chapters guide you through deploying models to edge devices using SageMaker and IoT Greengrass, and through performance optimization of ML models for low-latency use cases. By the end of this book, you'll be able to build, train, and deploy your own large-scale ML application using HPC on AWS, following industry best practices and addressing the key pain points encountered in the application life cycle. A sketch of the SageMaker-based distributed training workflow appears after this description.

What you will learn
- Explore data management, storage, and fast networking for HPC applications
- Focus on the analysis and visualization of a large volume of data using Spark
- Train visual transformer models using SageMaker distributed training
- Deploy and manage ML models at scale on the cloud and at the edge
- Get to grips with performance optimization of ML models for low-latency workloads
- Apply HPC to industry domains such as CFD, genomics, AV, and optimization

Who this book is for
The book begins with HPC concepts; however, it expects you to have prior machine learning knowledge. This book is for ML engineers and data scientists interested in learning advanced topics on using large datasets for training large models with distributed training concepts on AWS, deploying models at scale, and performance optimization for low-latency use cases. Practitioners in fields such as numerical optimization, computational fluid dynamics, autonomous vehicles, and genomics, who require HPC for applying ML models to applications at scale, will also find the book useful.
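As a flavor of that workflow, here is a minimal sketch of launching a distributed PyTorch training job with the SageMaker Python SDK. The IAM role ARN, the `train.py` entry script, the S3 path, and the framework versions are placeholders we chose for illustration, not values from the book:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()

# Placeholder values: supply your own IAM role, training script, and S3 data.
estimator = PyTorch(
    entry_point="train.py",                  # your training script (assumed name)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder ARN
    instance_count=2,                        # scale out across instances
    instance_type="ml.p4d.24xlarge",         # GPU instances for large models
    framework_version="1.13",                # example version
    py_version="py39",
    # Enable SageMaker's data-parallel distributed training library:
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    sagemaker_session=session,
)

estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 path
```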




Learning Deep Learning


Book Description

NVIDIA's Full-Color Guide to Deep Learning: All You Need to Get Started and Get Results

"To enable everyone to be part of this historic revolution requires the democratization of AI knowledge and resources. This book is timely and relevant towards accomplishing these lofty goals." -- From the foreword by Dr. Anima Anandkumar, Bren Professor, Caltech, and Director of ML Research, NVIDIA

"Ekman uses a learning technique that in our experience has proven pivotal to success—asking the reader to think about using DL techniques in practice. His straightforward approach is refreshing, and he permits the reader to dream, just a bit, about where DL may yet take us." -- From the foreword by Dr. Craig Clawson, Director, NVIDIA Deep Learning Institute

Deep learning (DL) is a key component of today's exciting advances in machine learning and artificial intelligence. Learning Deep Learning is a complete guide to DL. Illuminating both the core concepts and the hands-on programming techniques needed to succeed, this book is ideal for developers, data scientists, analysts, and others--including those with no prior machine learning or statistics experience. After introducing the essential building blocks of deep neural networks, such as artificial neurons and fully connected, convolutional, and recurrent layers, Magnus Ekman shows how to use them to build advanced architectures, including the Transformer. He describes how these concepts are used to build modern networks for computer vision and natural language processing (NLP), including Mask R-CNN, GPT, and BERT. And he explains how to build a natural language translator and a system that generates natural language descriptions of images. Throughout, Ekman provides concise, well-annotated code examples using TensorFlow with Keras. Corresponding PyTorch examples are provided online, and the book thereby covers the two dominating Python libraries for DL used in industry and academia. He concludes with an introduction to neural architecture search (NAS), exploring important ethical issues and providing resources for further learning.
- Explore and master core concepts: perceptrons, gradient-based learning, sigmoid neurons, and backpropagation
- See how DL frameworks make it easier to develop more complicated and useful neural networks
- Discover how convolutional neural networks (CNNs) revolutionize image classification and analysis
- Apply recurrent neural networks (RNNs) and long short-term memory (LSTM) to text and other variable-length sequences
- Master NLP with sequence-to-sequence networks and the Transformer architecture
- Build applications for natural language translation and image captioning

NVIDIA's invention of the GPU sparked the PC gaming market. The company's pioneering work in accelerated computing--a supercharged form of computing at the intersection of computer graphics, high-performance computing, and AI--is reshaping trillion-dollar industries, such as transportation, healthcare, and manufacturing, and fueling the growth of many others. Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.
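For a taste of the TensorFlow-with-Keras style the book uses, here is a small fully connected classifier. This snippet is our own illustration, not an example taken from the book; the MNIST-sized input shape is an assumption:

```python
import tensorflow as tf
from tensorflow import keras

# A tiny fully connected network: dense layers trained with
# gradient-based learning via backpropagation.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),     # e.g., MNIST-sized images
    keras.layers.Dense(128, activation="relu"),     # hidden fully connected layer
    keras.layers.Dense(10, activation="softmax"),   # 10-class output
])

model.compile(optimizer="sgd",                       # stochastic gradient descent
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(x_train, y_train, epochs=5)  # train once data is loaded
```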




IBM Power System S822LC for High Performance Computing Introduction and Technical Overview


Book Description

This IBM® Redpaper™ publication is a comprehensive guide that covers the IBM Power System™ S822LC for High Performance Computing (HPC) server (8335-GTB model). The S822LC for HPC server is designed for high-performance computing applications that support the Linux operating system, high-performance data analytics, the enterprise data center, and accelerated cloud deployments.

This paper introduces the major innovative S822LC for HPC server features and their relevant functions:
- Powerful IBM POWER8® processors that offer 16 cores at 3.259 GHz with 3.857 GHz turbo performance, or 20 cores at 2.860 GHz with 3.492 GHz turbo
- A 19-inch rack-mount 2U configuration
- NVIDIA NVLink technology for exceptional processor-to-accelerator intercommunication
- Four dedicated connectors for the NVIDIA Tesla P100 GPU

This publication is for professionals who want to acquire a better understanding of IBM Power Systems products and is intended for the following audience:
- Clients
- Sales and marketing professionals
- Technical support professionals
- IBM Business Partners
- Independent software vendors

This paper expands the set of IBM Power Systems documentation by providing a desktop reference that offers a detailed technical description of the S822LC for HPC server. This paper does not replace the latest marketing materials and configuration tools. It is intended as an additional source of information that, together with existing sources, can be used to enhance your knowledge of IBM server solutions.




IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences


Book Description

This IBM® Redpaper publication provides an update to the original description of IBM Reference Architecture for Genomics. This paper expands the reference architecture to cover all of the major vertical areas of healthcare and life sciences industries, such as genomics, imaging, and clinical and translational research. The architecture was renamed IBM Reference Architecture for High Performance Data and AI in Healthcare and Life Sciences to reflect the fact that it incorporates key building blocks for high-performance computing (HPC) and software-defined storage, and that it supports an expanding infrastructure of leading industry partners, platforms, and frameworks. The reference architecture defines a highly flexible, scalable, and cost-effective platform for accessing, managing, storing, sharing, integrating, and analyzing big data, which can be deployed on-premises, in the cloud, or as a hybrid of the two. IT organizations can use the reference architecture as a high-level guide for overcoming data management challenges and processing bottlenecks that are frequently encountered in personalized healthcare initiatives, and in compute-intensive and data-intensive biomedical workloads. This reference architecture also provides a framework and context for modern healthcare and life sciences institutions to adopt cutting-edge technologies, such as cognitive life sciences solutions, machine learning and deep learning, Spark for analytics, and cloud computing. To illustrate these points, this paper includes case studies describing how clients and IBM Business Partners alike used the reference architecture in the deployments of demanding infrastructures for precision medicine. This publication targets technical professionals (consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for providing life sciences solutions and support.







IBM Power Systems L and LC Server Positioning Guide


Book Description

This IBM® Redpaper™ publication is written to assist you in locating the optimal server/workload fit within the IBM Power Systems™ L and IBM OpenPOWER LC product lines. IBM has announced several scale-out servers, and as a partner in the OpenPOWER organization, it has engineered unique design characteristics into the LC line that broaden the suite of available workloads beyond typical client OS hosting. This paper looks at the benefits of the Power Systems L servers and OpenPOWER LC servers, and at how they differ, providing unique benefits for enterprise workloads and use cases.




Intelligent Computing Theories and Application


Book Description

This two-volume set of LNCS 12836 and LNCS 12837 constitutes, in conjunction with the volume LNAI 12838, the refereed proceedings of the 17th International Conference on Intelligent Computing, ICIC 2021, held in Shenzhen, China, in August 2021. The 192 full papers of the three proceedings volumes were carefully reviewed and selected from 458 submissions. The ICIC theme unifies the picture of contemporary intelligent computing techniques as an integral concept that highlights the trends in advanced computational intelligence and bridges theoretical research with applications. The theme for this conference is "Advanced Intelligent Computing Methodologies and Applications." The papers are organized in the following subsections: Intelligent Computing in Computer Vision; Intelligent Control and Automation; Intelligent Modeling Technologies for Smart Cities; Machine Learning; and Theoretical Computational Intelligence and Applications.