Integration of Tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (Dahphrs), Phase 1


Book Description

Systems for Space Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures using candidate SDI weapons-to-target assignment algorithms as workloads were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed and capabilities that will be required for both individual tools and an integrated toolset were identified. Scheper, C. and Baker, R. and Frank, G. and Yalamanchili, S. and Gray, G. Unspecified Center ARCHITECTURE (COMPUTERS); FAULT TOLERANCE; PARALLEL PROCESSING (COMPUTERS); RELIABILITY ANALYSIS; RELIABILITY ENGINEERING; TRADEOFFS; ALGORITHMS; ALLOCATIONS; SYSTEMS ENGINEERING; TECHNOLOGY UTILIZATION...
















Proceedings


Book Description







Performance, Reliability, and Availability Evaluation of Computational Systems, Volume 2


Book Description

This textbook intends to be a comprehensive and substantially self-contained two-volume book covering performance, reliability, and availability evaluation subjects. The volumes focus on computing systems, although the methods may also be applied to other systems. The first volume covers Chapter 1 to Chapter 14, whose subtitle is ``Performance Modeling and Background". The second volume encompasses Chapter 15 to Chapter 25 and has the subtitle ``Reliability and Availability Modeling, Measuring and Workload, and Lifetime Data Analysis". This text is helpful for computer performance professionals for supporting planning, design, configuring, and tuning the performance, reliability, and availability of computing systems. Such professionals may use these volumes to get acquainted with specific subjects by looking at the particular chapters. Many examples in the textbook on computing systems will help them understand the concepts covered in each chapter. The text may also be helpful for the instructor who teaches performance, reliability, and availability evaluation subjects. Many possible threads could be configured according to the interest of the audience and the duration of the course. Chapter 1 presents a good number of possible courses programs that could be organized using this text. Volume II is composed of the last two parts. Part III examines reliability and availability modeling by covering a set of fundamental notions, definitions, redundancy procedures, and modeling methods such as Reliability Block Diagrams (RBD) and Fault Trees (FT) with the respective evaluation methods, adopts Markov chains, Stochastic Petri nets and even hierarchical and heterogeneous modeling to represent more complex systems. Part IV discusses performance measurements and reliability data analysis. It first depicts some basic measuring mechanisms applied in computer systems, then discusses workload generation. After, we examine failure monitoring and fault injection, and finally, we discuss a set of techniques for reliability and maintainability data analysis.