Database Reliability Engineering


Book Description

The infrastructure-as-code revolution in IT is also affecting database administration. With this practical book, developers, system administrators, and junior to mid-level DBAs will learn how the modern practice of site reliability engineering applies to the craft of database architecture and operations. Authors Laine Campbell and Charity Majors provide a framework for professionals looking to join the ranks of today’s database reliability engineers (DBRE). You’ll begin by exploring core operational concepts that DBREs need to master. Then you’ll examine a wide range of database persistence options, including how to implement key technologies to provide resilient, scalable, and performant data storage and retrieval. With a firm foundation in database reliability engineering, you’ll be ready to dive into the architecture and operations of any modern database. This book covers: service-level requirements and risk management; building and evolving an architecture for operational visibility; infrastructure engineering and infrastructure management; how to facilitate the release management process; data storage, indexing, and replication; identifying datastore characteristics and best use cases; and datastore architectural components and data-driven architectures.







Executing Design for Reliability Within the Product Life Cycle


Book Description

At an early stage of development, design teams should ask questions such as "How reliable will my product be?", "How reliable should my product be?", and "How frequently does the product need to be repaired or maintained?" To answer these questions, the design team needs to develop an understanding of how and why its products fail, and then make only those changes that improve reliability while remaining within the cost budget. The body of available literature may be separated into three distinct categories: the "theory" of reliability and its associated calculations; reliability analysis of test or field data (provided the data is well behaved); and, finally, establishing and managing organizational reliability activities. The problem remains that when design engineers face the question of design for reliability, they are often at a loss. What is missing in the reliability literature is a set of practical steps that do not require heavy statistics. Executing Design for Reliability Within the Product Life Cycle provides a basic approach to conducting streamlined reliability-related engineering activities, balancing analysis with a high-level view of reliability within product design and development. This approach gives design engineers a practical understanding of reliability and its role in the design process, and helps design team members assigned to reliability roles and responsibilities understand how to deploy and use reliability tools. The authors draw on their experience to show how these tools and processes are integrated within the design and development cycle to assure reliability, and also to verify and demonstrate this reliability to colleagues and customers.




Applied Reliability Engineering


Book Description




Improving Product Reliability and Software Quality


Book Description

The authoritative guide to the effective design and production of reliable technology products, revised and updated. While most manufacturers have mastered the process of producing quality products, product reliability, software quality, and software security have lagged behind. The revised second edition of Improving Product Reliability and Software Quality offers a comprehensive and detailed guide to implementing a hardware reliability and software quality process for technology products. The authors, noted experts in the field, provide useful tools, forms, and spreadsheets for executing an effective product reliability and software quality development process and explore proven software quality and product reliability concepts. The authors discuss why so many companies fail after attempting to implement or improve their product reliability and software quality program, and they outline the critical steps for implementing a successful program. Success hinges on establishing a reliability lab, hiring the right people, and implementing reliability and software quality processes that do the right things well and work well together. Designed to be accessible, the book contains a decision matrix for small, medium, and large companies. Throughout the book, the authors describe the hardware reliability and software quality process as well as the tools and techniques needed for putting it in place. The concepts, ideas, and material presented are appropriate for any organization. This updated second edition: contains new chapters on software tools, the software quality process, and software security; expands the FMEA section to include software fault trees and software FMEAs; includes two new reliability tools to accelerate design maturity and reduce the risk of premature wearout; contains new material on preventative maintenance, predictive maintenance, and Prognostics and Health Management (PHM) to better manage repair cost and unscheduled downtime; presents updated information on reliability modeling and hiring reliability and software engineers; includes a comprehensive review of the reliability process from a multi-disciplinary viewpoint, including new material on uprating and counterfeit components; and discusses aspects of competition and key quality and reliability concepts, and presents the tools for implementation. Written for engineers, managers, and consultants lacking a background in product reliability and software quality theory and statistics, the updated second edition of Improving Product Reliability and Software Quality explores all phases of the product life cycle.







Reliability Engineering for Nuclear and Other High Technology Systems (1985)


Book Description

First Published in 2017. This book presents a much-needed practical methodology for establishing cost-effective reliability programs in nuclear or other high technology industries. Drawing on the authors' competence and practical experience in the field of reliability, it illustrates the applicability of proven, cost-effective reliability techniques applied in the American space and military programs, hybridized with the avant-garde approach used by nuclear authorities, utilities, and researchers in the United Kingdom and France. The resulting method will support a diligent effort to enhance nuclear safety and protect the health of the general public. The methodology developed in this book exemplifies the total integrated reliability program approach in the design, procurement, manufacturing, test, installation, and operational phases of an equipment life cycle. It is based on lessons learned in space and military programs, with certain methodological modifications to enhance practicality. The techniques described here are equally applicable to college instruction, to upper and middle plant management personnel, and to regulating agencies; the book provides a very pragmatic and cost-efficient approach to the reliability engineering discipline.




Reliability Centered Maintenance – Reengineered


Book Description

Reliability Centered Maintenance – Reengineered: Practical Optimization of the RCM Process with RCM-R® provides an optimized approach to a well-established and highly successful method used for determining failure management policies for physical assets. It makes the original method, which was developed to enhance flight safety, far more useful in a broad range of industries where asset criticality ranges from high to low. RCM-R® is focused on the science of failures and what must be done to enable sustainably reliable operations over the long term. If used correctly, RCM-R® is the first step in delivering fewer breakdowns, more productive capacity, lower costs, safer operations, and improved environmental performance. Maintenance has a huge impact on most businesses, whether its presence is felt or not. RCM-R® ensures that the right work is done so that there are as few nasty surprises as possible that can harm the business in any way. RCM-R® was developed to build on RCM’s original success at delivering that effectiveness while addressing the concerns of the industrial market. RCM-R® addresses the RCM method and shortfalls in its application: it modifies the method to consider asset and even failure mode criticality, so that rigor is applied only where it is truly needed. It removes (within reason) the sources of concern about RCM being overly rigorous and too labor intensive, without compromising its ability to deliver a tailored failure management program for physical assets that is sensitive to their operational context and application. RCM-R® also provides its practitioners with standards-based guidance for determining meaningful failure modes and causes, which facilitates their analysis for an optimum outcome. The book: includes an extensive review of the well-proven RCM method and what is needed to make it successful in the industrial environment; links important elements of the RCM method with relevant International Standards for risk management and failure management; enhances RCM with increased emphasis on statistical analysis, bringing it squarely into the realm of Evidence Based Asset Management; and includes extensive, experience-based advice on implementing and sustaining RCM-based failure management programs.




Site Reliability Engineering


Book Description

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient, lessons that are directly applicable to your organization. This book is divided into four sections: Introduction (learn what site reliability engineering is and why it differs from conventional IT industry practices); Principles (examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer, or SRE); Practices (understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems); and Management (explore Google's best practices for training, communication, and meetings that your organization can use).