Database Reliability Engineering


Book Description

The infrastructure-as-code revolution in IT is also affecting database administration. With this practical book, developers, system administrators, and junior to mid-level DBAs will learn how the modern practice of site reliability engineering applies to the craft of database architecture and operations. Authors Laine Campbell and Charity Majors provide a framework for professionals looking to join the ranks of today’s database reliability engineers (DBREs). You’ll begin by exploring core operational concepts that DBREs need to master. Then you’ll examine a wide range of database persistence options, including how to implement key technologies to provide resilient, scalable, and performant data storage and retrieval. With a firm foundation in database reliability engineering, you’ll be ready to dive into the architecture and operations of any modern database. This book covers: service-level requirements and risk management; building and evolving an architecture for operational visibility; infrastructure engineering and infrastructure management; how to facilitate the release management process; data storage, indexing, and replication; identifying datastore characteristics and best use cases; and datastore architectural components and data-driven architectures.







Improving Product Reliability and Software Quality


Book Description

The authoritative guide to the effective design and production of reliable technology products, revised and updated. While most manufacturers have mastered the process of producing quality products, product reliability, software quality, and software security have lagged behind. The revised second edition of Improving Product Reliability and Software Quality offers a comprehensive and detailed guide to implementing a hardware reliability and software quality process for technology products. The authors, noted experts in the field, provide useful tools, forms, and spreadsheets for executing an effective product reliability and software quality development process and explore proven software quality and product reliability concepts. The authors discuss why so many companies fail after attempting to implement or improve their product reliability and software quality program. They outline the critical steps for implementing a successful program. Success hinges on establishing a reliability lab, hiring the right people, and implementing a reliability and software quality process whose parts do the right things well and work well together. Designed to be accessible, the book contains a decision matrix for small, medium, and large companies. Throughout the book, the authors describe the hardware reliability and software quality process as well as the tools and techniques needed for putting it in place. The concepts, ideas, and material presented are appropriate for any organization.
This updated second edition: Contains new chapters on software tools, software quality process, and software security. Expands the FMEA section to include software fault trees and software FMEAs. Includes two new reliability tools to accelerate design maturity and reduce the risk of premature wearout. Contains new material on preventative maintenance, predictive maintenance, and Prognostics and Health Management (PHM) to better manage repair cost and unscheduled downtime. Presents updated information on reliability modeling and hiring reliability and software engineers. Includes a comprehensive review of the reliability process from a multi-disciplinary viewpoint, including new material on uprating and counterfeit components. Discusses aspects of competition and key quality and reliability concepts, and presents the tools for implementation.
Written for engineers, managers, and consultants lacking a background in product reliability and software quality theory and statistics, the updated second edition of Improving Product Reliability and Software Quality explores all phases of the product life cycle.




Applied Reliability Engineering


Book Description




Site Reliability Engineering


Book Description

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient, lessons directly applicable to your organization. This book is divided into four sections. Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices. Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE). Practices: Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems. Management: Explore Google’s best practices for training, communication, and meetings that your organization can use.




Reliability Centered Maintenance – Reengineered


Book Description

Reliability Centered Maintenance – Reengineered: Practical Optimization of the RCM Process with RCM-R® provides an optimized approach to a well-established and highly successful method used for determining failure management policies for physical assets. It makes the original method, which was developed to enhance flight safety, far more useful in a broad range of industries where asset criticality ranges from high to low. RCM-R® is focused on the science of failures and what must be done to enable sustainably reliable operations over the long term. If used correctly, RCM-R® is the first step in delivering fewer breakdowns, more productive capacity, lower costs, safer operations, and improved environmental performance. Maintenance has a huge impact on most businesses, whether its presence is felt or not. RCM-R® ensures that the right work is done so that there are as few nasty surprises as possible that can harm the business in any way. RCM-R® was developed to build on RCM’s original success at delivering that effectiveness while addressing the concerns of the industrial market. RCM-R® addresses the RCM method and shortfalls in its application: it modifies the method to consider asset and even failure mode criticality so that rigor is applied only where it is truly needed. It removes (within reason) the sources of concern about RCM being overly rigorous and too labor intensive without compromising on its ability to deliver a tailored failure management program for physical assets sensitive to their operational context and application. RCM-R® also provides its practitioners with standards-based guidance for determining meaningful failure modes and causes, facilitating their analysis for optimum outcomes.
The book includes an extensive review of the well-proven RCM method and what is needed to make it successful in the industrial environment; links important elements of the RCM method with relevant International Standards for risk management and failure management; enhances RCM with increased emphasis on statistical analysis, bringing it squarely into the realm of Evidence Based Asset Management; and includes extensive, experience-based advice on implementing and sustaining RCM-based failure management programs.




Launching Your Asset Reliability Transformation


Book Description

Every reliability improvement initiative that has failed or floundered has lacked sustained leadership from the senior executive. The programs were based on technical "common sense," not business value, and the lack of leadership meant the culture did not change. This book explains how to build a solid business case and win senior management support. It lays the foundation for a successful and sustained program: ensuring the needs and risks of the business are clearly understood, assessing the current state, identifying the gaps, establishing targets and priorities, jumpstarting with pilot projects, and building the economic justification. Appendices explain the economics of reliability (ROI, NPV, IRR, EVA, and more), the value of reliability (OEE, TEEP, safety, and more), Pareto analysis, asset criticality ranking, and selling to senior management. This book does not just tell you what you should do; it lays out a step-by-step guide for exactly how to do it successfully with eight core steps and 44 detailed recommended practices. If you want to launch a new program or revive an existing program, this is the place to start.




Distributed Tracing in Practice


Book Description

Since most applications today are distributed in some fashion, monitoring their health and performance requires a new approach. Enter distributed tracing, a method of profiling and monitoring distributed applications, particularly those that use microservice architectures. There’s just one problem: distributed tracing can be hard. But it doesn’t have to be. With this guide, you’ll learn what distributed tracing is and how to use it to understand the performance and operation of your software. Key players at LightStep and other organizations walk you through instrumenting your code for tracing, collecting the data that your instrumentation produces, and turning it into useful operational insights. If you want to implement distributed tracing, this book tells you what you need to know. You’ll learn: the pieces of a distributed tracing deployment (instrumentation, data collection, and analysis); best practices for instrumentation (methods for generating trace data from your services); how to deal with (or avoid) overhead using sampling and other techniques; how to use distributed tracing to improve baseline performance and to mitigate regressions quickly; and where distributed tracing is headed in the future.




Reliability Centered Maintenance (RCM)


Book Description

A properly implemented and managed RCM program can save millions in unscheduled maintenance and breakdowns. However, many have found the process daunting. Written by an expert with over 30 years of experience, this book introduces innovative approaches to simplify the RCM process, such as single vs. multiple failure analysis, hidden failures analysis, potentially critical components analysis, run-to-failure, and the difference between redundant, standby, and backup functions. Included are real-life examples of flawed preventive maintenance programs and how they led to disasters that could easily have been avoided. Also illustrated in detail, with real-life examples, is the step-by-step process for developing, implementing, and maintaining a premier classical RCM program. Senior management, middle management, supervisors, and craftsmen/technicians responsible for plant safety and reliability will find this book invaluable as a means for establishing a first-class preventive maintenance program.