Fault-Tolerant Systems.
Title:
Fault-Tolerant Systems.
Author:
Koren, Israel.
ISBN:
9780080492681
Physical Description:
1 online resource (399 pages)
Contents:
Foreword -- Preface -- Acknowledgements -- About the Authors --
1 Preliminaries -- 1.1 Fault Classification -- 1.2 Types of Redundancy -- 1.3 Basic Measures of Fault Tolerance -- 1.3.1 Traditional Measures -- 1.3.2 Network Measures -- 1.4 Outline of This Book -- 1.5 Further Reading -- References --
2 Hardware Fault Tolerance -- 2.1 The Rate of Hardware Failures -- 2.2 Failure Rate, Reliability, and Mean Time to Failure -- 2.3 Canonical and Resilient Structures -- 2.3.1 Series and Parallel Systems -- 2.3.2 Non-Series/Parallel Systems -- 2.3.3 M-of-N Systems -- 2.3.4 Voters -- 2.3.5 Variations on N-Modular Redundancy -- 2.3.6 Duplex Systems -- 2.4 Other Reliability Evaluation Techniques -- 2.4.1 Poisson Processes -- 2.4.2 Markov Models -- 2.5 Fault-Tolerance Processor-Level Techniques -- 2.5.1 Watchdog Processor -- 2.5.2 Simultaneous Multithreading for Fault Tolerance -- 2.6 Byzantine Failures -- 2.6.1 Byzantine Agreement with Message Authentication -- 2.7 Further Reading -- 2.8 Exercises -- References --
3 Information Redundancy -- 3.1 Coding -- 3.1.1 Parity Codes -- 3.1.2 Checksum -- 3.1.3 M-of-N Codes -- 3.1.4 Berger Code -- 3.1.5 Cyclic Codes -- 3.1.6 Arithmetic Codes -- 3.2 Resilient Disk Systems -- 3.2.1 RAID Level 1 -- 3.2.2 RAID Level 2 -- 3.2.3 RAID Level 3 -- 3.2.4 RAID Level 4 -- 3.2.5 RAID Level 5 -- 3.2.6 Modeling Correlated Failures -- 3.3 Data Replication -- 3.3.1 Voting: Non-Hierarchical Organization -- 3.3.2 Voting: Hierarchical Organization -- 3.3.3 Primary-Backup Approach -- 3.4 Algorithm-Based Fault Tolerance -- 3.5 Further Reading -- 3.6 Exercises -- References --
4 Fault-Tolerant Networks -- 4.1 Measures of Resilience -- 4.1.1 Graph-Theoretical Measures -- 4.1.2 Computer Networks Measures -- 4.2 Common Network Topologies and Their Resilience -- 4.2.1 Multistage and Extra-Stage Networks -- 4.2.2 Crossbar Networks -- 4.2.3 Rectangular Mesh and Interstitial Mesh -- 4.2.4 Hypercube Network -- 4.2.5 Cube-Connected Cycles Networks -- 4.2.6 Loop Networks -- 4.2.7 Ad Hoc Point-to-Point Networks -- 4.3 Fault-Tolerant Routing -- 4.3.1 Hypercube Fault-Tolerant Routing -- 4.3.2 Origin-Based Routing in the Mesh -- 4.4 Further Reading -- 4.5 Exercises -- References --
5 Software Fault Tolerance -- 5.1 Acceptance Tests -- 5.2 Single-Version Fault Tolerance -- 5.2.1 Wrappers -- 5.2.2 Software Rejuvenation -- 5.2.3 Data Diversity -- 5.2.4 Software Implemented Hardware Fault Tolerance (SIHFT) -- 5.3 N-Version Programming -- 5.3.1 Consistent Comparison Problem -- 5.3.2 Version Independence -- 5.4 Recovery Block Approach -- 5.4.1 Basic Principles -- 5.4.2 Success Probability Calculation -- 5.4.3 Distributed Recovery Blocks -- 5.5 Preconditions, Postconditions, and Assertions -- 5.6 Exception-Handling -- 5.6.1 Requirements from Exception-Handlers -- 5.6.2 Basics of Exceptions and Exception-Handling -- 5.6.3 Language Support -- 5.7 Software Reliability Models -- 5.7.1 Jelinski-Moranda Model -- 5.7.2 Littlewood-Verrall Model -- 5.7.3 Musa-Okumoto Model -- 5.7.4 Model Selection and Parameter Estimation -- 5.8 Fault-Tolerant Remote Procedure Calls -- 5.8.1 Primary-Backup Approach -- 5.8.2 The Circus Approach -- 5.9 Further Reading -- 5.10 Exercises -- References --
6 Checkpointing -- 6.1 What Is Checkpointing? -- 6.1.1 Why Is Checkpointing Nontrivial? -- 6.2 Checkpoint Level -- 6.3 Optimal Checkpointing - An Analytical Model -- 6.3.1 Time Between Checkpoints - A First-Order Approximation -- 6.3.2 Optimal Checkpoint Placement -- 6.3.3 Time Between Checkpoints - A More Accurate Model -- 6.3.4 Reducing Overhead -- 6.3.5 Reducing Latency -- 6.4 Cache-Aided Rollback Error Recovery (CARER) -- 6.5 Checkpointing in Distributed Systems -- 6.5.1 The Domino Effect and Livelock -- 6.5.2 A Coordinated Checkpointing Algorithm -- 6.5.3 Time-Based Synchronization -- 6.5.4 Diskless Checkpointing -- 6.5.5 Message Logging -- 6.6 Checkpointing in Shared-Memory Systems -- 6.6.1 Bus-Based Coherence Protocol -- 6.6.2 Directory-Based Protocol -- 6.7 Checkpointing in Real-Time Systems -- 6.8 Other Uses of Checkpointing -- 6.9 Further Reading -- 6.10 Exercises -- References --
7 Case Studies -- 7.1 NonStop Systems -- 7.1.1 Architecture -- 7.1.2 Maintenance and Repair Aids -- 7.1.3 Software -- 7.1.4 Modifications to the NonStop Architecture -- 7.2 Stratus Systems -- 7.3 Cassini Command and Data Subsystem -- 7.4 IBM G5 -- 7.5 IBM Sysplex -- 7.6 Itanium -- 7.7 Further Reading -- References --
8 Defect Tolerance in VLSI Circuits -- 8.1 Manufacturing Defects and Circuit Faults -- 8.2 Probability of Failure and Critical Area -- 8.3 Basic Yield Models -- 8.3.1 The Poisson and Compound Poisson Yield Models -- 8.3.2 Variations on the Simple Yield Models -- 8.4 Yield Enhancement Through Redundancy -- 8.4.1 Yield Projection for Chips with Redundancy -- 8.4.2 Memory Arrays with Redundancy -- 8.4.3 Logic Integrated Circuits with Redundancy -- 8.4.4 Modifying the Floorplan -- 8.5 Further Reading -- 8.6 Exercises -- References --
9 Fault Detection in Cryptographic Systems -- 9.1 Overview of Ciphers -- 9.1.1 Symmetric Key Ciphers -- 9.1.2 Public Key Ciphers -- 9.2 Security Attacks Through Fault Injection -- 9.2.1 Fault Attacks on Symmetric Key Ciphers -- 9.2.2 Fault Attacks on Public (Asymmetric) Key Ciphers -- 9.3 Countermeasures -- 9.3.1 Spatial and Temporal Duplication -- 9.3.2 Error-Detecting Codes -- 9.3.3 Are These Countermeasures Sufficient? -- 9.3.4 Final Comment -- 9.4 Further Reading -- 9.5 Exercises -- References --
10 Simulation Techniques -- 10.1 Writing a Simulation Program -- 10.2 Parameter Estimation -- 10.2.1 Point Versus Interval Estimation -- 10.2.2 Method of Moments -- 10.2.3 Method of Maximum Likelihood -- 10.2.4 The Bayesian Approach to Parameter Estimation -- 10.2.5 Confidence Intervals -- 10.3 Variance Reduction Methods -- 10.3.1 Antithetic Variables -- 10.3.2 Using Control Variables -- 10.3.3 Stratified Sampling -- 10.3.4 Importance Sampling -- 10.4 Random Number Generation -- 10.4.1 Uniformly Distributed Random Number Generators -- 10.4.2 Testing Uniform Random Number Generators -- 10.4.3 Generating Other Distributions -- 10.5 Fault Injection -- 10.5.1 Types of Fault Injection Techniques -- 10.5.2 Fault Injection Application and Tools -- 10.6 Further Reading -- 10.7 Exercises -- References --
Index.
Abstract:
There are many applications in which the reliability of the overall system must be far higher than the reliability of its individual components. In such cases, designers devise mechanisms and architectures that allow the system to either completely mask the effects of a component failure or recover from it so quickly that the application is not seriously affected. This is the work of fault-tolerance designers, and it is increasingly important and complex, not only because of the growing number of "mission-critical" applications but also because the diminishing reliability of hardware means that even systems for non-critical applications will need to be designed with fault tolerance in mind. Reflecting the real-world challenges faced by designers of these systems, this book addresses fault-tolerance design with a systems approach to both hardware and software. No other text on the market takes this approach, nor offers the comprehensive and up-to-date treatment Koren and Krishna provide. Students, designers, and architects of high-performance processors will value this comprehensive overview of the field.
* The first book on fault-tolerance design with a systems approach
* Comprehensive coverage of both hardware and software fault tolerance, as well as information and time redundancy
* Incorporated case studies highlight six different computer systems with fault-tolerance techniques implemented in their design
* A complete ancillary package is available to lecturers, including an online solutions manual for instructors and PowerPoint slides
Local Note:
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2017. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.