Fault tolerance

Updated: 07/06/2021 by Computer Hope

Fault tolerance is a quality of a computer system that gracefully handles the failure of component hardware or software. A system can be described as fault tolerant if it continues to operate satisfactorily in the presence of one or more system failure conditions.
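As a rough illustration of that definition, the sketch below keeps a request working when one component fails by falling back to a redundant one. The names call_with_failover, primary, and standby are hypothetical and exist only for this example; a real system would also log each failure and limit how long it waits on each component.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(components: Sequence[Callable[[], T]]) -> T:
    """Try each redundant component in order and return the first success."""
    errors: List[Exception] = []
    for component in components:
        try:
            return component()
        except Exception as exc:  # this component failed; try the next one
            errors.append(exc)
    # Every component failed, so the fault can no longer be tolerated.
    raise RuntimeError(f"all {len(components)} components failed: {errors}")


def primary() -> str:
    # Hypothetical primary component that simulates a hardware failure.
    raise ConnectionError("primary disk unavailable")


def standby() -> str:
    # Hypothetical standby component that is still working.
    return "data from standby"


result = call_with_failover([primary, standby])
print(result)
```

The caller only sees an error when every redundant component has failed, which is the point at which the system stops being fault tolerant for that request.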

Fault tolerance can be achieved by anticipating failures and incorporating preventative measures in the system design. Below are examples of techniques to mitigate and tolerate failure in a computer system.

How to design for fault tolerance

  1. Power failure - Run the computer or network device on a UPS (uninterruptible power supply). Make sure that, during a power outage, the UPS can notify an administrator and properly shut down the computer if power is not restored within a few minutes.
  2. Power surge - If the computer is not connected to a UPS, or the UPS does not provide surge protection, connected devices are not protected. We recommend a surge protector to help protect against power surges.
  3. Data loss - If important information is stored on the computer, run backups daily, or at least monthly. Keep a mirror of the data at an alternate location.
  4. Device or computer failure - Keep a second device, computer, or spare hardware components available in case a failure would otherwise cause long downtime.
  5. Unauthorized access - If connected to a network, set up a firewall.
  6. Frequently check for updates - Make sure the operating system and any running programs have the latest updates.
  7. Lock device or password-protect computer - When not in use, lock the computer, and store the computer or network device in a secure area.
  8. Overload - Set up an alternate computer or network device that can serve as an alternative access point or share the load through load balancing or a round-robin setup.
  9. Virus - Make sure the computer runs antivirus software with up-to-date virus definitions.
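
The round-robin setup mentioned under Overload can be sketched in a few lines. The example below is a minimal illustration, not a production load balancer: the RoundRobinBalancer class and the server names are hypothetical, and the health state is set by hand here, whereas a real deployment would update it from periodic health checks.

```python
from itertools import cycle
from typing import Dict, Iterable, Iterator, List


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        if not self._servers:
            raise ValueError("at least one server is required")
        self._healthy: Dict[str, bool] = {name: True for name in self._servers}
        self._rotation: Iterator[str] = cycle(self._servers)

    def mark_down(self, server: str) -> None:
        self._healthy[server] = False

    def mark_up(self, server: str) -> None:
        self._healthy[server] = True

    def next_server(self) -> str:
        # Inspect each server at most once per call so the loop always ends.
        for _ in range(len(self._servers)):
            candidate = next(self._rotation)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy servers available")


# Hypothetical server names, used only for this example.
balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
balancer.mark_down("server-b")  # simulate a failed device
picks = [balancer.next_server() for _ in range(4)]
print(picks)  # ['server-a', 'server-c', 'server-a', 'server-c']
```

Because the failed server is skipped instead of returned, requests keep being served by the remaining devices, and the load is still spread evenly among them.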

Error, Failover, Fault, Fencing, Network terms, Overload, SPOF