By now, you’ve likely heard the term ‘fault tolerance’ at least once. For those of you unfamiliar with its meaning, it refers to a computer system’s capacity to continue functioning in the event of hardware or software failure.
In essence, the system ‘tolerates’ a fault that would bring other systems to their knees. Sounds pretty simple…right?
There’s actually a lot more to fault tolerance than meets the eye – and it’s considerably more important than one might expect at first glance. Let’s talk about that. We’re going to go over a few of the facts you need to understand about fault tolerance.
Fault Tolerance And High Availability Are Not The Same Thing
A fault tolerant system isn’t necessarily the same thing as a high availability system – though the two are definitely closely related. The process of making components of a system more resilient is certainly an element of availability, but there’s more to it than that. Eliminating your single points of failure is only the first step.
A highly-available network can’t just be fault tolerant – it also needs to be able to recover with exceptional speed from a failure. Patches, data migrations, and application updates need to be carried out in real time and without significantly impacting performance. In short, it needs to keep running no matter what maintenance you need to carry out.
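To see why recovery speed matters as much as avoiding failures, here is a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The failure and repair figures are made-up illustrations, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given its mean time between
    failures (MTBF) and mean time to recover (MTTR), both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: both systems fail about once every 1,000 hours,
# but one takes 4 hours to recover and the other takes 3 minutes.
slow_recovery = availability(1000, 4)
fast_recovery = availability(1000, 0.05)

print(f"Slow recovery: {slow_recovery:.4%}")
print(f"Fast recovery: {fast_recovery:.4%}")
```

With these assumed figures, the slow-recovering system lands at roughly 99.6% availability and the fast one at roughly 99.995%, even though both fail equally often.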
Building A Truly Fault-Tolerant System Is Quite Costly
Adding redundant components has an irksome tendency to drive up the price of a system – to say nothing of the costs associated with crafting a software solution designed to keep the oft-complicated working parts of today’s networks and platforms functional. Taken together, all this means one thing: a pretty hefty capital investment.
Thankfully, unless you’re building all the infrastructure yourself, your host usually handles the heavy lifting here.
It’s A Lot More Complex Than You’d Expect – Especially On The Software Side
These days, designing a system to be fully fault-tolerant isn’t simply a matter of making sure it’s redundant. As evidenced by Netflix engineer Ben Christensen’s piece on the service’s official blog, the difficulty of keeping a system available and fault tolerant climbs steeply as that system becomes more complicated. As such, there’s a good chance that if you’re running a multi-platform service (or simply operating on a large network), ensuring everything is redundant and available could turn into an absolute nightmare.
Being Fault Tolerant Doesn’t Mean You Won’t Lose Performance
The misconception a lot of users have about a fault tolerant system is that they’ll be absolutely fine in the event of a failure – that everything will continue operating as normal. This simply isn’t true. When a component fails, a fault tolerant system will certainly be able to keep running, but there’s also a good chance that performance will start to degrade until the busted component is replaced.
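A rough way to picture that degradation: when one node in a cluster drops out, the survivors have to absorb its share of the work. The sketch below assumes load is spread evenly across identical nodes, and the cluster size and utilization figures are hypothetical.

```python
def utilization_after_failures(nodes: int, failed: int, utilization: float) -> float:
    """Per-node utilization once the surviving nodes absorb the load
    of the failed ones (assumes identical nodes and an even spread)."""
    surviving = nodes - failed
    if surviving <= 0:
        raise ValueError("no surviving nodes left to carry the load")
    return utilization * nodes / surviving


# Hypothetical cluster: 4 nodes, each running at 60% of capacity.
one_down = utilization_after_failures(4, 1, 0.60)
two_down = utilization_after_failures(4, 2, 0.60)

print(f"One node down:  {one_down:.0%} per node")
print(f"Two nodes down: {two_down:.0%} per node")
```

Under these assumptions, losing one node pushes the rest to about 80% of capacity, which is still serviceable but slower. Losing two pushes them past 100%, at which point the system is technically up but can no longer keep pace with demand.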
Without Fault Tolerance, Everything Else Comes Crashing Down
Last but certainly not least…fault tolerant systems are essential. Don’t believe me? Just look at the outage experienced by Netflix a few years back – and how damaging it was to both the company and its customers.
That outage was one of the biggest Netflix had ever experienced. The culprit was the Amazon Web Services Elastic Load Balancer – a maintenance process inadvertently deleted vital data, which eventually caused the entire platform to seize up. This in turn resulted in a massive loss of service for customers across the world. Although Netflix was designed to be fault tolerant, it wasn’t made to deal with such a catastrophic failure.
As a result, its subscribers were left without access to the service on Christmas Eve – typically one of the platform’s busiest times of year.
“The Netflix API interacts with dozens of systems in our service-oriented architecture, which makes it inherently more vulnerable to any system failure or latencies across the stack,” explains Christensen. “Intermittent failure is guaranteed with this many variables, even if every dependency itself has excellent availability and uptime.”
“Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month,” he continues.
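Christensen’s figure is easy to check for yourself. The sketch below assumes that the dependencies fail independently, that a request needs every one of them to succeed, and that a month is 30 days long. Those assumptions are ours for the sake of illustration, not details taken from his post.

```python
def combined_availability(per_dependency: float, count: int) -> float:
    """Availability of a request that needs every dependency to be up,
    assuming the dependencies fail independently of one another."""
    return per_dependency ** count


HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

combined = combined_availability(0.9999, 30)
downtime_hours = (1 - combined) * HOURS_PER_MONTH

print(f"Combined availability: {combined:.4%}")
print(f"Expected downtime: {downtime_hours:.2f} hours per month")
```

Thirty dependencies at 99.99% each work out to roughly 99.7% combined availability, which is a little over two hours of downtime a month and in line with the quote.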
In short, if you’re running any sort of online platform or service, fault tolerance needs to be an element of your planning – otherwise, you’re bound to fail.
Image credit: The National Archives (UK)