The Basics Of Fault Tolerance

We’ll start with the bare minimum that you should be taking care of, courtesy of an article on Microsoft TechNet.

First things first, I’ve got one word for you: redundancy.

“No single dependency should take down the entire app,” explained Ben Christensen in a 2012 presentation on Netflix’s API. Although he’s talking about app design, his statement most definitely applies to server and hosting infrastructure as well. See, fault tolerance is ultimately all about minimizing your points of failure.

What that means is that, assuming you’re handling server setup and maintenance on your own, you need to analyze every single application, piece of hardware, and network node whose failure could potentially bring your system offline – and you need to make sure that won’t happen.
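If you want a starting point for that analysis, here’s a minimal sketch in Python. It assumes you keep a simple inventory of your components along with how many independent copies of each are running, and it flags anything with no redundant copy. The component names, categories, and counts are made up for illustration, and a real audit would also need to look at how components depend on one another.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    kind: str       # "application", "hardware", or "network"
    instances: int  # number of independent copies currently running


def single_points_of_failure(components: List[Component]) -> List[str]:
    """Return the names of components that have no redundant copy."""
    return [c.name for c in components if c.instances < 2]


# Hypothetical inventory; every name and count here is an assumption.
INVENTORY = [
    Component("web-server", "application", 3),
    Component("database", "application", 1),
    Component("core-switch", "network", 1),
    Component("power-feed", "hardware", 2),
]

if __name__ == "__main__":
    for name in single_points_of_failure(INVENTORY):
        print(f"No redundancy: {name}")
```

Anything this flags is a candidate for a second instance, a failover, or at the very least a documented recovery plan.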

To that end, you need to ensure that you have decent power infrastructure in place. I’m talking enterprise-grade uninterruptible power supplies, regularly tested backup generators, the works. Although power outages may only happen infrequently, you don’t want your servers to be brought down when they do. There’s a lot to consider in terms of power, by the way. A well-designed power system accounts for local power supply failures, voltage variations, and both short- and long-term outages.

Hardware, too, is extremely important. It’s imperative that both servers and networking hardware are constructed with redundancy in mind. Now, it’s worth mentioning here that your host will likely take care of all this stuff for you, assuming you’ve chosen the right one (more on that in a moment).

It also goes without saying that you need to keep your software up to date – especially the stuff related to security. Plenty of outages are caused by an application glitch, after all. Netflix’s Christmas Eve 2012 outage, for instance, was the result of a failure in Amazon Web Services’ Elastic Load Balancing service, which Netflix relied on. Although you aren’t always going to be able to prevent software failure, staying on top of software maintenance will help.

Last, but certainly not least, there’s monitoring. Keep a close watch on your servers in terms of both hardware and software, and make sure there’s a system in place to alert your administrators in the event of failure. Again, depending on your host, this might be taken care of – it’s important to educate yourself in that regard.
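To make that a little more concrete, here’s a rough sketch of what a bare-bones monitoring loop might look like in Python. It isn’t a substitute for a proper monitoring product. It just shows the idea: run a set of health checks, count consecutive failures, and alert an administrator once a check has failed several times in a row. The Monitor class, the threshold of three, and the check names are all assumptions for illustration; the alert function is whatever you plug in (email, SMS, chat).

```python
import socket
from typing import Callable, Dict


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Example health check: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class Monitor:
    """Runs health checks and alerts after repeated consecutive failures."""

    def __init__(self, checks: Dict[str, Callable[[], bool]],
                 alert: Callable[[str], None], threshold: int = 3):
        self.checks = checks
        self.alert = alert          # hypothetical hook: email, SMS, chat...
        self.threshold = threshold  # consecutive failures before alerting
        self.failures = {name: 0 for name in checks}

    def poll(self) -> None:
        """Run every check once and update the failure counters."""
        for name, check in self.checks.items():
            try:
                healthy = check()
            except Exception:
                healthy = False  # a crashing check counts as a failure
            if healthy:
                self.failures[name] = 0
                continue
            self.failures[name] += 1
            # Alert exactly once, when the threshold is first reached.
            if self.failures[name] == self.threshold:
                self.alert(f"{name} failed {self.threshold} checks in a row")


if __name__ == "__main__":
    # Stub checks so the example runs without touching the network.
    monitor = Monitor({"web": lambda: True, "database": lambda: False},
                      alert=print, threshold=3)
    for _ in range(3):
        monitor.poll()
```

In practice you’d swap the stub checks for something like lambda: tcp_port_open("your-server", 443), with your own host name in place of the placeholder, and call poll() on a schedule. Waiting for a few consecutive failures before alerting keeps a single dropped packet from waking anyone up at 3 a.m.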

Anyway, that’s it for the basics – now to delve into some stuff that’s a little more advanced.

Employee Training

I have a fact for you: the majority of data center outages are caused not by power failure, or an application glitch, or a system flaw. They’re caused by human error. By someone doing something they aren’t supposed to.

Now, even though you probably aren’t running your own data center, this fact is definitely applicable to you. You’d be surprised what a simple mistake on the client’s end can do to their server. What that means is that you need to do what you can to make sure everyone working directly with your server is properly trained to do so.

Otherwise, you might find yourself dealing with some unscheduled – and unwelcome – downtime.

Above All, Choose The Right Host

As a lot of you have probably noticed, most of the advice on this page is stuff that’s generally handled by your host. That’s why, above all else, the best thing you can do to make sure your servers are fault tolerant is to choose a host that makes that sort of thing a priority. That way, you can be confident your stuff will be available to you whenever you need it – and that you’re far less likely to have to deal with losing access.