Our cloud platform supports business-critical operations for a huge number of organizations. Those organizations can’t risk having their services offline for even a second, but no server, hard drive, or network connection is 100% reliable — components fail and they fail unpredictably.
I’d like to take a look at some of the work we’ve done to make sure the inevitable failure of a component of our cloud platform has no effect on the performance and uptime of the sites and services it hosts.
A cloud platform is built on a complex foundation of hardware and software. Because all of these parts are interdependent, a failure of any component could mean the failure of the system as a whole. A single point of failure puts the whole system at risk. But a system designed for high availability uses redundancy to ensure that there are no single points of failure.
Our goal is to make infrastructure deployment and management as easy and reliable as possible. Clients can deploy servers onto our cloud platform with a click of a button or an API request. But beneath the interface that clients see is a lot of complex engineering that includes physical servers, storage arrays, network hardware, external network connections, load balancers, a virtualization layer, and an extensive software stack.
These components depend on each other, and, without redundancy, a failure in any one could mean the failure of the platform as a whole. If, for example, a fault develops in the storage that holds an important client database, the effectiveness of the entire cloud platform could be compromised for that client. The same is true of network connections: if a network connection fails, any sites and services running in the cloud could be cut off from the outside world.
Any component essential to the health of the entire system is a single point of failure. Traditional hosting environments are riddled with single points of failure. Consider the shared hosting environment used by many low-traffic websites: typically, each server will be crammed with as many sites as possible. If a fault develops with any part of the server, all of those sites will be offline.
Clearly, that’s not acceptable for business-critical services running in the cloud, which is why our cloud platform is designed to offer high availability.
A system designed for high availability uses redundancy to ensure that there are no single points of failure. It’s impossible to guarantee that any single component will stay reliable over the long term. In fact, you can guarantee that some part of the system will fail at some point. High availability systems don’t put their trust in the reliability of any one component. Rather, the system as a whole is designed to be reliable.
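To put rough numbers on that idea, here is a minimal sketch of the arithmetic, assuming components fail independently of one another. The 99% availability figures and the function names are hypothetical and chosen only for illustration; they are not measurements from our platform.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of a group that works while at least one copy is up.

    Assumes the copies fail independently of each other.
    """
    return 1 - (1 - availability) ** copies


# Hypothetical figures: a server, a storage array, and a network link,
# each available 99% of the time.
single = [0.99, 0.99, 0.99]

# With no redundancy, all three must be up at once.
chain = series_availability(single)

# With two independent copies of each component.
redundant_chain = series_availability(
    redundant_availability(a, 2) for a in single
)

print(f"no redundancy:   {chain:.6f}")
print(f"with redundancy: {redundant_chain:.6f}")
```

Under these assumptions, chaining three 99% components gives about 97% overall, while doubling each component lifts the chain to roughly 99.97%. Real failures are not always independent, so treat this as an upper bound on what redundancy alone can deliver.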
Every part of our high availability cloud has redundant backup systems and a failover mechanism. If one of our servers develops a fault, our system will detect the failure, remove the server from the pool, and transfer all operations to a redundant server. The same is true of network connections, storage, and various other parts of our cloud platform. We have engineered the platform so that the failure of one or several components doesn’t reduce the reliability of the system as a whole.
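The detect, remove, and transfer sequence can be illustrated with a small sketch. This is not our production failover code; the ServerPool class, the health check callable, and the server names are all hypothetical, and a real system would run health checks continuously rather than on each request.

```python
class ServerPool:
    """A toy pool that routes work only to servers that pass a health check."""

    def __init__(self, servers, health_check):
        self.active = list(servers)
        # health_check is an assumed callable: server name -> True if healthy.
        self.health_check = health_check

    def remove_failed(self):
        """Detect unhealthy servers and drop them from the pool."""
        failed = [s for s in self.active if not self.health_check(s)]
        self.active = [s for s in self.active if s not in failed]
        return failed

    def route(self, request_id):
        """Pick a healthy server for a request, failing over if needed."""
        self.remove_failed()
        if not self.active:
            raise RuntimeError("no healthy servers available")
        return self.active[request_id % len(self.active)]


# Hypothetical servers; the dict stands in for a real health probe.
status = {"web-1": True, "web-2": True}
pool = ServerPool(status, status.get)

before = pool.route(0)      # both servers healthy
status["web-1"] = False     # simulate a fault on web-1
after = pool.route(0)       # request moves to the redundant server

print(before, after)
```

Here the request that first lands on web-1 is served by web-2 once web-1 fails its check. Note that the pool still needs at least one healthy member, which is why redundancy has to exist at every layer and not only at the server level.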