By Benjamin Hartley, January 30, 2009
Outages happen. Sometimes they are avoidable, but often not. Careful planning, good system administration, and a solid infrastructure can absorb short-term power outages, software failures, and even security breaches. Unfortunately, no amount of planning can fully meet the challenges posed by hurricanes, earthquakes, building fires, or other large-scale catastrophes. Sometimes the only option is to rebuild. No matter how important an IT infrastructure is, a company is built on people; the hardware can and should be replaced. After all, with a good online collaboration platform, a company can even function without offices!
A server doesn't replace itself, though. There needs to be a redundancy and replacement plan as part of any continuity of operations plan (COOP). There are three strategies for platform redundancy: hot, warm, and cold sites. Hot sites are kept fully functional at all times, ready to take over at a moment's notice. A cold site, by contrast, requires significant effort to bring online, but at suitably reduced expense. A warm site, of course, is somewhere between the two, taking less effort to bring online than a cold site, but costing less than a hot site.
A "hot" server setup is one in which a standby server or server farm is capable of seamlessly taking over if the primary servers fail. Ideally, the hot site is physically separate from the primary site, ensuring that the system continues to function even after the most catastrophic failure at the main location. Unfortunately, such a setup is extremely expensive. It requires complete duplication of the server hardware, an actual second server site, and a secure WAN connection over which to replicate the data. The network setup for a truly seamless failover is also very difficult, tolerating almost no latency and often requiring a great deal of bandwidth. The hardware costs alone are prohibitive, and the maintenance costs and administrators' time make this even more draining. However, this is really the only way to achieve near-absolute reliability; it is the redundancy method used by telecommunications networks that pursue "five 9s" (99.999%) availability. Unless that near-perfect reliability is necessary, a hot site backup option is seldom used.
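At its core, the failover decision at a hot site comes down to detecting that the primary has gone silent. A minimal sketch of a heartbeat check, where the primary records a timestamp on each heartbeat and the standby promotes itself once the heartbeat goes stale (the timeout value and function name here are illustrative, not from any particular product):

```python
import time

# Seconds of heartbeat silence before the standby declares the primary dead.
# Real deployments tune this against network jitter; 5.0 is illustrative.
HEARTBEAT_TIMEOUT = 5.0

def should_promote(last_heartbeat: float, now: float = None) -> bool:
    """Return True if the standby should take over from the primary.

    last_heartbeat: epoch timestamp of the primary's most recent heartbeat.
    """
    if now is None:
        now = time.time()
    return (now - last_heartbeat) > HEARTBEAT_TIMEOUT
```

A heartbeat received two seconds ago keeps the standby passive; one last seen ten seconds ago triggers promotion. The hard part in practice is not this check but avoiding "split brain", where both sites believe they are primary.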
The technical issues of a hot site are often too great to resolve easily. In such cases, a "warm" site may be preferred. A warm site is much like a hot site, but takes more work to bring online. Like a hot site, a warm site has identical dedicated hardware; unlike a hot site, the backup servers are not constantly running. They are generally ready to run, but require some intervention and thus cannot accomplish a seamless failover. In some cases the warm site needs the most recent incremental backup applied before it is current. Once the servers are ready to take over, networking changes must be made to redirect requests to the backup servers. Warm sites are much easier to maintain than hot sites: they require neither the constant effort of maintaining replication nor the painstaking network configuration needed for an immediate transfer. However, a warm site may take anywhere from several minutes to several hours to fully take over during an outage. A warm site also carries most of the hardware costs of a hot site, although more compromises can be made than with a hot backup: less important servers might be omitted, or older servers used. As an additional advantage, a warm site is often a very useful platform for testing updates, patches, and new software, and for confirming that backups actually restore.
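Part of bringing a warm site current is applying backups in the right order: the most recent full backup first, then every incremental taken after it, oldest to newest. A small sketch of that planning step, assuming backups are tracked as (timestamp, kind) pairs (the representation is hypothetical, not tied to any backup tool):

```python
def restore_order(backups):
    """Plan a warm-site restore.

    backups: list of (timestamp, kind) pairs, kind being 'full' or 'incr'.
    Returns the timestamps to apply in order: the latest full backup,
    followed by every later incremental, oldest first.
    """
    fulls = [t for t, kind in backups if kind == "full"]
    latest_full = max(fulls)  # restore starts from the newest full backup
    incrs = sorted(t for t, kind in backups if kind == "incr" and t > latest_full)
    return [latest_full] + incrs
```

Applying an incremental out of order, or skipping one, can silently corrupt the restore, which is one reason the drill should be scripted rather than improvised during an outage.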
Due to the prohibitive expense of both hot and warm site setups, which call for dedicated backup servers, most organizations prefer a "cold" server plan. A cold site backup means that the backup servers are not configured in advance to take over. Usually these are servers hosting less important applications which can be co-opted if a more important application fails. While cold site plans are definitely the least expensive option, the downtime while servers are being configured may be unacceptable in some cases. Also, since the backup servers are very likely not identical, or even all that similar, to the production servers, special care must be taken to ensure that they can do the job. Even so, bringing such a site online will very likely involve a considerable amount of work.
In some cases a company will keep no replacement hardware at all. Even then, it is important to maintain a clear (and offsite) record of what hardware is necessary. With such a record, and with funding available, it is possible to purchase new hardware and bring it online in a surprisingly short period of time. The key here is knowing exactly what is needed. After a catastrophic loss of hardware there is no time to perform a detailed study, spec out the perfect system, compile a list of potential vendors, compare quotes, gain funding approval, and only then order the new equipment. Instead, you need to be ready to order the new equipment as soon as possible. This calls for a great deal of preparation – the hardware must already be selected, the vendor should be known, and, most importantly, management must be ready to act quickly. For organizations which cannot afford to keep unused equipment, or when a few days of downtime won't be a disaster, this is a great way to save money. Unfortunately, the sometimes excessive delay can more than cancel any savings; an organization must carefully weigh the potential costs before choosing to forgo redundancy altogether.
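That offsite record works best when it is structured enough to turn into a purchase order on the spot. A hypothetical manifest, with all roles, models, vendors, and quantities invented purely for illustration:

```python
# Hypothetical offsite hardware manifest: just enough detail that
# replacements can be ordered immediately after a total loss.
MANIFEST = [
    {"role": "web server", "model": "1U dual-Xeon, 8 GB RAM", "vendor": "Vendor A", "qty": 2},
    {"role": "database", "model": "2U quad-core, 32 GB RAM, RAID 10", "vendor": "Vendor A", "qty": 1},
    {"role": "firewall", "model": "rackmount appliance", "vendor": "Vendor B", "qty": 1},
]

def purchase_order(manifest):
    """Summarize the manifest as vendor -> total units to order."""
    totals = {}
    for item in manifest:
        totals[item["vendor"]] = totals.get(item["vendor"], 0) + item["qty"]
    return totals
```

Whatever the format, the record must live offsite (and ideally on paper as well), since the catastrophe that destroys the servers may destroy the document describing them.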
These options allow an organization to balance cost with effectiveness. In all cases, however, a significant amount of advance planning and preparation is necessary. Regardless of which redundancy method is used, careful attention must be paid to the backup hardware, to make certain it can run the necessary software and has sufficient storage. Also, the redundant servers must be tested regularly to ensure that they really will work and that the failover process succeeds. Since in an emergency these procedures might be carried out by people who are not familiar with the system, they must be extremely well documented.
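One way to make failover tests both regular and usable by unfamiliar staff is to script the drill as a sequence of named checks that simply pass or fail. A minimal sketch, with the individual checks stubbed out since they depend entirely on the environment:

```python
def run_drill(steps):
    """Run a failover drill.

    steps: list of (name, check) pairs, where each check is a callable
    returning True on success. A check that raises counts as a failure,
    so a broken environment still yields a readable report.
    Returns a {name: passed} dict.
    """
    results = {}
    for name, check in steps:
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results

# Example drill; real checks would ping servers, restore a backup, etc.
drill = [
    ("backup server boots", lambda: True),
    ("latest backup restores", lambda: True),
    ("application responds", lambda: True),
]
```

Keeping the drill's output as a simple pass/fail report per step doubles as the documentation trail the previous paragraph calls for.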
Redundant systems allow for a much higher overall availability than would normally be possible with off-the-shelf hardware and software. There should be a plan in place for, at minimum, the replacement of any system, with more important systems receiving greater investment. Most importantly, redundancy and failover must be planned well in advance – the day the building collapses is much too late to be asking such questions. As always, it pays to plan ahead. The better the planning, the sooner the IT infrastructure can be rebuilt – and the sooner the workers can be online and productive again.