By Benjamin Hartley, December 19, 2008
It's 3 a.m. on a weekend. One of your SharePoint servers just went down and you don't know why. Employees are on the phone, demanding the server be brought back up – now! Even though it's outside normal business hours, even though there's no one in the office, the server still needs to be up. Online collaboration can be a double-edged sword: users can do their work anytime and anywhere, but a server outage is always a major problem, even in the middle of the night or on a holiday when people don't normally work. Outages must always be addressed immediately.
The best way to deal with outages is to prevent them. Prevention is often a thankless task – no one notices when the servers don't go down, after all. It's a rewarding activity all the same, repaid in system reliability, long uptimes, and uninterrupted work. While infrastructure and hardware problems tend to get the most notice, software-related problems are far more likely to be the culprit in an outage. Most often, the basic problem is that the software hit a situation it was never designed for – in other words, the server crashed. Usually, server crashes can be avoided with appropriate testing and maintenance. A few minutes of work each day can prevent hours or even days of downtime later – or even prevent security breaches!
Modern computer software is terribly complicated. SharePoint, for example, requires at minimum that Windows Server and IIS already be installed on the server where it runs. Each of these pieces of software requires other software to be installed, which requires still other software… In short, any server will have several services running and dozens of programs installed. All of this software interacts in extremely complex ways. Sometimes software ends up behaving in a way that was never anticipated. The computer industry calls these unanticipated behaviors "bugs", and they're resolved by updating the system with a patch.
Every system administrator is familiar with "Black Tuesday" or "Patch Tuesday", the second Tuesday of the month and the day on which Microsoft releases its regular patches. These patches are crucial to both security and availability. Without them, unexpected errors may crash your servers, or hackers may be able to exploit recently discovered vulnerabilities. Applying patches often means taking a server offline briefly, and while downtime is generally to be avoided, applying patches regularly is essential. However, as with any new software installation, patches must be tested before being applied to the production platform.
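
Because Patch Tuesday always falls on the second Tuesday of the month, its date can be worked out ahead of time when planning test and maintenance windows. The following is a minimal sketch in Python; the function name `patch_tuesday` is made up for this illustration and is not part of any Microsoft tool.

```python
import calendar
from datetime import date


def patch_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month (hypothetical helper)."""
    # weekday() numbers the days Monday == 0 ... Sunday == 6.
    first_weekday = date(year, month, 1).weekday()
    # Number of days from the 1st of the month to the first Tuesday (0 to 6).
    days_to_first_tuesday = (calendar.TUESDAY - first_weekday) % 7
    # The second Tuesday is exactly one week after the first.
    return date(year, month, 1 + days_to_first_tuesday + 7)


print(patch_tuesday(2008, 12))  # 2008-12-09
```

A date like this could feed a calendar reminder so that testing time is reserved before the patches arrive rather than after.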
An old adage in the computing field holds that "The biggest cause of unscheduled outages is scheduled outages". Often, a scheduled software update causes more problems than it fixes and the system cannot be easily restored. There's no good time to make that discovery, but the worst place to make it is on the production platform. While it may seem like an excessive expense, a development and test platform is key to assuring continued availability of a production platform. Would you buy a car without driving it? Rent an apartment without seeing it? Then why would you apply untried software to a business server? When looking to install new software, the only reasonable thing to do is to install it on a test platform first.
Ideally, the test platform should be functionally identical to the production platform. This may not always be possible, as a complete replica of the production platform is exceptionally expensive. Even so, a best effort is certainly called for here. It is not sufficient merely to install new software. The software must be tested. It must be used, and used under conditions which accurately reproduce the production environment. Even minor differences between the two platforms may conceal serious flaws, so it's important to make the test platform as realistic as possible.
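
One simple way to watch for differences between the two platforms is to compare the list of patches installed on each. The sketch below assumes those lists have already been collected by some other means; the function name `compare_patch_levels` and the patch identifiers are invented for illustration and do not correspond to real updates.

```python
from typing import Dict, List, Set


def compare_patch_levels(production: Set[str], test: Set[str]) -> Dict[str, List[str]]:
    """Report patches present on one platform but not the other."""
    return {
        "only_on_production": sorted(production - test),
        "only_on_test": sorted(test - production),
    }


# Made-up inventories for illustration only; these are not real patch IDs.
production_patches = {"PATCH-001", "PATCH-002", "PATCH-003"}
test_patches = {"PATCH-001", "PATCH-002", "PATCH-004"}

drift = compare_patch_levels(production_patches, test_patches)
for side, patches in drift.items():
    print(side, patches)
```

In this made-up case, a patch that exists only on the test platform is probably the one currently under evaluation, which is expected. A patch that exists only on production is the more worrying result, since it means the test platform no longer reflects what users are actually running.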
Regularly updating your SharePoint servers does not guarantee they will not go down. There are always unexpected situations, even in software. But to ignore known issues in the hope that they won't crop up would be very poor practice. Every system administrator needs to check regularly for updates and known problems with their systems, and address these problems as they are discovered. It's impossible to gauge how much downtime is prevented by such regular work, but there's no doubt that it's worth it. Regular updating and patching keeps servers up, even in the middle of the night. That means a company's road warriors and night owls can be working productively – while the system administrators get some well-earned sleep!