Sunday, July 09, 2006

The night the lights went out in the server room.

Well, it had to happen. We completely lost power to our server room Wednesday morning. Kaput! Wish I could say that we were prepared, that the failover process kicked in, and that we were humming along just fine.

Nope. No such luck. We do have a newly installed centralized UPS, which replaced our individual server UPSes. The problem was that the new UPS was not fully configured to gracefully shut down our servers if power was not restored within an acceptable timeframe. Apparently the power failure occurred around 5:45 am, but because we are not a 24x7 operation, nobody knew until the first shift started at 6:30 am, even though alarms were beeping. The culprit was the circuit breaker for the power feeding the UPS, which in turn powers all of our servers. The power in the rest of the building was just fine.
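For what it's worth, the behaviour we were missing boils down to a simple timer: once the UPS reports it is on battery, wait out an acceptable grace period, and if utility power still has not returned, start shutting servers down in order from least to most critical. Below is a rough sketch of that logic in Python; the UPS name, the server list, and the grace period are made-up examples, and it assumes the Network UPS Tools upsc client for the status check, which is not necessarily what our own UPS software uses.

    import subprocess
    import time

    # Hypothetical shutdown order: least critical first, so the most
    # important servers stay on battery the longest.
    SHUTDOWN_ORDER = ["test-box", "file-server", "app-server", "db-server"]

    GRACE_PERIOD = 15 * 60   # seconds to wait for utility power to return
    POLL_INTERVAL = 30       # seconds between UPS status checks


    def on_battery():
        # Ask the UPS for its status via Network UPS Tools' upsc client.
        # "serverups" is a made-up UPS name; "OB" means "on battery".
        status = subprocess.check_output(
            ["upsc", "serverups@localhost", "ups.status"], text=True)
        return "OB" in status.split()


    def graceful_shutdown(host):
        # Placeholder: issue whatever remote shutdown command your
        # platform uses (e.g. ssh + shutdown, or the vendor's agent).
        print("shutting down " + host)


    def main():
        while True:
            if on_battery():
                deadline = time.time() + GRACE_PERIOD
                while time.time() < deadline:
                    if not on_battery():      # power came back in time
                        break
                    time.sleep(POLL_INTERVAL)
                else:                          # grace period expired
                    for host in SHUTDOWN_ORDER:
                        graceful_shutdown(host)
                    return
            time.sleep(POLL_INTERVAL)


    if __name__ == "__main__":
        main()

The point is not the script itself but the ordering and the timer: the decision of what gets shut down, and when, should be made in advance rather than at 6:30 in the morning with alarms beeping.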

It took us until 5:00 am the following morning to recover all servers and verify that all systems (with some exceptions) were a go and that interrupted processing was rolled back or at least recovered. That’s a total of almost 24 hours of down-time! Good thing we are a government organization, otherwise we would have been out of business. The biggest impact was to our Distribution business areas, where we had to send the warehouse folks home since they are unable to work without the systems, but we were able to get the stores fully operational by 1:00 pm (when power was restored). The stores were able to function before that, with the exception of taking customer gift cards: gift card validation is done by a third party, but the transactions are routed via Head Office, which was down because the network hub and switches are also located in the server room.

The recovery process should not have taken so long, but there were instances where nothing could be done other than to wait. For example, it took a while for the electrician to pinpoint the circuit breaker as the culprit, and then he had to go offsite to get a replacement breaker; it was 1:00 pm by the time power was restored. Once that hurdle was dealt with, it was time for the systems folks to get down to work. While waiting for the replacement breaker, we worked out the priorities and tasks needed for recovery, identifying the order in which servers/systems needed to be brought back online. We had also contacted our vendors' support to give them a heads-up and to reduce the turnaround time if we did need their assistance. Unfortunately, one of them did not come through as needed, which further delayed our recovery by at least two and a half hours.

To add to our woes, our SAN management servers crashed while we were recovering, which set us back another hour or so. The systems were functional by around 11 pm, and the verification began and didn’t finish until 4:45 am! By the time I headed home around 5:15 am for a shower and a change of clothes, we were almost fully functional except for two of our systems which are non-critical, and we were back to “normal” by 10 am. So the total down-time from powered down to “normal” was from 6:45 am to 10 am the following day, for a total of 27 hours and 15 minutes! Imagine that! There will be definite changes once we complete our post-mortem review, but it was quite an experience.