Thursday, January 25, 2007

SAN problems.

It has been a busy January for me. In December 2006, just before the Christmas holidays, we had severe issues with our SAN: outages spanning a week, with the root cause being the Fibre Channel switches resetting themselves at the same time. This caused our Storage Management servers to attempt to take over from each other, resulting in a never-ending loop.

Apparently there was a bug in the firmware of the switches which caused them to reset. A firmware upgrade for the switches was necessary and was planned for the weekend of Jan 20th, 2007.

The moral of the story: even when you are setting up high availability and redundancy, make sure that the redundant components are "timed" differently. There is now an outstanding work request to have one of the Fibre Channel switches reset several months apart from the other, so that its "reset cycle" is different from the other one's.
