High Availability is not Redundancy
This is about the “A” in the CIA triad of security: Confidentiality, Integrity, Availability.
Just recently I witnessed an incident in which the failure of a supposedly redundant system caused an outage of more than 5 hours in the central IT services of a multinational, intercontinental enterprise. Vital services such as VoIP calls and conference bridges (calls with high-profile customers were interrupted), SAP, e-mail, central file services, CAD, order processing, printing of delivery notes and therefore loading of trucks, and processing of EDIFACT-based orders and invoices were unavailable to most of the 20,000 employees and to customers worldwide during this blackout.
What happened?
Sometime in the morning we noticed a lot of commotion in the department (open-plan office), and quite soon it was obvious that all network-based services were out for a late breakfast: no DNS, no login services, no Active Directory, no IP telephony, no wireless, no Internet, no file servers. Whatever you could think of was unavailable (I hope they at least had a good time with a caffè latte and a croissant). Unthinkable, you might think. Wrong. It’s possible. The network was designed to provide all sorts of redundancy (backup links, re-convergence functions, independent power supplies and lines, redundant network links, dual network rooms, a backup data center, topological independence, whatever), and still the whole system came to a grinding halt.
The central core switch was down. One of the last log messages we could retrieve came from the network management system, which declared the central layer 2/3 switch down. It was a Cisco Catalyst 6500 VSS (Virtual Switching System), which logically combines two independent physical boxes into one virtual switch. A perfect solution, you might think: two independent boxes, sitting in different rooms separated by hundreds of meters, each with backup power systems, each with network links running through different cable ducts. But there is still one dependency that couples those two devices too closely for comfort: the common management.
And this common management was the culprit: a network administrator copy-pasted a simple command sequence into the wrong SSH window: write erase <cr> reload (confirm with “y”). Which means: erase the complete configuration (about 160 kB of ASCII script) and reboot. Because both boxes are managed as one logical switch, the command hit both of them at once. You can imagine that this didn’t turn out well. A simple sequence of roughly 20 bytes brought down the IT services of a large enterprise for half a day.
Restoring a broken Catalyst 6500 VSS can take up to 3 or 4 hours if you get it right the first time. If not, it’s “Back to Start, don’t collect 200”.
Conclusions
- Redundancy implies independence. None of the components in a redundant design may depend on, or share a common feature or resource with, the other components. Evaluate your vital services for such dependencies.
- To be human is to err. A single action by a single person must not be able to affect all components of a redundant system simultaneously.
- Least privilege: grant only the minimum privileges needed for routine management tasks. Reserve special logins for severe actions such as rebooting or changing the configuration.
- Review your emergency plans (Do you have any? Who you gonna call when the phone book of your Unified Communications system is unavailable, the Ghostbusters?). Keep hard copies of important contacts so they can be reached by cellular phone in an emergency.
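The first conclusion, evaluating vital services for shared dependencies, can be sketched in a few lines of Python. This is only an illustration: the inventory below, including every component and resource name in it, is a made-up assumption loosely modelled on the incident, not the real network. The idea is simply to list what each supposedly redundant component relies on and to flag anything that appears more than once.

```python
from collections import defaultdict

# Hypothetical inventory: each redundant component mapped to the resources
# it relies on. All names are illustrative assumptions, not real data.
DEPENDENCIES = {
    "core-switch-a": {"room-1", "power-feed-1", "duct-north", "common-management"},
    "core-switch-b": {"room-2", "power-feed-2", "duct-south", "common-management"},
}


def shared_dependencies(dependencies):
    """Return {resource: sorted component names} for every resource
    that more than one component depends on."""
    users = defaultdict(set)
    for component, resources in dependencies.items():
        for resource in resources:
            users[resource].add(component)
    return {res: sorted(comps) for res, comps in users.items() if len(comps) > 1}


if __name__ == "__main__":
    for resource, components in sorted(shared_dependencies(DEPENDENCIES).items()):
        print(f"shared: {resource} <- {', '.join(components)}")
```

With this made-up inventory, the only resource flagged is the common management, which is exactly the coupling that bit us. A real review would of course need a far more complete inventory (software versions, administrators, credentials, configuration sources and so on).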
Nice PebKaC tale 🙂 and an excellent example of why practicing disaster recovery procedures and having a functional backup pays off.