Last weekend saw one of the most catastrophic IT failures to directly affect the public, ever.
Just in case you have been off planet, starting early in the morning of Saturday May 27th, news began to emerge of a total outage of the booking and associated flight management systems of British Airways. The issue escalated, escalated and, you guessed it, escalated, so much that all flights were cancelled (for 3 hours, then 6 hours, then a whole day, then a couple of days, then the whole of the bank holiday weekend).
That is right, no flights out of Heathrow or Gatwick on the Bank Holiday at the start of the half-term break.
The reputational damage to BA is obviously huge. They would be better off dragging a superfluity (Google it) of nuns by their crucifixes from a plane bound for Lourdes whilst filming an episode of Airport.
As the scale of the total ‘Foxtrot Uniform Charlie Kilo Uniform Papa’ unfolded, we began to hear messages that the issue was to do with a ‘power surge’.
Upon hearing this, the collective IT community stroked its chin, tutted and to a person muttered ‘not bleeding likely’. Friends and family of this blog may or may not have worked on the DataCentres concerned in this very incident, and may or may not be fairly certain that all the power shizzle one might expect (surge protection, UPS, twin feeds etc.) is in place in the aforementioned DCs.
A statement from the company goes like this:
Bill Francis, Head of Group IT at BA’s owner International Airlines Group (IAG), has sent an email to staff saying an investigation so far had found that an Uninterruptible Power Supply to a core data centre at Heathrow was over-ridden on Saturday morning.
He said: “This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries. This in turn meant that the controlled contingency migration to other facilities could not be applied.
“After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system, and significantly exacerbated the problem.
“This was entirely a problem relating to the power supply. It was not an IT failure, and there were no software issues.”
We don’t know what over-riding a UPS is, however… we have experienced significant issues at a DataCentre we used to frequent (Tango Tango) when they were testing the left/right power supplies and hadn’t done their sums correctly. They turned off the left side, the right side went up in flames and caused a power spike blowing up a load of kit. This was not good and is about the only plausible power story for the loss of a single BA DC. However, you can’t tell us that BA has just the one DC.
So, if BA lost one side of its power supply and had not allowed for the load on the other side, this may have caused some problems, perhaps exacerbated by the restoration of power on the other feed. You can imagine servers, storage and networking kit having a proper meltdown.
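For anyone wondering what ‘doing your sums correctly’ looks like, here is a minimal sketch of the check in Python. Everything in it is an assumption made for illustration: the function name, the 80% headroom figure and the kilowatt numbers are hypothetical, and none of them come from BA or from any real DataCentre.

```python
def surviving_feed_ok(load_a_kw, load_b_kw, feed_capacity_kw, headroom=0.8):
    """Return True if a single feed can carry the load of both sides.

    headroom is the fraction of a feed's rated capacity we are willing
    to use (0.8 is an assumed figure, not an industry rule).
    """
    combined_kw = load_a_kw + load_b_kw
    return combined_kw <= feed_capacity_kw * headroom


# Hypothetical room: each feed is rated at 100 kW.
# Each side looks comfortable on its own (60 kW and 55 kW), but if one
# feed drops, the survivor is asked for 115 kW and fails the check.
print(surviving_feed_ok(60, 55, 100))  # False

# With 35 kW and 40 kW the survivor carries 75 kW, inside the 80 kW limit.
print(surviving_feed_ok(35, 40, 100))  # True
```

The point is that each side can look perfectly healthy in isolation while the pair is quietly oversubscribed; the check only means something if it is run against the combined load.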
That being said, it is impossible to think that an organisation such as British Airways, which relies so heavily on technology, does not have a tried and tested Business Continuity Plan (BCP), including regular verification of the backups and associated restoration procedures.
The IT community is muttering about this issue having nothing to do with power and something to do with a change gone wrong. Could an organisation the size and importance of BA execute a change that could not be reversed early in the morning of a Bank Holiday Saturday? That would be very silly, wouldn’t it?
And here, we think, comes the tragic bit. It doesn’t really matter if this outage was caused by a power incident in a primary DataCentre, or by a failed change that couldn’t be backed out. The long and short of it is that the business continuity process, systems and the management thereof were left helpless by a predictable event, which must have been planned for.
It is well reported that the new (ish; April 2016) head honcho at BA, Alex Cruz, has a ‘cost sensitive’ style and that somewhere between 600 and 900 IT jobs have been outsourced. Imagine the abject panic of the (outsourced) operations management when they realised that the backups were either not working, not tested, or nowhere to be seen. A truly buttock-clenching moment, only slightly mitigated by the fact that demand for people in the offshore outsourcing world is so great that no matter one’s previous, a job at the outsourcing company in the next building would be assured.
It is really something that in a world of Google, Facebook, Uber etc. we have to listen to the Bravo Sierra of board members who do not have a clue about the systems which power their businesses. There is little technical or security representation on the boards of most large traditional enterprises. IT is seen as a cost which can be reduced, and reduced, until everything goes ‘Tango India Tango Sierra Uniform Papa’.
The operation and execution of information systems within an organisation is a rich vein of under-utilised business opportunity, staffed by fiercely loyal, if sometimes awkward, professionals. Perhaps it is time to start listening to them?
We can only imagine that the number of BCP tests will soar over the next few months and of course the IT staff will be there day and night to execute the plan, even if the shadow of being outsourced is lurking in the corner. In fact, if you are reading this blog next week rather than this, our BCP test hasn’t gone to plan!
If you would like to discuss BCP (our parent company has an investment in some really nifty disc technology which is a bit like Apple’s Time Machine for enterprise storage), contact us at: [email protected] or call 020 7517 3900.