Blog Post from our UK partner, KCS Group:
On Saturday 27th May 2017, as many travellers were beginning their journey to enjoy some spring sunshine over the bank holiday weekend, British Airways experienced a catastrophic failure of their IT systems leading to a virtual grounding of the national carrier. This resulted in over 700 cancelled flights during one of the busiest holiday periods of the year.
A Business Impact Assessment identifies all discrete, self-contained systems, such as your email platform, and assesses how the loss of each would affect the business as a whole.
Systems whose loss or compromise would immediately stop the company’s ability to function are deemed to be ‘Critical Systems’ and must be accounted for in the Business Continuity Plan.
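As a rough illustration of that classification step, the sketch below flags systems as critical when the business can tolerate little or no outage. The field `tolerable_outage_hours`, the one-hour threshold and the example inventory are all hypothetical assumptions made for this sketch; a real assessment would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class SystemAssessment:
    name: str
    # Hypothetical metric: hours the business can keep functioning without this system.
    tolerable_outage_hours: float


def is_critical(system: SystemAssessment, threshold_hours: float = 1.0) -> bool:
    """Treat a system as critical if its loss stops the business almost immediately."""
    return system.tolerable_outage_hours <= threshold_hours


def critical_systems(systems, threshold_hours: float = 1.0):
    """Return the names of systems that must be covered by the continuity plan."""
    return [s.name for s in systems if is_critical(s, threshold_hours)]


# Illustrative inventory only; the tolerances are invented for the example.
inventory = [
    SystemAssessment("seat reservations", 0.0),
    SystemAssessment("email platform", 8.0),
    SystemAssessment("intranet wiki", 72.0),
]

print(critical_systems(inventory))
```

The threshold is a business decision rather than a technical one, which is why it is passed in as a parameter instead of being hard-coded.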
A Business Continuity Plan (BCP) is a set of documented procedures that take place during a crisis to enable your company to maintain a basic level of operations.
BCPs vary in complexity, but at its most basic a BCP will identify the systems critical to running your business and outline how these systems will maintain ‘Continuity’ if, for example, your site loses power.
A Failover Plan is part of the BCP. It contains the tested procedures that should be followed when a critical system needs to be failed over to another site.
Systems identified as ‘Critical’ during the Business Impact Assessment should be designed to be Highly Available with failover being relatively seamless to minimise outages.
An Uninterruptible Power Supply (UPS) is essentially a large battery.
A UPS charges from the mains while operations are normal. If mains power is cut, it will power the systems it is connected to for a short period, to allow backup power to be switched on or the systems to be properly shut down and/or failed over.
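To make the ‘short period’ concrete, here is a minimal back-of-the-envelope sketch that estimates how long a UPS could carry a load and whether that is long enough to cover a generator's start-up time. The formula is a simplification (usable stored energy divided by load), and the function names, the 0.9 efficiency figure and the example numbers are assumptions for illustration, not figures from BA's datacentres.

```python
def ups_runtime_minutes(battery_wh: float, load_w: float, efficiency: float = 0.9) -> float:
    """Rough runtime estimate: usable stored energy divided by the load drawn."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * efficiency / load_w * 60


def bridges_generator_start(
    battery_wh: float,
    load_w: float,
    generator_start_minutes: float,
    efficiency: float = 0.9,
) -> bool:
    """True if the UPS can hold the load until backup generation is online."""
    return ups_runtime_minutes(battery_wh, load_w, efficiency) >= generator_start_minutes


# Hypothetical example: a 10 kWh battery bank feeding a 20 kW load,
# with generators assumed to need 2 minutes to come online.
runtime = ups_runtime_minutes(battery_wh=10_000, load_w=20_000)
print(f"Estimated runtime: {runtime:.0f} minutes")
print("Covers generator start:", bridges_generator_start(10_000, 20_000, 2))
```

Real batteries deliver less than their rated capacity under heavy load and as they age, so an estimate like this should be treated as an upper bound and confirmed by testing.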
According to BA, a ‘power surge’ occurred at British Airways’ primary datacentre on the morning of 27th May 2017.
Sources inside BA have stated that the ‘power surge’ occurred at approximately 0930 BST.
British Airways has two datacentres in London which house the systems used for ticketing and seat reservations.
Boadicea House and Comet House/Cranebank are datacentres that are said to comprise 500 cabinets in six halls, all within a mile of the eastern end of Heathrow’s two runways.
The National Grid and Scottish & Southern Electricity Network (the local electricity supplier) have confirmed to news outlets that the power surge did not originate on their systems.
This does not, however, rule out a power surge originating inside Boadicea House itself.
The power surge caused the Boadicea House systems to go down for approximately 15 minutes.
Normally, if power is lost to a datacentre, the UPS will be able to power the systems for long enough to bring the backup generators online and maintain continuity of the systems. That did not occur in this case.
There was no automatic failover to the other HA site.
According to information given to the online news site theregister.co.uk, the datacentres normally run in what is called active/active mode, which means that both datacentres are on and able to provide functionality. This should have resulted in the second datacentre picking up the slack almost immediately when the first went down.
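The idea behind active/active can be sketched in a few lines: requests are spread across every site that is currently healthy, so when one site drops out the remaining site simply receives all of the traffic. The site names, the `health` dictionary and the round-robin choice are assumptions for illustration only and say nothing about how BA's systems are actually built.

```python
from typing import Dict, List


def healthy_sites(health: Dict[str, bool]) -> List[str]:
    """Return the sites currently reporting as healthy, in their configured order."""
    return [site for site, ok in health.items() if ok]


def route(request_id: int, health: Dict[str, bool]) -> str:
    """Pick a site for a request by round-robin over the healthy sites."""
    sites = healthy_sites(health)
    if not sites:
        raise RuntimeError("no healthy site available")
    return sites[request_id % len(sites)]


# Hypothetical site names. Normal running: both sites share the load.
health = {"site_a": True, "site_b": True}
print([route(i, health) for i in range(4)])

# One site loses power: every request now lands on the surviving site.
health["site_a"] = False
print([route(i, health) for i in range(4)])
```

The sketch only works if the health information is accurate and the surviving site has the capacity and data to take the full load, which is exactly what a tested Failover Plan is meant to prove.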
Bill Francis, Head of Group IT at BA’s owner International Airlines Group (IAG), has been quoted as saying that a core UPS in the datacentre was overridden on Saturday morning, causing physical damage to the power systems.
Bill goes on to say…
“This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries…After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system and exacerbated the problem”
The UK’s Daily Mail newspaper has pointed the finger at a contractor working for CBRE Global Workspace Solutions, which manages the datacentre on behalf of BA.
BA has not yet confirmed or denied this, but Alex Cruz, BA’s Chief Executive, seemingly chose his words carefully when asked whether outsourcing was the cause of this issue, insisting that the infrastructure was maintained by ‘local’ people.
While this issue may not fall into the classic news-hype bucket of a Security Issue, proper Disaster Recovery and Business Continuity Planning have been the bedrock of the security professional’s work for decades. Traditionally, however, they have had little attention from the board, who too often regard them as the domain of the ‘IT guys’ or something that only large companies need worry about.