1. Introduction

This root cause analysis document provides a brief description of the incident, a summary of the events, a discussion of the causal factors, and recommendations for preventing recurrence.

Key terms used in this document are defined below:

Root Cause:     

The underlying reason for the occurrence of a problem, usually made up of a number of causal factors.

Causal Factors:

Individual events contributing to the root cause.

Contributory Factors:            

Components (e.g. people, processes, configurations) which contributed to the impact of a problem but were not a result of, or linked to, its root cause.

Consequential Issues:   

Issues or incidents arising as a direct result of the problem.

Other Observations:

Lessons learnt during the incident or its resolution.

2. Incident Summary

Service Failure

On Monday, 21st June at 13:35, we were alerted to a fault on our primary fibre circuit from Dunsfold to London. The network is designed with redundancy to protect against such a failure, and in normal circumstances the circuit from Dunsfold to Reading provides this redundancy. In this instance, a previously undetected fault was also present on the secondary circuit, which rendered that circuit unusable as well. The two separate failures resulted in a network outage for the duration of the incident.

Additionally, the multiple failures impacted our ability to invoke DR procedures for our website and communications platforms. This resulted in our ticketing system and phone lines being offline for the duration of the incident.

Workaround implementation / Immediate actions       

Engineers on site were able to ascertain that the infrastructure at the Dunsfold datacentre was functional and that the fault was upstream of the Dunsfold external routers. Both faults were reported to our vendors and their senior escalation teams.

BT Openreach engineers attended both the Dunsfold and London sites at around 18:50 and identified that the fault with the Dunsfold to London line was a severed cable in the Clapham area. This enabled them to start work on rerouting our circuit.

BT Openreach completed the rerouting of the Dunsfold to London fibre at 00:25, 22nd June.

The fault with the secondary circuit from Dunsfold to Reading was resolved by 11:35, 22nd June. This restored redundancy to the network architecture, thereby resuming normal service.

3. Root Cause Analysis (RCA)

This section looks at what truly caused the incident and what contributed to the impact / duration of the incident.

Root Cause

The root cause of this incident has been identified as two separate faults on the fibre circuits providing network connectivity to the Dunsfold datacentre. These failures resulted in the loss of both the primary network connection and the redundancy that was in place.

Causal Factors

The fault on the Dunsfold to London fibre circuit has been identified as a fibre break in the Clapham area of London.

The fault on the secondary circuit from Dunsfold to Reading has been identified as fibre degradation a few miles from the Dunsfold datacentre. The cause of this fibre degradation is currently unknown.

Contributory Factors

We were unaware that the secondary fibre circuit to Reading had failed until it was required on Monday. While active, this circuit has not been used for day-to-day traffic since our migration of customer servers from the Amito datacentre. It was understood that this circuit was provided on a managed basis by Focus Group; however, investigation after the incident found that this was not correct. Focus Group were not providing monitoring services.

No testing schedule was in place for the secondary circuit to Reading, as it had previously been used continuously for day-to-day network traffic.

Consequential Issues

The redundancy in place for our website is reliant on at least one of the fibre circuits being available. Technical limitations meant it was not possible to reroute traffic to our second site at Maidenhead.

Our telephone system resides at our Dunsfold site. Again, this is reliant on at least one of the fibre circuits being functional.

4. Recommended Actions

| Area | Description | Status | Action Plan |
| --- | --- | --- | --- |
| Root Cause | Establish the time that fault developed on the secondary circuit | Underway | We are working closely with our vendor to establish when this fault developed |
| Root Cause | Establish the cause of the fibre degradation of the circuit from Dunsfold to Reading | Underway | We are working closely with our vendor to establish the cause of this fault |
| Causal Factors | Implement internal monitoring on the Dunsfold to Reading Fibre circuit | Complete | Additional monitoring has been put in place to alert to any degradation of service. |
| Contributing Factors | Reconfigure network routing so that a proportion of network traffic is passed over the Dunsfold to Reading fibre circuit | Complete | We are now passing traffic across this circuit |
| Consequential Issues | Migrate management systems to group systems to provide resilience in case of incident | Underway | A project to carry out this work was underway prior to this incident |
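
As an illustration only, the kind of degradation check described in the monitoring action above could be sketched as follows. This is not the monitoring that was deployed: the `ProbeResult` type, the `assess_circuit` function and both thresholds are hypothetical, and how the probes are collected (for example, periodic pings to a far-end address over the Dunsfold to Reading circuit) is assumed to be handled elsewhere.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProbeResult:
    """One probe sent across a circuit; rtt_ms is None if the probe was lost."""
    rtt_ms: Optional[float]


def assess_circuit(probes: Sequence[ProbeResult],
                   max_loss_ratio: float = 0.05,
                   max_avg_rtt_ms: float = 20.0) -> str:
    """Classify a circuit as 'ok', 'degraded' or 'down' from recent probes.

    Both thresholds are hypothetical placeholders, not values from the incident.
    """
    if not probes:
        # No data is treated as a failure so that a silent monitor still alerts.
        return "down"
    replies = [p.rtt_ms for p in probes if p.rtt_ms is not None]
    if not replies:
        return "down"
    loss_ratio = 1 - len(replies) / len(probes)
    avg_rtt_ms = sum(replies) / len(replies)
    if loss_ratio > max_loss_ratio or avg_rtt_ms > max_avg_rtt_ms:
        return "degraded"
    return "ok"
```

Running a check of this kind against the standby circuit on a schedule, rather than only against circuits carrying live traffic, is what would surface a fault on an otherwise idle link before it is needed.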

5. Likelihood of Recurrence

Risk Matrix

| Likelihood \ Impact | Minor (1) | Moderate (2) | Major (3) | Critical (4) |
| --- | --- | --- | --- | --- |
| High (3) | 3 | 6 | 9 | 12 |
| Moderate (2) | 2 | 4 | 6 | 8 |
| Low (1) | 1 | 2 | 3 | 4 |

Assessment

The likelihood of recurrence has been categorised as between 0% and 25%. This is due to the following mitigating factors:

  • Redundancy is now back in place following the repair of the secondary circuit between Dunsfold and Reading
  • The secondary circuit is now being actively used for day-to-day network traffic, which increases the visibility of any fault
  • Additional monitoring has been put in place and is monitored 24/7

The impact of this incident has been categorised as Critical.

The overall risk score has been calculated as 4 (Low).
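
The score can be reproduced from the risk matrix, where every cell equals the likelihood score multiplied by the impact score. The short sketch below shows that arithmetic. The function and dictionary names are illustrative, and treating the 0-25% likelihood band as "Low" (score 1) is an assumption inferred from the stated result.

```python
# Scores taken from the row and column headings of the risk matrix.
LIKELIHOOD_SCORES = {"Low": 1, "Moderate": 2, "High": 3}
IMPACT_SCORES = {"Minor": 1, "Moderate": 2, "Major": 3, "Critical": 4}


def risk_score(likelihood: str, impact: str) -> int:
    """Return the matrix cell for a likelihood/impact pair (their product)."""
    return LIKELIHOOD_SCORES[likelihood] * IMPACT_SCORES[impact]


# Assumption: a 0-25% likelihood of recurrence corresponds to the "Low" row.
print(risk_score("Low", "Critical"))  # 4
```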



Friday, July 16, 2021
