Tracking Production Disruptions
stack by striatic is licensed under CC BY
Let us first iron out what constitutes a disruption:
Any unexpected degradation or interruption of a service that in any way affects users’ ability to use the service.
When operating a production service, it is simply a matter of time until a disruption happens. No matter the safeguards you’ve put in place, there are always unexpected scenarios. There are many ways to prepare and to ensure that your services are as robust as possible, but that is a task left to you, the reader. In this post I’ll be talking about what to do after you’ve handled a production disruption.
A disruption can be a chaotic event, lasting from a few minutes to multiple hours. The main purpose of tracking these disruptions is to have a better understanding of:
I highly encourage sharing the learnings from all aspects of a disruption with the rest of the team and organization. A tracked disruption can be used in a post-mortem. The ultimate goal is that educating everyone will lead to more resilient services and better processes during the inevitable future disruptions.
In addition, with tracked disruptions, it becomes easy to perform basic analysis of the frequency and length of disruptions.
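As a rough illustration of that kind of basic analysis, here is a minimal Python sketch. It assumes each tracked disruption records at least a start time and a resolution time; the `Disruption` class, its field names, and the sample data are hypothetical and not taken from any real tracker.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Disruption:
    # Hypothetical record shape: only the start and end times matter here.
    title: str
    started_at: datetime
    resolved_at: datetime

    @property
    def duration(self) -> timedelta:
        return self.resolved_at - self.started_at


def disruptions_per_month(disruptions):
    """Count disruptions by the (year, month) in which they started."""
    return Counter((d.started_at.year, d.started_at.month) for d in disruptions)


def mean_duration(disruptions):
    """Average length of a disruption; zero for an empty list."""
    if not disruptions:
        return timedelta(0)
    total = sum((d.duration for d in disruptions), timedelta(0))
    return total / len(disruptions)


# Made-up sample data, for illustration only.
disruptions = [
    Disruption("API latency spike",
               datetime(2017, 1, 5, 14, 0), datetime(2017, 1, 5, 14, 30)),
    Disruption("Push notifications delayed",
               datetime(2017, 1, 20, 9, 0), datetime(2017, 1, 20, 10, 30)),
    Disruption("Database failover",
               datetime(2017, 2, 3, 22, 15), datetime(2017, 2, 3, 22, 45)),
]

print(disruptions_per_month(disruptions))
print(mean_duration(disruptions))
```

Even two numbers like these, reviewed every month or quarter, make it obvious whether disruptions are becoming more frequent or taking longer to resolve.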
Without any tracking, it is a huge failure when the same disruption happens over and over.
Note that what follows comes from my experience working at theScore, along with reading various sources online. Let us touch on what to track, where to track, and how to track.
The more information you have, the better you’ll understand the disruption if you weren’t immediately involved in it. Additionally, in most cases multiple people are involved, so not everyone has the same exposure to the entire event. The following is a list and description of what we track at theScore for disruptions.
To be honest, none of this matters if you aren’t tracking disruptions in any capacity, so just do it! Even if you have all your disruptions in a single text file, it’s better than nothing. Let’s get realistic though: we can do much better. For starters, you could opt for a spreadsheet or a basic database, maybe even a heavier integrated service like VictorOps.
At theScore, we use a GitHub repository where each issue is a disruption report. We record the aforementioned data points in each issue’s description. We decided to use GitHub as it:
For us, our action items become GitHub tasks with a mentioned individual or team. Issues are to be closed when all action items are resolved, and after sufficient discussion/retrospection has been carried out.
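To make the issue-per-disruption idea concrete, here is a small Python sketch that builds a Markdown issue body in which each action item is a GitHub task-list entry with a mention. The section headings, function names, and sample values are assumptions made for illustration; they are not theScore’s actual template.

```python
def render_report(summary, timeline, action_items):
    """Build a Markdown issue body for a disruption report.

    timeline is a list of (time, event) pairs.
    action_items is a list of (task, assignee) pairs; each one becomes a
    task-list entry that @mentions the person or team.
    The section headings are hypothetical, not a prescribed template.
    """
    lines = ["## Summary", summary, "", "## Timeline"]
    lines += [f"- {when}: {what}" for when, what in timeline]
    lines += ["", "## Action items"]
    lines += [f"- [ ] {task} (@{assignee})" for task, assignee in action_items]
    return "\n".join(lines)


def open_action_items(body):
    """Return the task-list lines that are still unchecked."""
    return [line for line in body.splitlines() if line.startswith("- [ ] ")]


# Made-up example values.
body = render_report(
    summary="Elevated API error rate for roughly 30 minutes.",
    timeline=[("14:00", "Alert fired"), ("14:30", "Errors back to baseline")],
    action_items=[("Add an alert on queue depth", "ops-team")],
)
print(body)
print(len(open_action_items(body)), "action item(s) still open")
```

A check like `open_action_items` mirrors the closing rule above: the issue stays open until no unchecked items remain and the discussion has run its course.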
A disruption should be tracked within a few days of it happening. Waiting too long will result in fuzzy or shallow details. There isn’t much that is special about building up the report, although from personal experience it is good for people to jot down notes immediately after a disruption. I also find it effective to create a dedicated channel or chatroom in your team’s messaging application when dealing with a disruption, as all the communication is then located in one place.
PagerDuty has some fantastic material on a post-mortem process and even a post-mortem template. They also document defined roles during a disruption, in which the Scribe is a critical role for tracking purposes.
A disruption report can be a single person’s responsibility, although it usually helps to reach out to the people involved for clarifying details.
With all these disruptions tracked, what can we do with them? Hopefully, all the action items and retrospection have led to large gains. Additionally, there is a lot of value in the data itself; it is simply a matter of finding the signals you want to pay attention to. For example:
Find aspects that your organization cares about and measure them.
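One signal that is cheap to measure is recurrence. The Python sketch below assumes each report carries a list of labels (for example, the affected component or the kind of cause); the `labels` key, the label values, and the threshold are all hypothetical.

```python
from collections import Counter


def recurring_labels(reports, minimum=2):
    """Return (label, count) pairs for labels seen in at least `minimum` reports."""
    counts = Counter(
        label
        for report in reports
        for label in set(report["labels"])  # count each label once per report
    )
    return [(label, n) for label, n in counts.most_common() if n >= minimum]


# Made-up reports, for illustration only.
reports = [
    {"title": "Slow scores feed", "labels": ["database", "deploy"]},
    {"title": "Login errors", "labels": ["database"]},
    {"title": "Images not loading", "labels": ["cdn"]},
    {"title": "Stale data after release", "labels": ["deploy", "database"]},
]

print(recurring_labels(reports))
```

Labels that keep showing up are a hint that earlier action items did not address the underlying problem.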
Never stop, and periodically revisit the process to make adjustments. What works for one team might not work for others. Even though tracking a disruption can be seen as additional work, it will pay for itself in time.