Marcin Bunsch

Code Mechanic at Synthesis

Incident management in tech startups

Working in tech is great, until you get called at 3am to fight a fire because something in the tech you built just broke. These situations are stressful, tiring and in general - not fun. Ideally, they would never happen, but that’s not how the world works.

Over the course of my career, I’ve been in more than a few of these crisis situations, and I’ve learned a few things about how to prepare and deal with them. I kept building up notes, and I thought I’d share them with you.

I’m breaking up this process into 3 parts:

  1. Preparation
  2. Response
  3. Recovery

We’re going to go over each of these parts, and I’ll give you some examples of tools that you can use to help you in each of these phases. I’m also going to share learnings from my own experience and tips on how to improve the process.

Preparation

This phase is the “before it happens” part. It’s the part where you prepare for the worst, and hope for the best.

Knowledge

This one is straightforward - you need to have knowledge of your code as well as the technologies that you’re using. What happens when you don’t? You’re gonna be in a “crash course” situation - you need to learn how something works in a very short time, under pressure. This is not a good situation to be in. By having good knowledge and understanding of the system, you’ll be able to figure out what’s going on much faster.

Baseline

In order to know what’s wrong, we need to know what’s right. This is where baselines come in. Baselines are a set of reference metrics that you can compare the current state of the system against. This is a great way to know if something is wrong, and if so - what’s wrong. In order to track baselines, you need to capture metrics.
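
To make that concrete, here’s a minimal sketch of capturing a few such metrics with the Python prometheus_client library. The metric names, the handler and the port are just examples of what you might track, not a recommendation:

```python
# A sketch of capturing baseline metrics (request count, error count, latency)
# with prometheus_client. The metric names and port are illustrative only.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
ERRORS = Counter("app_errors_total", "Requests that raised an error")
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")


def handle_request(work):
    """Wrap a request handler so every call feeds the baseline metrics."""
    REQUESTS.inc()
    with LATENCY.time():  # records how long the handler took
        try:
            return work()
        except Exception:
            ERRORS.inc()
            raise


if __name__ == "__main__":
    # Expose /metrics so your monitoring stack can scrape it and build the baseline.
    start_http_server(8000)
    while True:
        handle_request(lambda: time.sleep(0.05))
```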

It’s up to you to decide what metrics to track, but here’s a few examples of tools that capture and present metrics well:

Detection

Once we know what the “good values” are, we need to know when they go into “bad value” territory. To do that, we need alerts. Some examples:

Now alerts can be broken into a few levels that warrant attention:

One extremely important thing about alerts - make sure your system does not cry wolf.

If you have too many alerts, or alerts that are too flaky, your team will start to ignore them. It’s a natural reaction - but it leads to problems.

You only want alerts that are actionable - if one fires, it means the situation must be corrected. Make sure all alerts are delivered both to a location everyone can access (like a Slack channel or email group) and to the incident handler.
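
As an illustration, here’s a rough sketch of such an actionable alert in Python. The Slack webhook URL, the threshold and the page_on_call helper are placeholders for whatever channel and paging tool you actually use:

```python
# A sketch of an actionable alert: it only fires when a metric crosses a
# threshold, and it is delivered both to a shared channel and to the person
# on call. The webhook URL, threshold and paging helper are placeholders.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder URL
ERROR_RATE_THRESHOLD = 0.05  # 5% of requests failing - an example "bad value"


def page_on_call(message: str) -> None:
    """Placeholder for your paging tool (PagerDuty, BetterUptime, ...)."""
    print(f"PAGE: {message}")


def notify(message: str) -> None:
    # Everyone can see it in the shared channel...
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)
    # ...and the incident handler gets paged directly.
    page_on_call(message)


def check_error_rate(errors: int, total_requests: int) -> None:
    """Fire only when the situation actually needs to be corrected."""
    rate = errors / max(total_requests, 1)
    if rate > ERROR_RATE_THRESHOLD:
        notify(f"Error rate {rate:.1%} is above the {ERROR_RATE_THRESHOLD:.0%} threshold")
```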

Paging

One important thing in incident management is having a clear process, and that process starts with a specific person - the First Responder - whom we’ll cover in more detail in the next section.

To formalize this process, establish an efficient on-call schedule, using PagerDuty or BetterUptime.

A good on-call schedule depends on your team size and location. A small team can have a different person on call each full day, while a larger team can use a rotation. If you have a distributed team, break up the day into manageable shifts that are not too long, ranging from 4 to 6 hours each, then align these shifts with the working hours of your team members. In such a case you might end up with nobody having to be on call at night.

For weekend shifts, team members should rotate so that they occur less frequently. To manage weekend shifts, either one person can take the whole weekend every X number of weekends or two people can take a whole day every X number of weekends. It is generally better to have an on-call weekend once every X months rather than having a shift every weekend.
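
To make the mechanics concrete, here’s a small sketch of such a schedule in Python. The roster, the anchor date and the daily/weekend split are made-up examples, not a recommendation:

```python
# A sketch of a simple rotation: weekdays rotate person by person, one full
# day each, while weekends are a separate, less frequent rotation where one
# person covers the whole weekend. The roster and anchor date are made up.
from datetime import date, timedelta

TEAM = ["alice", "bob", "carol", "dave"]  # hypothetical roster
EPOCH = date(2024, 1, 1)                  # a Monday used to anchor the rotation


def on_call(day: date) -> str:
    """Return who is on call for a given date."""
    if day.weekday() >= 5:  # Saturday or Sunday
        # One weekend shift every len(TEAM) weekends per person.
        weekend_index = (day - EPOCH).days // 7
        return TEAM[weekend_index % len(TEAM)]
    # Weekdays: a different person each full day.
    return TEAM[(day - EPOCH).days % len(TEAM)]


if __name__ == "__main__":
    today = date.today()
    for offset in range(14):
        d = today + timedelta(days=offset)
        print(d.isoformat(), d.strftime("%a"), on_call(d))
```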

Response

There it is - you have everything in place, and now you’re paged. What do you do?

First off, the notification must be acked by the person on call as soon as possible, or escalated to another person. A human must confirm that they are “on it”.

This person is the First Responder. Their job is to estimate the severity of the incident, and to determine if it’s something they can handle themselves, or if it requires escalation. The First Responder performs the triage, which means answering the following questions (a small code sketch of the severity levels follows the list):

  1. How serious is the issue?
  2. Which level is it?

    • Degradation: The system is available and correct, but slower or with delays
    • Data corruption: The system is available and performant, but processing data incorrectly
    • Outage: The system is unavailable - so not performant and not processing data
  3. Can the current person/team on call handle it?
  4. Do they need to call support?
    • Are you sure you need to call? Would you like to be called at 1am? Oh, right, you just were. Would you like to be called on your time off?
    • If yes, how? Do you have a list of phone numbers?
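
If it helps, here’s a minimal sketch of those severity levels as code - the enum mirrors the three levels above, while the escalation and communication rules are made-up illustrations of these questions, not a fixed policy:

```python
# A sketch of the three severity levels and the triage questions above.
# The rules below are illustrative defaults, not a prescription.
from enum import Enum


class Severity(Enum):
    DEGRADATION = 1      # available and correct, but slower or delayed
    DATA_CORRUPTION = 2  # available and performant, but processing data incorrectly
    OUTAGE = 3           # unavailable - not performant, not processing data


def needs_escalation(severity: Severity, responder_can_handle: bool) -> bool:
    """Escalate when the on-call person can't handle it alone, or it's a full outage."""
    return not responder_can_handle or severity is Severity.OUTAGE


def needs_external_comms(severity: Severity, caught_early: bool) -> bool:
    """A degradation caught early can stay internal; corruption or outage goes public."""
    if severity is Severity.DEGRADATION and caught_early:
        return False
    return True
```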

The second part is to start communicating, on two levels:

  1. Within the company
    • Notify that there is an issue and you’re working on it
    • Support now knows what’s going on and can handle customer expectations
    • Any tech teams not involved in the incident now know to back off to not make the issue worse
  2. Outside the company
    • post to Twitter/send emails/update StatusPage
    • acknowledge that there is an issue
      • Do you have to?
        • Depends - if it’s a degradation and you caught it early, you can get away with staying quiet
        • If it’s a data corruption or outage, you have to - it directly affects your customers

Now it’s time to fight back. First of all, pick the person in command. That’s the Incident Handler. Their job is to coordinate the response and recovery efforts. They are the person who will be in charge of the incident until it’s resolved. The First Responder can be the Incident Handler, but it’s not a must. What is important is that it’s crystal clear who is in charge.

Once it’s clear who is in charge, the Incident Handler builds the Attack Plan - that means figuring out what’s wrong, coming up with a fix, and implementing it.

First, open up a single communication channel - Slack, Google Hangouts, Zoom, whatever you use.

Second, decide who’s doing what - the Incident Handler coordinates that so there is no double work.

Third, start the diagnosis.

Diagnosis

This is where we try to figure out what went wrong. The First Responder should already have done some of the work, but the cause might not be known yet. Your best bet is to be methodical - start where it fails and keep moving down the stack. Even though it might feel more time-consuming, this way you actually save time - by not jumping around.

This is where the knowledge and baselines come in handy. Graphs will show you which part of the system is behaving incorrectly.

You know where the alert happened - start there.

Increased response time? Look at top transactions

Increased error rates? Look at top errors

Increased request queueing? Look at slowest requests

Along the way, make sure you keep communicating. The support team will appreciate updates so they can keep customers informed.

The goal of this step is to understand what’s wrong. Next step is the fix.

Fix

You found the problem. If it’s a simple fix - great, you’re done. And yes, a rollback is a simple fix.

What is critical - never push multiple fixes at once. You won’t know which one fixed the problem - or maybe one fixed it and the other broke it again (or even worse - broke something else).

However, many times the problem is larger and you need to decide whether to hotfix or do a proper fix. Hotfixes are good, because they stabilize the situation. If there is a possible hotfix, apply it - this will buy you time to focus on a proper fix.

If you can’t apply the proper fix now (because it’s too complex, or you don’t have the time), then you need to make sure you schedule the proper fix - write down what needs to be done and agree it will be done within 24/48/72 hours.

Post-fix

Job done, off to the pub? Not quite. Now you get to spend time watching graphs.

Your fix may have worked, but maybe there was another issue hiding underneath. Or maybe the fix did not work. Or maybe your fix had another bug. You need to ensure that the system is stable, which means you need to spend extra time just watching the system after the issue is resolved. However, you also want to communicate this step - support can start reaching out to customers, and you can update your status page.
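
One way to make the graph-watching a bit more systematic is a soak check like the sketch below. Here get_error_rate, the baseline and the soak window are placeholders for your own metrics and thresholds:

```python
# A sketch of a post-fix "soak check": keep watching a key metric for a while
# after the fix and shout if it regresses. get_error_rate, the baseline and
# the soak window are placeholders for your own metrics and thresholds.
import time

BASELINE_ERROR_RATE = 0.01  # 1% - an example "good value"
SOAK_MINUTES = 30           # how long to keep watching after the fix


def get_error_rate() -> float:
    """Placeholder: query your metrics backend for the current error rate."""
    return 0.0


def soak_check() -> bool:
    """Return True if the system stayed healthy for the whole soak period."""
    deadline = time.time() + SOAK_MINUTES * 60
    while time.time() < deadline:
        rate = get_error_rate()
        if rate > BASELINE_ERROR_RATE:
            print(f"Regression: error rate {rate:.1%} above baseline - reopen the incident")
            return False
        time.sleep(60)  # check once a minute
    print("System looks stable - safe to announce the incident as resolved")
    return True
```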

When you’re confident, you can communicate to everyone that the issue is resolved.

High fives are in order, you survived and handled the crisis.

Recovery

Now that the incident is over, it’s time to learn from it. The Incident Handler should write a post-mortem, which is a document that describes the incident, what went wrong, and what can be done to prevent it from happening again.

Post-mortem

It does not need to be a book. It should be short and to the point. It should be written as soon as possible after the incident, while the details are still fresh in your mind. It should include the following sections:

  1. Incident description
    • Duration
    • How it manifested
    • Who was affected
  2. Root cause analysis
    • What really happened
    • What caused it
  3. How will we prevent it in the future?
    • Action Items
      • Code changes
      • Process changes
      • Infrastructure changes
    • All action items need to be actioned
      • Trello Cards/Jira Tasks/Whatever
      • They need to be put into whatever process you have

The Team

Make sure you appreciate what the team just did. Praise the responders.

Someone worked through the night? Next day off. Worked over the weekend? Monday off.

How about bonuses for people on call? Especially when their on-calls are in non-working hours. You know, paid overtime.

Things to watch out for

Here are a few things that don’t necessarily fit into the above, but are important to watch out for.

Bus factor

How many people need to be hit by a bus for nobody to know how something works?

In general, you want a bus factor of more than 1.

That means people can go on vacation and the team will still be able to handle incidents.

Heroics

If you have someone involved in every firefight, you have a problem. You need to make sure that the team is not relying on a single person - you want the whole team to grow and be able to handle incidents.

Burnout

An issue a day will depress any squad.

Half-assed fixes

The fix solved the problem in the heat of the moment, but it’s not finished and can blow up at any moment.