---
title: Incident response
---

So, everything is on fire. What do you do?

## Don't panic

Take a deep breath. Panicked sysadmins make mistakes. Just relax! An outage is
just a bug that needs to be fixed sooner than most.

If you're feeling overwhelmed, stop and take a break. Brew a cup of coffee. Top
up your water bottle.

## Urgency levels

We have three levels of urgency:

- **urgent**: requires an immediate response
- **important**: requires a response within 24 hours
- **interesting**: does not require a timely response

**urgent** is used, for example, when a service is down. **important** is used
when there is an imminent (but not immediate) concern, such as when a disk
is almost full. **important** incidents require a *response* within 24 hours,
but that response may involve making a plan which takes more than 24 hours to
fully resolve.

Alarms from metrics.sr.ht carry an urgency level which automatically sets the
priority of the corresponding incident.
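
As a minimal sketch, assuming metrics.sr.ht is a Prometheus-style setup, an
alerting rule might carry its urgency as a label. The rule name, metric names,
and threshold below are illustrative, not the actual production rules:

```yaml
groups:
  - name: storage
    rules:
      - alert: DiskAlmostFull
        # Hypothetical node_exporter expression: fire when a filesystem has
        # been below 10% free space for 15 minutes straight.
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          urgency: important
        annotations:
          summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} is almost full"
```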

## Incident response process

The goals of an incident response are:

1. Understand the problem
2. Solve the problem
3. Prevent it from happening again

In the case of an **urgent** incident, step 3 can be handled at **important**
priority following the restoration of service.

## Point admin

If the problem is in your domain, notify #sr.ht.ops that you'll be taking point
and start investigating. The point person is the only one who can take actions
which change the state of the system, such as:

- Updating status.sr.ht
- Restarting services
- SQL mutations
- Editing config files
- etc

So long as the deployment pipeline is still working (including the availability
of an authorized person to merge into the affected repository), prefer to deploy
new releases to solve problems rather than hotfixing.

## Peanut gallery

The term is used lovingly, I promise. Onlookers, be they other SourceHut
sysadmins or members of the SourceHut community, can take on a support role for
the point person. Their job is to do independent research and offer useful
information and suggestions to the point person based on their findings. This
includes:

- Reading service logs
- Reading relevant source code
- Reading documentation
- Executing read-only SQL queries (e.g. the sketch after this list)
- Acting as a sounding board for the point person
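
For instance, a helper might check for long-running queries. This is a minimal
sketch assuming the services are backed by PostgreSQL; pg_stat_activity is a
standard PostgreSQL view, and the truncation to 80 characters is only to keep
the output readable:

```sql
-- Read-only: list the longest-running active queries.
SELECT pid,
       now() - query_start AS duration,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC
LIMIT 10;
```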

The point person is in charge. They are not obligated to read or respond to
your messages while they are busy dealing with the problem.

The point person may delegate any of their tasks to a member of the peanut
gallery if they see fit.

## After the incident

A brief explanation of the problem and its solution is generally welcome on
status.sr.ht, to keep users in the loop. Aim for honesty and transparency, and
don't shy away from technical details.

Generally, everyone who needs to be in the loop will be in the loop during an
incident. But, if not, make sure that any stakeholders understand what happened,
how it was resolved, and what mitigations were established to prevent it from
happening again.