-rw-r--r-- ops/incident.md 88
-rw-r--r-- ops/index.md 1
2 files changed, 89 insertions, 0 deletions
diff --git a/ops/incident.md b/ops/incident.md
new file mode 100644
index 0000000..1f52817
--- /dev/null
+++ b/ops/incident.md
@@ -0,0 +1,88 @@
+---
+title: Incident response
+---
+
+So, everything is on fire. What do you do?
+
+## Don't panic
+
+Take a deep breath. Panicked sysadmins make mistakes. Just relax! An outage is
+just a bug that needs to be fixed sooner than most.
+
+If you're feeling overwhelmed, stop and take a break. Brew a cup of coffee. Top
+up your water bottle.
+
+## Urgency levels
+
+We have three levels of urgency:
+
+- **urgent**: requires an immediate response
+- **important**: requires a response within 24 hours
+- **interesting**: does not require a timely response
+
+**urgent** is used, for example, when a service is down. **important** is used
+when there is an imminent (but not immediate) concern, such as when a disk
+is almost full. **important** incidents require a *response* within 24 hours,
+but that response may involve making a plan which takes more than 24 hours to
+fully resolve.
+
+Alarms from metrics.sr.ht have an urgency level associated with them which
+automatically prioritizes the incident accordingly.
+
+## Incident response process
+
+The goals of an incident response are:
+
+1. Understand the problem
+1. Solve the problem
+1. Prevent it from happening again
+
+In the case of an **urgent** incident, step 3 can be treated as **important**
+priority following the restoration of service.
+
+## Point admin
+
+If the problem is in your domain, notify #sr.ht.ops that you'll be taking point
+and start investigating. The point person is the only one who can take actions
+which change the state of the system, such as:
+
+- Updating status.sr.ht
+- Restarting services
+- SQL mutations
+- Editing config files
+- etc
+
+So long as the deployment pipeline is still working (including the availability
+of an authorized person to merge into the affected repository), prefer to deploy
+new releases to solve problems rather than hotfixing.
+
+## Peanut gallery
+
+The term is used lovingly, I promise. Onlookers can play a support role for the
+point person, be they other SourceHut sysadmins or members of the SourceHut
+community. Their role is doing independent research and offering useful
+information and suggestions to the point person based on their findings. This
+includes:
+
+- Reading service logs
+- Reading relevant source code
+- Reading documentation
+- Executing read-only SQL queries
+- Acting as a sounding board for the point person
+
+The point person is in charge. They also do not have to listen or respond to
+your messages while they are busy dealing with the problem.
+
+The point person may delegate any of their tasks to a member of the peanut
+gallery if they see fit.
+
+## After the incident
+
+A brief explanation of the problem and its solution is generally welcome on
+status.sr.ht, to keep users in the loop. Aim for honesty and transparency, and
+don't shy away from technical details.
+
+Generally, everyone who needs to be in the loop will be in the loop during an
+incident. But, if not, make sure that any stakeholders understand what happened,
+how it was resolved, and what mitigations were established to prevent it from
+happening again.
diff --git a/ops/index.md b/ops/index.md
index 62cbe00..8e01320 100644
--- a/ops/index.md
+++ b/ops/index.md
@@ -16,6 +16,7 @@ Additional resources:
- [Emergency planning](/ops/emergency-planning.md)
- [High availability](/ops/availability.md)
- [Monitoring & alarms](/ops/monitoring.md)
+- [Outage incident response](/ops/incident.md)
- [Network topology](/ops/topology.md)
- [Provisioning & allocation](/ops/provisioning.md)
- [PostgreSQL robustness planning](/ops/robust-psql.md)