Diffstat (limited to 'ops')
-rw-r--r-- ops/incident.md | 88
-rw-r--r-- ops/index.md | 1
2 files changed, 89 insertions, 0 deletions
diff --git a/ops/incident.md b/ops/incident.md
new file mode 100644
index 0000000..1f52817
--- /dev/null
+++ b/ops/incident.md
@@ -0,0 +1,88 @@
+---
+title: Incident response
+---
+
+So, everything is on fire. What do you do?
+
+## Don't panic
+
+Take a deep breath. Panicked sysadmins make mistakes. Just relax! An outage is
+just a bug that needs to be fixed sooner than most.
+
+If you're feeling overwhelmed, stop and take a break. Brew a cup of coffee. Top
+up your water bottle.
+
+## Urgency levels
+
+We have three levels of urgency:
+
+- **urgent**: requires an immediate response
+- **important**: requires a response within 24 hours
+- **interesting**: does not require a timely response
+
+**urgent** is used, for example, when a service is down. **important** is used
+when there is an imminent (but not immediate) concern, such as when a disk
+is almost full. **important** incidents require a *response* within 24 hours,
+but that response may involve making a plan which takes more than 24 hours to
+fully resolve.
+
+Alarms from metrics.sr.ht have an urgency level associated with them which
+automatically prioritizes the incident accordingly.
+
+## Incident response process
+
+The goal of an incident response is:
+
+1. Understand the problem
+1. Solve the problem
+1. Prevent it from happening again
+
+In the case of an **urgent** incident, step 3 can be treated as **important**
+priority following the restoration of service.
+
+## Point admin
+
+If the problem is in your domain, notify #sr.ht.ops that you'll be taking point
+and start investigating. The point person is the only one who can take actions
+which change the state of the system, such as:
+
+- Updating status.sr.ht
+- Restarting services
+- SQL mutations
+- Editing config files
+- etc
+
+So long as the deployment pipeline is still working (including the availability
+of an authorized person to merge into the affected repository), prefer to deploy
+new releases to solve problems rather than hotfixing.
+
+## Peanut gallery
+
+The term is used lovingly, I promise. Onlookers can offer a support role to the
+point person, be they other SourceHut sysadmins or members of the SourceHut
+community. Their role is doing independent research and offering useful
+information and suggestions to the point person based on their findings. This
+includes:
+
+- Reading service logs
+- Reading relevant source code
+- Reading documentation
+- Executing read-only SQL queries
+- Acting as a sounding board for the point person
+
+The point person is in charge. They also do not have to listen or respond to
+your messages while they are busy dealing with the problem.
+
+The point person may delegate any of their tasks to a member of the peanut
+gallery if they see fit.
+
+## After the incident
+
+A brief explanation of the problem and its solution is generally welcome on
+status.sr.ht, to keep users in the loop. Aim for honesty and transparency, and
+don't shy away from technical details.
+
+Generally, everyone who needs to be in the loop will be in the loop during an
+incident. But, if not, make sure that any stakeholders understand what happened,
+how it was resolved, and what mitigations were established to prevent it from
+happening again.
diff --git a/ops/index.md b/ops/index.md
index 62cbe00..8e01320 100644
--- a/ops/index.md
+++ b/ops/index.md
@@ -16,6 +16,7 @@ Additional resources:
 - [Emergency planning](/ops/emergency-planning.md)
 - [High availability](/ops/availability.md)
 - [Monitoring & alarms](/ops/monitoring.md)
+- [Outage incident response](/ops/incident.md)
 - [Network topology](/ops/topology.md)
 - [Provisioning & allocation](/ops/provisioning.md)
 - [PostgreSQL robustness planning](/ops/robust-psql.md)
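
For reviewers, the urgency rules that ops/incident.md introduces can be restated as a small Python sketch. It is not part of the patch and does not reflect how metrics.sr.ht represents alarms. The names `Urgency`, `RESPONSE_WINDOW`, `respond_by`, and `prevention_urgency` are hypothetical. Modelling "immediate" as a zero-length window and leaving non-urgent incidents at their own level for step 3 are both assumptions.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Urgency(Enum):
    """The three urgency levels named in ops/incident.md."""
    URGENT = "urgent"            # e.g. a service is down
    IMPORTANT = "important"      # e.g. a disk is almost full
    INTERESTING = "interesting"  # no timely response required


# Response windows as stated in the document. None means no deadline.
# Treating "immediate" as a zero-length window is an assumption of this sketch.
RESPONSE_WINDOW = {
    Urgency.URGENT: timedelta(0),
    Urgency.IMPORTANT: timedelta(hours=24),
    Urgency.INTERESTING: None,
}


def respond_by(urgency: Urgency, raised_at: datetime) -> Optional[datetime]:
    """Return the latest time a first response is due, or None if unbounded.

    The deadline covers the response only; the document notes that the
    full resolution of an important incident may take longer than 24 hours.
    """
    window = RESPONSE_WINDOW[urgency]
    if window is None:
        return None
    return raised_at + window


def prevention_urgency(incident: Urgency) -> Urgency:
    """Urgency of step 3 (prevention) once service has been restored.

    The document only covers the urgent case, which drops to important.
    Returning other levels unchanged is an assumption.
    """
    if incident is Urgency.URGENT:
        return Urgency.IMPORTANT
    return incident
```

For example, `respond_by(Urgency.IMPORTANT, raised_at)` yields a deadline 24 hours after the alarm was raised, while `prevention_urgency(Urgency.URGENT)` reflects that the follow-up work after an outage can be scheduled as important.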