---
title: Incident response
---

So, everything is on fire. What do you do?

## Don't panic

Take a deep breath. Panicked sysadmins make mistakes. Just relax! An outage is just a bug that needs to be fixed sooner than most.

If you're feeling overwhelmed, stop and take a break. Brew a cup of coffee. Top up your water bottle.

## Urgency levels

We have three levels of urgency:

- **urgent**: requires an immediate response
- **important**: requires a response within 24 hours
- **interesting**: does not require a timely response

**urgent** is used, for example, when a service is down. **important** is used when there is an imminent (but not immediate) concern, such as when a disk is almost full. **important** incidents require a *response* within 24 hours, but that response may involve making a plan which takes more than 24 hours to fully resolve.

Alarms from metrics.sr.ht have an urgency level associated with them which automatically prioritizes the incident accordingly.

## Incident response process

The goal of an incident response is:

1. Understand the problem
1. Solve the problem
1. Prevent it from happening again

In the case of an **urgent** incident, step 3 can be treated as **important** priority following the restoration of service.

## Point admin

If the problem is in your domain, notify #sr.ht.ops that you'll be taking point and start investigating. The point person is the only one who can take actions which change the state of the system, such as:

- Updating status.sr.ht
- Restarting services
- SQL mutations
- Editing config files
- etc

So long as the deployment pipeline is still working (including the availability of an authorized person to merge into the affected repository), prefer to deploy new releases to solve problems rather than hotfixing.

## Peanut gallery

The term is used lovingly, I promise.

Onlookers can offer a support role to the point person, be they other SourceHut sysadmins or members of the SourceHut community.
Their role is doing independent research and offering useful information and suggestions to the point person based on their findings. This includes:

- Reading service logs
- Reading relevant source code
- Reading documentation
- Executing read-only SQL queries
- Acting as a sounding board for the point person

The point person is in charge. They also do not have to listen or respond to your messages while they are busy dealing with the problem. The point person may delegate any of their tasks to a member of the peanut gallery if they see fit.

## After the incident

A brief explanation of the problem and its solution is generally welcome on status.sr.ht, to keep users in the loop. Aim for honesty and transparency, and don't shy away from technical details.

Generally, everyone who needs to be in the loop will be in the loop during an incident. But, if not, make sure that any stakeholders understand what happened, how it was resolved, and what mitigations were established to prevent it from happening again.