-rw-r--r-- | ops/index.md                               |  1 |
-rw-r--r-- | ops/outages/2023-01-11-unreachable-host.md | 92 |
-rw-r--r-- | ops/outages/index.md                       |  5 |
3 files changed, 98 insertions, 0 deletions
diff --git a/ops/index.md b/ops/index.md
index 3d2d0bc..8165415 100644
--- a/ops/index.md
+++ b/ops/index.md
@@ -22,6 +22,7 @@ Additional resources:
 - [PostgreSQL robustness planning](/ops/robust-psql.md)
 - [SourceHut scalability plans](/ops/scale.md)
 - [Security incident reports](/ops/security-incidents)
+- [Outage incident reports](/ops/outages)
 
 Next available port number: 5016/5116
diff --git a/ops/outages/2023-01-11-unreachable-host.md b/ops/outages/2023-01-11-unreachable-host.md
new file mode 100644
index 0000000..463eb5d
--- /dev/null
+++ b/ops/outages/2023-01-11-unreachable-host.md
@@ -0,0 +1,92 @@
+# Main database server outage 2023-01-11
+
+* See [this incident][status] on status.sr.ht
+
+[status]: https://status.sr.ht/issues/2023-01-11-unreachable-host/
+
+On January 11th at approximately 13:38 UTC our main database server,
+remilia.sr.ht, suffered a hardware failure and became unreachable. The lack of
+database connectivity caused outages across all services.
+
+## Full details
+
+Around 13:40 UTC on January 11th, multiple alerts started firing because most
+SourceHut services were unavailable. One of these alerts indicated that
+remilia.sr.ht, a physical host and our main database server, was itself
+unavailable. Drew determined that the database server was completely
+unreachable remotely.
+
+At this point it became clear that we needed someone on-site. Our options for
+this are the DC operator staff ("remote hands") and our own emergency contact
+in that area, Calvin. Drew called the DC at 8:45 AM local time. The official
+daytime availability for the datacenter begins at 9 AM, and no one picked up
+the phone. We have the ability to page them outside of business hours, but it
+seemed unlikely that this would decrease the response time. Instead, Drew
+dispatched Calvin, who was immediately responsive but was located an hour's
+travel time from the DC. Calvin started making his way to the datacenter.
+
+While waiting, we explored various scenarios and our options to deal with them.
+The plan was to attempt to diagnose the issue with the database server and, if
+it was unrecoverable, move the hard drives to another host and bring it up as a
+clone of the database server. Our choice of secondary server was cirno1.sr.ht,
+one of our build workers, because it is redundant with cirno2.sr.ht and pulling
+it out of service would not affect service availability.
+
+Once the DC staff was on-site we had them power-cycle the server. It did not
+come back up and remained non-responsive. However, this did suggest that the
+disks were not at fault. Shortly thereafter, Calvin arrived on-site. After some
+more diagnostic work, he swapped remilia's disks into cirno1.sr.ht. This
+brought remilia back in its old state, running in cirno1's chassis.
+
+Once the new remilia was up and running, all services started to recover. The
+absence of cirno1 leaves [builds.sr.ht][builds] with reduced build capacity,
+but repurposing it enabled us to quickly restore general service.
+
+[builds]: https://builds.sr.ht
+
+See the [timeline](#timeline) below for more details.
+
+We'd like to especially thank Calvin for his work on-site.
+
+## Complicating factors
+
+* Only one of the three SourceHut staff had a valid passport available in case
+  serious on-site work (in the US) had become necessary immediately.
+* While the off-site database backup was working, it could not have served as
+  a failover database host; this has already been fixed.
+
+## Planned mitigations
+
+* We are re-prioritizing the EU migration in general, and specifically:
+  * a new database failover host
+  * a new EU build worker to backfill build capacity
+* There was initially some confusion about the disk layout in remilia.sr.ht --
+  we will record a better hardware inventory for future reference.
+
+## Timeline
+
+All times in UTC.
+
+**2023-01-11 13:38**: Prometheus [starts failing][metrics_start] to scrape metrics from remilia.
+
+[metrics_start]: https://metrics.sr.ht/graph?g0.expr=up%7Binstance%3D%22remilia.sr.ht%3A80%22%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=13m21s70ms&g0.end_input=2023-01-11%2013%3A47%3A15&g0.moment_input=2023-01-11%2013%3A47%3A15
+
+**2023-01-11 13:40**: Alerts start to fire for a wide range of services.
+
+**2023-01-11 13:42**: Finding no means to reach the server remotely, Drew starts making phone calls.
+
+**2023-01-11 13:44**: Postfix is shut down to keep the mail delivery pipeline from spilling over, and Alertmanager is disabled to silence the flood of alarms.
+
+**2023-01-11 13:52**: Unable to reach anyone at the DC, Drew activates the emergency on-call contact in the DC area. Calvin starts heading to the datacenter.
+
+**2023-01-11 15:03**: Drew reaches DC ops and determines that the hardware (sans the disks) is likely a lost cause.
+
+**2023-01-11 15:39**: The on-call contact arrives at the DC. After confirming the DC ops diagnosis, Calvin swaps remilia's disks into cirno1.sr.ht and brings it online.
+
+**2023-01-11 16:04**: Prometheus [successfully scrapes][metrics_end] metrics from remilia (running in cirno1's chassis).
+
+[metrics_end]: https://metrics.sr.ht/graph?g0.expr=up%7Binstance%3D%22remilia.sr.ht%3A80%22%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=16m27s33ms&g0.end_input=2023-01-11%2016%3A12%3A50&g0.moment_input=2023-01-11%2016%3A12%3A50
+
+**2023-01-11 16:06**: Affected services begin to come back up.
+
+**2023-01-11 16:09**: All services back online.
diff --git a/ops/outages/index.md b/ops/outages/index.md
new file mode 100644
index 0000000..22093d8
--- /dev/null
+++ b/ops/outages/index.md
@@ -0,0 +1,5 @@
+# Service outage post-mortems
+
+May this list never grow.
+
+- [Main database server failure](/ops/outages/2023-01-11-unreachable-host.md)
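
For reference, the host-down detection described in the timeline relies on Prometheus's built-in `up` metric, which the linked graphs query. Below is a minimal alerting-rule sketch of that kind of check; the group and alert names, the `for` duration, and the severity label are illustrative assumptions rather than the actual SourceHut alerting configuration -- only the `up{instance="remilia.sr.ht:80"}` expression comes from the metrics links above.

```yaml
# Illustrative Prometheus alerting rule for an unreachable scrape target.
# Names, duration, and labels are assumptions; only the up{...} expression
# is taken from the metrics links in the timeline.
groups:
  - name: host-availability
    rules:
      - alert: HostUnreachable
        expr: up{instance="remilia.sr.ht:80"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 2 minutes"
```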