-rw-r--r--  ops/index.md                                  1
-rw-r--r--  ops/outages/2023-01-11-unreachable-host.md   92
-rw-r--r--  ops/outages/index.md                           5
3 files changed, 98 insertions, 0 deletions
diff --git a/ops/index.md b/ops/index.md
index 3d2d0bc..8165415 100644
--- a/ops/index.md
+++ b/ops/index.md
@@ -22,6 +22,7 @@ Additional resources:
- [PostgreSQL robustness planning](/ops/robust-psql.md)
- [SourceHut scalability plans](/ops/scale.md)
- [Security incident reports](/ops/security-incidents)
+- [Outage incident reports](/ops/outages)
Next available port number: 5016/5116
diff --git a/ops/outages/2023-01-11-unreachable-host.md b/ops/outages/2023-01-11-unreachable-host.md
new file mode 100644
index 0000000..463eb5d
--- /dev/null
+++ b/ops/outages/2023-01-11-unreachable-host.md
@@ -0,0 +1,92 @@
+# Main database server outage 2023-01-11
+
+* See [this incident][status] on status.sr.ht
+
+[status]: https://status.sr.ht/issues/2023-01-11-unreachable-host/
+
+On January 11th at approximately 13:38 UTC our main database server,
+remilia.sr.ht, suffered a hardware failure and became unreachable. The lack of
+database connectivity caused outages across all services.
+
+## Full details
+
+Around 13:40 UTC on January 11th, multiple alerts started firing indicating
+that most SourceHut services were unavailable. One of these alerts was for
+remilia.sr.ht, a physical host and our main database server, which was also
+unavailable. Drew determined that the database server itself was completely
+unreachable remotely.
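+
+For reference, the underlying availability signal here is the Prometheus `up`
+metric for remilia, the same expression used in the metrics links in the
+timeline below. As a minimal sketch (not our actual alerting setup; the script
+and its use of the endpoint are purely illustrative), the metric can be checked
+via the Prometheus HTTP API:
+
+```python
+# Illustrative sketch: query the Prometheus HTTP API for the `up` metric of
+# remilia.sr.ht, the same expression shown in the metrics links below.
+import json
+import urllib.parse
+import urllib.request
+
+PROMETHEUS = "https://metrics.sr.ht"  # assumed endpoint for this sketch
+QUERY = 'up{instance="remilia.sr.ht:80"}'
+
+url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
+with urllib.request.urlopen(url) as resp:
+    result = json.load(resp)["data"]["result"]
+
+# `up` is 1 when the last scrape succeeded and 0 when it failed.
+for series in result:
+    instance = series["metric"].get("instance", "?")
+    value = series["value"][1]
+    print(instance, "up" if value == "1" else "DOWN")
+```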
+
+At this point it became clear that we needed someone on-site. Our options for
+this are the DC operator staff ("remote hands") and our own emergency contact
+in the area, Calvin. Drew called the DC at 8:45 AM local time; the datacenter's
+official daytime availability begins at 9 AM, and no one picked up the phone.
+We have the ability to page them outside of business hours, but it seemed
+unlikely that this would decrease the response time. Instead, Drew dispatched
+Calvin, who was immediately responsive but located an hour's travel time from
+the DC. Calvin started making his way to the datacenter.
+
+While waiting, we explored various scenarios and our options to deal with them.
+The plan was to attempt to diagnose the issue with the database server, and, if
+it was unrecoverable, move the hard drives to another host and bring it up as a
+clone of the database server. Our choice of secondary server was cirno1.sr.ht,
+one of our build workers, because it is redundant with cirno2.sr.ht and pulling
+it out of service would not affect service availability.
+
+Once the DC staff were on-site, we had them power-cycle the server. It did not
+come back up and remained non-responsive. However, this did suggest that the
+disks were not at fault. Shortly thereafter, Calvin arrived on-site. After some
+more diagnostic work, he proceeded to move remilia's disks into cirno1.sr.ht.
+This brought remilia back in its old state, running in cirno1's chassis.
+
+Once the new remilia was up and running, all services started to recover.
+Taking cirno1 out of service leaves [builds.sr.ht][builds] with reduced build
+capacity, but repurposing its hardware enabled us to quickly restore general
+service.
+
+[builds]: https://builds.sr.ht
+
+See the [timeline](#timeline) below for more details.
+
+We'd like to especially thank Calvin for his work on-site.
+
+## Complicating factors
+
+* Only one of the three SourceHut staff had a valid passport available in case
+  serious on-site work (in the US) had been required immediately.
+* While the off-site database backup was working, it could not have served as a
+ failover database host - this has already been fixed.
+
+## Planned mitigations
+
+* We are re-prioritizing the EU migration in general, and specifically:
+  * a new database failover
+  * a new EU build worker to backfill build capacity
+* There was initially some confusion about the disk layout in remilia.sr.ht --
+ we will record a better hardware inventory for future reference.
+
+## Timeline
+
+All times in UTC.
+
+**2023-01-11 13:38**: Prometheus [starts failing][metrics_start] to scrape metrics from remilia
+
+[metrics_start]: https://metrics.sr.ht/graph?g0.expr=up%7Binstance%3D%22remilia.sr.ht%3A80%22%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=13m21s70ms&g0.end_input=2023-01-11%2013%3A47%3A15&g0.moment_input=2023-01-11%2013%3A47%3A15
+
+**2023-01-11 13:40**: Alerts start to fire for a wide range of services.
+
+**2023-01-11 13:42**: Finding no means to reach the server remotely, Drew starts making phone calls.
+
+**2023-01-11 13:44**: Postfix is shut down to avoid the mail delivery pipeline spilling over, and Alertmanager is disabled to silence the myriad of alarms.
+
+**2023-01-11 13:52**: Being unable to reach anyone in the DC, Drew activates the emergency on-call contact in the DC area. Calvin starts heading to the datacenter.
+
+**2023-01-11 15:03**: Drew reaches DC ops, determines that the hardware (sans the disks) is likely a lost cause.
+
+**2023-01-11 15:39**: The on-call contact arrives at the DC. After confirming DC ops' diagnosis, Calvin swaps the disks into cirno1.sr.ht and brings it online.
+
+**2023-01-11 16:04**: Prometheus [successfully scrapes][metrics_end] metrics from remilia (running in cirno1's chassis).
+
+[metrics_end]: https://metrics.sr.ht/graph?g0.expr=up%7Binstance%3D%22remilia.sr.ht%3A80%22%7D&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=16m27s33ms&g0.end_input=2023-01-11%2016%3A12%3A50&g0.moment_input=2023-01-11%2016%3A12%3A50
+
+**2023-01-11 16:06**: Affected services begin to come back up.
+
+**2023-01-11 16:09**: All services back online.
diff --git a/ops/outages/index.md b/ops/outages/index.md
new file mode 100644
index 0000000..22093d8
--- /dev/null
+++ b/ops/outages/index.md
@@ -0,0 +1,5 @@
+# Service outage post-mortems
+
+May this list never grow.
+
+- [Main database server failure](/ops/outages/2023-01-11-unreachable-host.md)