author     Drew DeVault <sir@cmpwn.com>  2020-03-05 17:06:47 -0500
committer  Drew DeVault <sir@cmpwn.com>  2020-03-05 17:06:47 -0500
commit     747994f68fa7eb21a68e2aa7e04085369b9c92ab (patch)
tree       ef93615cd60cd126714c150833b0fafbac933af7 /ops
parent     8a8161fd6aec91dae49b57bb3337c0f5dafb1590 (diff)
download   sr.ht-docs-747994f68fa7eb21a68e2aa7e04085369b9c92ab.tar.gz
Add operational documentation
Diffstat (limited to 'ops')
-rw-r--r--  ops/availability.md        68
-rw-r--r--  ops/backups.md            121
-rw-r--r--  ops/emergency-planning.md  21
-rw-r--r--  ops/index.md               48
-rw-r--r--  ops/monitoring.md          48
-rw-r--r--  ops/provisioning.md        48
-rw-r--r--  ops/topology.md           130
7 files changed, 484 insertions, 0 deletions
diff --git a/ops/availability.md b/ops/availability.md
new file mode 100644
index 0000000..324df99
--- /dev/null
+++ b/ops/availability.md
@@ -0,0 +1,68 @@
+---
+title: High availability
+---
+
+High availability has not been a priority for SourceHut during early alpha
+development, but is becoming more important heading into the beta. This page is
+more about our plans than it is about our implementation.
+
+The priorities are, in order:
+
+1. Highly available web services
+2. Highly available database
+3. Highly available mail system
+
+# Web services
+
+The web services are already mostly designed to avoid keeping local state
+around, with this eventual goal in mind. We should investigate load balancing
+(likely with haproxy) so that nodes can be brought into and out of service
+without downtime, and make this the norm for routine deployments.
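+
+If haproxy does end up in front of the web services, its runtime API would let
+us drain a node before deploying to it and re-enable it afterwards. A rough
+sketch, not our current setup; the socket path, backend, and server names are
+placeholders:
+
+```
+#!/bin/sh -eu
+# Drain a web node out of the load balancer before deploying to it.
+echo "set server websrht/node1 state drain" | \
+    socat stdio /run/haproxy/admin.sock
+
+# ... deploy and restart the service on node1 ...
+
+# Put the node back into rotation.
+echo "set server websrht/node1 state ready" | \
+    socat stdio /run/haproxy/admin.sock
+```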
+
+## Special considerations for deployments
+
+- SQL migrations should be designed so that both the old and the new version of
+  a service work correctly against both the old and the new schema. This will
+  often require splitting a migration over several releases.
+
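+For example, a column rename could be split across two releases instead of
+being done in one migration. This is only a sketch; the database, table, and
+column names are illustrative, not real sr.ht schema:
+
+```
+#!/bin/sh -eu
+# Release N: add the new column and backfill it. Old code keeps using the
+# old column; new code reads the new column and writes to both.
+psql -d example <<'EOF'
+ALTER TABLE widget ADD COLUMN long_description varchar;
+UPDATE widget SET long_description = description;
+EOF
+
+# Release N+1: once no old code remains, drop the old column.
+psql -d example <<'EOF'
+ALTER TABLE widget DROP COLUMN description;
+EOF
+```
+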
+## Special considerations for git.sr.ht, hg.sr.ht
+
+We need to use something like [repospanner](https://github.com/repoSpanner/repoSpanner)
+to distribute git pushes among several nodes.
+
+Can we do something similar for Mercurial?
+
+Related to [backups](/ops/backups.md).
+
+## Special considerations for builds.sr.ht
+
+The builds.sr.ht worker needs to be updated so that we can reboot it without
+terminating anyone's jobs. One idea would be to move the job supervisor into a
+separate process. The difficulty with this is having the restarted scheduler
+adopt the still-running job processes, and avoiding taking on new work from
+Celery until their resources are freed up.
+
+A possible workaround is to stop accepting new jobs, let the running jobs drain
+while other build hosts pick up the slack, then reboot and accept new jobs once
+more.
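+
+A sketch of what that drain could look like using Celery's remote control
+commands; the Celery app and queue names here are hypothetical, not the actual
+builds.sr.ht configuration:
+
+```
+#!/bin/sh -eu
+# Stop consuming new jobs from the queue on this host only; jobs that are
+# already running continue to completion.
+celery -A buildsrht.runner control cancel_consumer builds -d "celery@$(hostname)"
+
+# Check what is still running here; repeat until it reports nothing active.
+celery -A buildsrht.runner inspect active -d "celery@$(hostname)"
+
+# Once drained, reboot; the worker resumes consuming when it comes back up.
+reboot
+```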
+
+# Database
+
+????
+
+[pgbouncer](https://www.pgbouncer.org/) will probably be of some use. I suspect
+that we will find it difficult to reach zero-downtime failovers. Ideally, we
+would be able to do PostgreSQL major version upgrades with minimal downtime.
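+
+One property worth testing on that network: pgbouncer's admin console can pause
+traffic across a short failover so that clients stall instead of erroring. A
+sketch, assuming the admin console on pgbouncer's default port:
+
+```
+#!/bin/sh -eu
+# PAUSE waits for in-flight transactions to finish, then holds new queries
+# at the pooler instead of sending them to the backend.
+psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'PAUSE;'
+
+# ... promote the standby and repoint pgbouncer's [databases] section ...
+psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'
+
+# Release the queued clients against the new primary.
+psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'RESUME;'
+```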
+
+Care will need to be taken to avoid silently dropping writes.
+
+We need to set up an experimental test network for testing out these ideas, and
+make a plan.
+
+# Highly available mail system
+
+This should be fairly trivial. We need to move the work distribution Redis
+server from the mail host to the lists host (duh), and then just set up multiple
+MX records. Zero-downtime migrations can be accomplished by removing an MX
+record, letting the mail flush, and then doing whatever maintenance is
+necessary.
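+
+A sketch of the pre-maintenance check for draining a mail host this way,
+assuming a sendmail-compatible MTA on the host:
+
+```
+#!/bin/sh -eu
+# After removing the host's MX record, confirm it is gone from public DNS.
+dig +short MX sr.ht
+
+# Then confirm the local queue has flushed before taking the host down.
+mailq
+```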
diff --git a/ops/backups.md b/ops/backups.md
new file mode 100644
index 0000000..74de1ec
--- /dev/null
+++ b/ops/backups.md
@@ -0,0 +1,121 @@
+---
+title: "SourceHut backups & redundancy"
+---
+
+The integrity of user data is of paramount importance to SourceHut. Most of our
+data-critical systems are triple-redundant or better.
+
+# Local redundancy
+
+All of our data-critical systems use ZFS with at least 3 drives, which allows up
+to one drive to fail without data loss. Our large storage systems use 5+ drives,
+which allows several drives to fail.
+
+Our standard hardware loadout calls for hard drives (or SSDs) sourced from a
+variety of vendors and drive models, to avoid using several drives from the
+same production batch. This reduces the risk of cascading failures during RAID
+recovery.
+
+## Monitoring
+
+We do an automatic scrub of all ZFS pools on the 1st of each month and forward a
+report to the [ops mailing list][ops ml].
+
+[ops ml]: https://lists.sr.ht/~sircmpwn/sr.ht-ops
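+
+A minimal sketch of such a scrub-and-report job; the pool name and report
+format are illustrative, and the real job may differ:
+
+```
+#!/bin/sh -eu
+# Run from cron on the 1st of the month: scrub the pool, then mail a report.
+zpool scrub data
+
+# Wait for the scrub to complete before reporting.
+while zpool status data | grep -q 'scrub in progress'; do
+    sleep 600
+done
+
+{
+    printf 'To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>\n'
+    printf 'Subject: zpool scrub report %s\n' "$(date)"
+    printf '\n'
+    zpool status data
+} | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
+```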
+
+## Areas for improvement
+
+1. Automatic ZFS snapshots are only configured for off-site backup hosts. We
+ should configure this on the primary as well. We also need monitoring to
+ ensure that our snapshots are actually being taken.
+2. Investigate something like [repospanner](https://github.com/repoSpanner/repoSpanner)
+   to block git pushes until the data is known to be received and stored across
+   multiple servers, which would make git backups effectively real-time.
+
+# Off-site backups
+
+We have an off-site backup system in a separate datacenter (in a different city)
+from our primary datacenter. We use borg backup to send backups to this server,
+typically hourly. The standard backup script is:
+
+```
+#!/bin/sh -eu
+export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
+export BORG_PASSPHRASE='CHANGE ME'
+
+backup_start="$(date -u +'%s')"
+
+echo "borg create"
+borg create \
+ ::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
+ /var/lib/git \
+ -e /var/lib/git/.ssh \
+ -e /var/lib/git/.gnupg \
+ -e /var/lib/git/.ash_history \
+ -e /var/lib/git/.viminfo \
+ -e '/var/lib/git/*/*/objects/incoming-*' \
+ -e '*.keep' \
+ --compression lz4 \
+ --one-file-system \
+ --info --stats "$@"
+
+echo "borg prune"
+borg prune \
+ --keep-hourly 48 \
+ --keep-daily 60 \
+ --keep-weekly -1 \
+ --info --stats
+
+stats() {
+ backup_end="$(date -u +'%s')"
+ printf '# TYPE last_backup gauge\n'
+ printf '# HELP last_backup Unix timestamp of last backup\n'
+ printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
+ printf '# TYPE backup_duration gauge\n'
+ printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
+ printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
+}
+
+stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/CHANGE ME
+```
+
+Our `check` script is:
+
+```
+#!/bin/sh -eu
+export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
+export BORG_PASSPHRASE='CHANGE ME'
+
+check() {
+ cat <<-EOF
+ To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
+ From: CHANGE ME backups <borg@git.sr.ht>
+ Subject: CHANGE ME backups report $(date)
+
+ EOF
+ borg check --last 2 --info 2>&1
+}
+
+check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
+```
+
+## Monitoring
+
+Each backup reports its timestamp and duration to our Prometheus Pushgateway
+(see [monitoring](/ops/monitoring.md)). We have an alarm configured for when the
+backup age exceeds 48 hours. The age of all borg backups may be [viewed
+here][backup age] (in hours).
+
+We also conduct a weekly `borg check` (on Sunday night, UTC) and forward the
+results to the [ops mailing list][ops ml].
+
+[backup age]: https://metrics.sr.ht/graph?g0.range_input=1h&g0.expr=(time()%20-%20last_backup)%20%2F%2060%20%2F%2060&g0.tab=1
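+
+The same check can be made from the shell against the public Prometheus API,
+e.g. to list any backups older than the 48 hour alarm threshold (an empty
+result set means every backup is current):
+
+```
+#!/bin/sh -eu
+# Query the public Prometheus instance for backups older than 48 hours.
+curl -s 'https://metrics.sr.ht/api/v1/query' \
+    --data-urlencode 'query=(time() - last_backup) / 3600 > 48'
+```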
+
+## Areas for improvement
+
+1. Our PostgreSQL replication strategy is somewhat poor, due to several
+ different approaches being experimented with on the same server, and lack of
+ monitoring. This needs to be rethought. Related to
+ [high availability](/ops/availability.md).
+2. It would be nice if we could find a way to encapsulate our borg scripts in an
+ installable Alpine package.
diff --git a/ops/emergency-planning.md b/ops/emergency-planning.md
new file mode 100644
index 0000000..869273a
--- /dev/null
+++ b/ops/emergency-planning.md
@@ -0,0 +1,21 @@
+---
+title: Emergency planning
+---
+
+On several occasions, outages have been simulated and the motions of resolving
+them carried out. This is useful for:
+
+1. Testing that our systems can tolerate or recover from such failures
+2. Familiarizing operators with the resolution procedures
+
+This has been conducted informally. We should give it more structure, and plan
+these events regularly.
+
+Ideas:
+
+- Simulate disk failures (yank out a hard drive!)
+- Simulate outages for redundant services
+ (see [availability](/ops/availability.md))
+- Kill celery workers and see how they cope with catching up again
+- Restore systems from backup, then put the restored system into normal service
+ and tear down the original
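+
+For the restore drill, a starting point could be as simple as extracting the
+most recent archive onto a scratch host and standing a staging instance up on
+top of it. A sketch, reusing the environment from the
+[backup scripts](/ops/backups.md):
+
+```
+#!/bin/sh -eu
+export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
+export BORG_PASSPHRASE='CHANGE ME'
+
+# Pick the most recent archive and extract it into a scratch directory.
+archive="$(borg list --last 1 --format '{archive}')"
+mkdir -p /tmp/restore
+cd /tmp/restore
+borg extract ::"$archive"
+```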
diff --git a/ops/index.md b/ops/index.md
new file mode 100644
index 0000000..e0814b0
--- /dev/null
+++ b/ops/index.md
@@ -0,0 +1,48 @@
+---
+title: SourceHut operational manual
+---
+
+This subset of the manual documents our approach to the operations and
+maintenance of the hosted service, sr.ht. You may find this useful for running
+your own hosted sr.ht service, or for evaluating whether our practices &
+policies meet your requirements for availability and robustness. You might also
+just find this stuff interesting, as SourceHut is one of the few largeish
+services which is not hosted in The Cloud™.
+
+- [Backups & redundancy](/ops/backups.md)
+- [Emergency planning](/ops/emergency-planning.md)
+- [High availability](/ops/availability.md)
+- [Monitoring & alarms](/ops/monitoring.md)
+- [Network topology](/ops/topology.md)
+- [Provisioning & allocation](/ops/provisioning.md)
+
+# Operational resources
+
+## Status page
+
+[status.sr.ht](https://status.sr.ht) is hosted on third-party infrastructure and
+is used to communicate about upcoming planned outages, and to provide updates
+during incident resolution. Planned outages are also posted to
+[sr.ht-announce](https://lists.sr.ht/~sircmpwn/sr.ht-announce) in advance.
+
+The status page is updated by a human being, who is probably busy fixing the
+problem. You may want to check the next resource as well:
+
+## Monitoring & alarms
+
+Our Prometheus instance at [metrics.sr.ht](https://metrics.sr.ht) is available
+to the public for querying our monitoring systems and viewing the state of
+various alarms.
+
+## Mailing list
+
+The [sr.ht-ops](https://lists.sr.ht/~sircmpwn/sr.ht-ops) mailing list is used
+for automated reports from our services, including alarm notifications of
+"important" or "urgent" severity and operational status reports for backups and
+other systems.
+
+## IRC channel
+
+The `#sr.ht.ops` IRC channel on irc.freenode.net is used for triage and
+coordination during outages, and has a real-time feed of alarms raised by our
+monitoring system.
diff --git a/ops/monitoring.md b/ops/monitoring.md
new file mode 100644
index 0000000..833a271
--- /dev/null
+++ b/ops/monitoring.md
@@ -0,0 +1,48 @@
+---
+title: "Monitoring & alarms"
+---
+
+We monitor everything with Prometheus, and configure alarms with alertmanager.
+
+# Public metrics
+
+Our Prometheus instance is publicly available at
+[metrics.sr.ht](https://metrics.sr.ht).
+
+## Areas for improvement
+
+1. We should make dashboards. They would be pretty to look at and could be a
+   useful tool for root cause analysis. Note that some users who run their own
+   Grafana instance have pointed it at our public Prometheus data and made some
+   simple dashboards - I would be open to community ownership of this.
+
+# Pushgateway
+
+A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
+connections from [our subnet](/ops/topology.md).
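+
+Hosts inside the subnet push metrics to it over plain HTTP, mirroring what the
+backup scripts do; for example (the job name here is illustrative):
+
+```
+#!/bin/sh -eu
+# Push a one-off gauge to the pushgateway under a hypothetical job name.
+printf 'example_metric 42\n' | \
+    curl --data-binary @- https://push.metrics.sr.ht/metrics/job/example
+```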
+
+# Aggregation gateway
+
+[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
+is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
+from [our subnet](/ops/topology.md).
+
+# Alertmanager
+
+We use alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
+sinks.
+
+- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
+- **important** alerts are sent to the ops mailing list and to the IRC channel
+- **urgent** alerts page Drew's phone and are also sent to the mailing list and
+  the IRC channel
+
+Some security-related alarms are sent directly to Drew and are not made public.
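+
+The public alerts and their current state should also be queryable directly
+from the Prometheus HTTP API, which is handy when the IRC feed is unavailable:
+
+```
+#!/bin/sh -eu
+# List the currently firing/pending alerts from the public Prometheus API.
+curl -s https://metrics.sr.ht/api/v1/alerts
+```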
+
+# Areas for improvement
+
+1. It would be nice to have centralized logging. There is sensitive information
+   in some of our logs, so this probably can't be made public.
+2. Several of our physical hosts are not being monitored. This will be resolved
+   during [planned maintenance](https://status.sr.ht/issues/2020-03-11-planned-outage/)
+   on March 11th, 2020.
diff --git a/ops/provisioning.md b/ops/provisioning.md
new file mode 100644
index 0000000..0bbfba0
--- /dev/null
+++ b/ops/provisioning.md
@@ -0,0 +1,48 @@
+---
+title: "Server provisioning & allocation"
+---
+
+Standards for provisioning of VMs and physical hosts.
+
+# Alpine Linux
+
+Our standard loadout uses Alpine Linux for all hosts and guests.
+
+- **TODO**: ns1 and ns2 are on Debian; they need to be reprovisioned (add an
+  ns3 while we're at it?)
+
+# Physical hosts
+
+## VM hosts
+
+Our current VM hosts won't scale much further. We need to figure out a better
+hardware loadout for these going forward.
+
+## High performance VM hosts
+
+For performance-critical services (presently only git.sr.ht demands this), we
+do have a standard loadout:
+
+- AMD EPYC 7402 (24 cores, 48 threads)
+- Micron 36ASF2G72PZ-2G6F1 RAM (4x, 64G total)
+- 1x NVMe for host system, ext4; WD Black 1T
+- 4x SSD on SATA, direct passthrough to guest
+
+This is spec'd for CPU- and I/O-intensive workloads.
+
+## Build hosts
+
+builds.sr.ht uses dedicated build runners. Our current standard is:
+
+- AS-1013S-MTR SuperMicro barebones
+- AMD EPYC 7281 (16 cores, 32 threads)
+- M393A4K40CB2-CTD RAM (4x, 128G total)
+- 1x NVMe for root, ext4; WD Black 1T
+- 3x HDD on SATA for /var, ZFS; 1T each, various vendors/models
+
+This configuration supports up to 16 parallel build slots.
+
+# Virtual machines
+
+There is no standard loadout - tune the specifications for the task at hand.
+As a rule, each VM runs exactly one service, and is sized accordingly.
diff --git a/ops/topology.md b/ops/topology.md
new file mode 100644
index 0000000..95b53b8
--- /dev/null
+++ b/ops/topology.md
@@ -0,0 +1,130 @@
+---
+title: SourceHut network topology
+---
+
+We don't use a NAT (yet). This is a list of where our IP addresses are
+allocated, and of the guests running on each host.
+
+# Subnets
+
+- 173.195.146.128/25 (255.255.255.128)
+- 2604:BF00:710::/64
+
+**Gateways**
+
+- 173.195.146.129
+- 2604:BF00:710::1
+
+# Allocations
+
+**Next IPs**
+
+Virtual machines (grows upwards):
+
+- 173.195.146.137 (reclaimed from mail.sr.ht v1)
+- 173.195.146.140 (reclaimed from metrics.sr.ht v1)
+- 173.195.146.154
+
+Hosts (grows downwards):
+
+- 173.195.146.243
+
+## cirno1.sr.ht
+
+Purpose: build slave
+
+Host IP: 173.195.146.249, 2604:bf00:710:0:ae1f:6bff:fead:55a
+
+## cirno2.sr.ht
+
+Purpose: build slave
+
+Host IP: 173.195.146.244, 2604:bf00:710:0:ae1f:6bff:fe79:a33e
+
+## yukari.sr.ht
+
+Purpose: file storage
+
+Host IP: 173.195.146.250, 2604:bf00:710:0:230:48ff:fedf:3552
+
+## remilia.sr.ht
+
+Purpose: SQL primary
+
+Host IP: 173.195.146.251, 2604:bf00:710:0:230:48ff:fe7d:599a
+
+## alice1.sr.ht
+
+Purpose: VM host
+
+Host IP: 208.88.54.60 (**old subnet**), 2604:bf00:710:0:225:90ff:fe09:440
+
+### Guests
+
+- packages.knightos.org: 173.195.146.130
+- ns2.sr.ht: 173.195.146.138, 2604:bf00:710:0:5054:ff:fe77:d4b4/64
+- man.sr.ht: 173.195.146.146, 2604:bf00:710:0:5054:ff:feb7:8f16
+- static1.sr.ht: 173.195.146.149, 2604:bf00:710:0:5054:ff:fe0c:b76e
+- metrics.sr.ht: 173.195.146.153, 2604:bf00:710:0:5054:ff:fee3:409
+
+## alice2.sr.ht
+
+Purpose: VM host
+
+Host IP: 208.88.54.61 (**old subnet**), 2604:bf00:710:0:230:48ff:fedc:ab06
+
+### Guests
+
+- drewdevault.com: 173.195.146.133
+- todo.sr.ht: 173.195.146.145, 2604:bf00:710:0:5054:ff:fea1:a941
+- dispatch.sr.ht: 173.195.146.147, 2604:bf00:710:0:5054:ff:fe44:efcb
+- hg.sr.ht: 173.195.146.139, 2604:bf00:710:0:5054:ff:fe25:1aa6
+
+## alice3.sr.ht
+
+Purpose: VM host
+
+Host IP: 208.88.54.62 (**old subnet**), 2604:bf00:710:0:230:48ff:fedb:c474
+
+### Guests
+
+- cmpwn.com: 173.195.146.132
+- ns1.sr.ht: 173.195.146.135, 2604:bf00:710:0:5054:ff:fe65:e92f
+- irc.sircmpwn.com: 173.195.146.141
+- meta.sr.ht: 173.195.146.143, 2604:bf00:710:0:5054:ff:fe97:74b3
+- lists.sr.ht: 173.195.146.144, 2604:bf00:710:0:5054:ff:fec4:6bfb
+- builds.sr.ht: 173.195.146.148, 2604:bf00:710:0:5054:ff:fe50:b8e1
+- paste.sr.ht: 173.195.146.150, 2604:bf00:710:0:5054:ff:fec4:3586
+- mail-b.sr.ht: 173.195.146.151, 2604:bf00:710:0:5054:ff:fee5:c082
+
+## alice4.sr.ht
+
+Purpose: VM host (**out of service**)
+
+Host IP: 173.195.146.248
+
+**Note**: this machine has severe issues and needs to be diagnosed. Don't host
+anything here.
+
+## patchouli.sr.ht
+
+Purpose: blob storage
+
+Host IP: 173.195.146.247, 2604:bf00:710:0:ae1f:6bff:fec2:5502
+
+## flandre.sr.ht
+
+Purpose: ppc64le build host
+
+Host IP: 173.195.146.246, 2604:bf00:710:0:2e09:4dff:fe00:992
+
+## tenshi.sr.ht
+
+Purpose: High-performance VM host
+
+Host IP: 173.195.146.245
+
+### Guests
+
+- git.sr.ht: 173.195.146.142, 2604:bf00:710:0:5054:ff:fe36:ebc6
+- legacy.sr.ht: 173.195.146.152, 2604:bf00:710:0:5054:ff:fe8f:b9de