| Field | Value | Date |
|---|---|---|
| author | Drew DeVault <sir@cmpwn.com> | 2020-03-05 17:06:47 -0500 |
| committer | Drew DeVault <sir@cmpwn.com> | 2020-03-05 17:06:47 -0500 |
| commit | 747994f68fa7eb21a68e2aa7e04085369b9c92ab (patch) | |
| tree | ef93615cd60cd126714c150833b0fafbac933af7 /ops | |
| parent | 8a8161fd6aec91dae49b57bb3337c0f5dafb1590 (diff) | |
Add operational documentation
Diffstat (limited to 'ops')
| Mode | File | Lines |
|---|---|---|
| -rw-r--r-- | ops/availability.md | 68 |
| -rw-r--r-- | ops/backups.md | 121 |
| -rw-r--r-- | ops/emergency-planning.md | 21 |
| -rw-r--r-- | ops/index.md | 48 |
| -rw-r--r-- | ops/monitoring.md | 48 |
| -rw-r--r-- | ops/provisioning.md | 48 |
| -rw-r--r-- | ops/topology.md | 130 |
7 files changed, 484 insertions, 0 deletions
diff --git a/ops/availability.md b/ops/availability.md
new file mode 100644
index 0000000..324df99
--- /dev/null
+++ b/ops/availability.md
@@ -0,0 +1,68 @@
---
title: High availability
---

High availability has not been a priority for SourceHut during early alpha
development, but it is becoming more important heading into the beta. This page
is more about our plans than about our current implementation.

The priorities are, in order:

1. Highly available web services
2. Highly available database
3. Highly available mail system

# Web services

The web services are already mostly designed to avoid keeping local state,
with this eventual goal in mind. We should investigate load balancing with
haproxy(?) so that we can bring nodes into and out of service without downtime.
We should also make this the norm for deployments.

## Special considerations for deployments

- SQL migrations should be designed so that both the old and new systems work
  correctly on both the old and new schemas. This will often require splitting
  migrations over several releases.

## Special considerations for git.sr.ht, hg.sr.ht

We need to use something like [repospanner](https://github.com/repoSpanner/repoSpanner)
to distribute git pushes among several nodes.

Can we do something similar for Mercurial?

Related to [backups](/ops/backups.md).

## Special considerations for builds.sr.ht

The builds.sr.ht worker needs to be updated so that we can reboot it without
terminating anyone's jobs. One idea would be to move the job supervisor into a
separate process. One issue with this would be having the new work scheduler
adopt existing job processes after a restart, and avoiding taking on new work
from Celery until the resources are freed up.

A possible workaround is to stop accepting new jobs, let the running jobs drain
while other build hosts pick up the slack, then reboot and accept new jobs once
more.

# Database

????

[pgbouncer](https://www.pgbouncer.org/) will probably be of some use. I suspect
that we will find it difficult to reach zero-downtime failovers. Ideally, we
would be able to do PostgreSQL major version upgrades with minimal downtime.

Care will need to be taken to avoid silently dropping writes.

We need to set up an experimental test network for trying out these ideas, and
make a plan.

# Highly available mail system

This should be fairly trivial. We need to move the work distribution Redis
server from the mail host to the lists host (duh), and then just set up multiple
MX records. Zero-downtime maintenance can be accomplished by removing an MX
record, letting the mail flush, and then doing whatever maintenance is
necessary.

diff --git a/ops/backups.md b/ops/backups.md
new file mode 100644
index 0000000..74de1ec
--- /dev/null
+++ b/ops/backups.md
@@ -0,0 +1,121 @@
---
title: "SourceHut backups & redundancy"
---

The integrity of user data is of paramount importance to SourceHut. Most of our
data-critical systems are triple-redundant or better.

# Local redundancy

All of our data-critical systems use ZFS with at least 3 drives, which allows up
to one drive to fail without data loss. Our large storage systems use 5+ drives,
which allows several drives to fail.

Our standard hardware loadout calls for hard drives (or SSDs) sourced from a
variety of different vendors and drive models, to avoid using several hard
drives from the same production batch. This reduces the risk of cascading
failures during RAID recovery.
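For illustration, pools with this kind of redundancy could be created along the
lines of the sketch below. This is not our actual configuration: the pool name
and device names are hypothetical, and the exact layout varies per host.

```
# Hypothetical sketch of the redundancy described above; the pool name and
# device names are illustrative only.

# Smaller systems: three drives in raidz1, which survives one drive failure
zpool create -o ashift=12 tank raidz1 /dev/sda /dev/sdb /dev/sdc

# Larger systems: five or more drives in raidz2, which survives two failures
zpool create -o ashift=12 tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Check pool health and layout
zpool status tank
```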
## Monitoring

We do an automatic scrub of all ZFS pools on the 1st of each month and forward a
report to the [ops mailing list][ops ml].

[ops ml]: https://lists.sr.ht/~sircmpwn/sr.ht-ops

## Areas for improvement

1. Automatic ZFS snapshots are only configured for off-site backup hosts. We
   should configure this on the primary as well. We also need monitoring to
   ensure that our snapshots are actually being taken.
2. Investigate something like [repospanner](https://github.com/repoSpanner/repoSpanner)
   to block git pushes until the data is known to be received and stored across
   multiple servers; this would make git backups real-time.

# Off-site backups

We have an off-site backup system in a separate datacenter (in a different city)
from our primary datacenter. We use borg to send backups to this server,
typically hourly. The standard backup script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='CHANGE ME'

backup_start="$(date -u +'%s')"

echo "borg create"
# Archive the git repository storage, excluding transient and dotfile paths
borg create \
    ::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
    /var/lib/git \
    -e /var/lib/git/.ssh \
    -e /var/lib/git/.gnupg \
    -e /var/lib/git/.ash_history \
    -e /var/lib/git/.viminfo \
    -e '/var/lib/git/*/*/objects/incoming-*' \
    -e '*.keep' \
    --compression lz4 \
    --one-file-system \
    --info --stats "$@"

echo "borg prune"
# Expire old archives according to the retention flags below
borg prune \
    --keep-hourly 48 \
    --keep-daily 60 \
    --keep-weekly -1 \
    --info --stats

# Report the backup timestamp and duration to the Prometheus pushgateway
stats() {
    backup_end="$(date -u +'%s')"
    printf '# TYPE last_backup gauge\n'
    printf '# HELP last_backup Unix timestamp of last backup\n'
    printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
    printf '# TYPE backup_duration gauge\n'
    printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
    printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/CHANGE ME
```

Our `check` script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='CHANGE ME'

# Verify the two most recent archives and format the output as an email
check() {
	cat <<-EOF
	To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
	From: CHANGE ME backups <borg@git.sr.ht>
	Subject: CHANGE ME backups report $(date)

	EOF
	borg check --last 2 --info 2>&1
}

check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
```

## Monitoring

Each backup reports its timestamp and duration to our Prometheus Pushgateway
(see [monitoring](/ops/monitoring.md)). We have an alarm configured for when the
backup age exceeds 48 hours. The age of all borg backups may be [viewed
here][backup age] (in hours).

We also conduct a weekly `borg check` (on Sunday night, UTC) and forward the
results to the [ops mailing list][ops ml].

[backup age]: https://metrics.sr.ht/graph?g0.range_input=1h&g0.expr=(time()%20-%20last_backup)%20%2F%2060%20%2F%2060&g0.tab=1

## Areas for improvement

1. Our PostgreSQL replication strategy is somewhat poor, due to several
   different approaches having been experimented with on the same server, and a
   lack of monitoring. This needs to be rethought. Related to
   [high availability](/ops/availability.md).
2. It would be nice if we could find a way to encapsulate our borg scripts in an
   installable Alpine package (see the sketch below).
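Regarding the second item, an Alpine package wrapping these scripts could be as
small as the following APKBUILD sketch. The package name, dependency names, and
install paths are assumptions for illustration, not an existing package.

```
# Hypothetical APKBUILD sketch for packaging the borg scripts above.
# Package name, dependencies, and paths are illustrative only.
pkgname=srht-backups
pkgver=0.1.0
pkgrel=0
pkgdesc="borg backup and check scripts for sr.ht hosts"
url="https://sr.ht"
arch="noarch"
license="MIT"
depends="borgbackup curl"
source="backup.sh check.sh"
options="!check"  # plain shell scripts, no test suite

package() {
	install -Dm755 "$srcdir"/backup.sh "$pkgdir"/usr/bin/srht-backup
	install -Dm755 "$srcdir"/check.sh "$pkgdir"/usr/bin/srht-backup-check
}

# Run 'abuild checksum' to generate sha512sums before building.
```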
diff --git a/ops/emergency-planning.md b/ops/emergency-planning.md
new file mode 100644
index 0000000..869273a
--- /dev/null
+++ b/ops/emergency-planning.md
@@ -0,0 +1,21 @@
---
title: Emergency planning
---

On several occasions, outages have been simulated and the motions carried out
for resolving them. This is useful for:

1. Testing that our systems can tolerate or recover from such failures
2. Familiarizing operators with the resolution procedures

This has been conducted informally so far. We should add some more structure to
it and plan these events regularly.

Ideas:

- Simulate disk failures (yank out a hard drive!)
- Simulate outages for redundant services
  (see [availability](/ops/availability.md))
- Kill Celery workers and see how they cope with catching up again
- Restore systems from backup, then put the restored system into normal service
  and tear down the original

diff --git a/ops/index.md b/ops/index.md
new file mode 100644
index 0000000..e0814b0
--- /dev/null
+++ b/ops/index.md
@@ -0,0 +1,48 @@
---
title: SourceHut operational manual
---

This subset of the manual documents our approach to the operations and
maintenance of the hosted service, sr.ht. You may find this useful for running
your own hosted sr.ht service, or for evaluating whether our practices &
policies meet your requirements for availability or robustness. You might also
just find this stuff interesting, as SourceHut is one of the few largeish
services which is not hosted in The Cloud™.

- [Backups & redundancy](/ops/backups.md)
- [Emergency planning](/ops/emergency-planning.md)
- [High availability](/ops/availability.md)
- [Monitoring & alarms](/ops/monitoring.md)
- [Network topology](/ops/topology.md)
- [Provisioning & allocation](/ops/provisioning.md)

# Operational Resources

## Status page

[status.sr.ht](https://status.sr.ht) is hosted on third-party infrastructure and
is used to communicate about upcoming planned outages and to provide updates
during incident resolution. Planned outages are also posted to
[sr.ht-announce](https://lists.sr.ht/~sircmpwn/sr.ht-announce) in advance.

The status page is updated by a human being, who is probably busy fixing the
problem. You may want to check the next resource as well:

## Monitoring & alarms

Our Prometheus instance at [metrics.sr.ht](https://metrics.sr.ht) is available
to the public for querying our monitoring systems and viewing the state of
various alarms.

## Mailing list

The [sr.ht-ops](https://lists.sr.ht/~sircmpwn/sr.ht-ops) mailing list receives
automated reports from our services, including alarm notifications of
"important" or "urgent" severity and status reports on backups and other
systems.

## IRC channel

The `#sr.ht.ops` IRC channel on irc.freenode.net is used for triage and
coordination during outages, and carries a real-time feed of alarms raised by
our monitoring system.

diff --git a/ops/monitoring.md b/ops/monitoring.md
new file mode 100644
index 0000000..833a271
--- /dev/null
+++ b/ops/monitoring.md
@@ -0,0 +1,48 @@
---
title: "Monitoring & alarms"
---

We monitor everything with Prometheus, and configure alarms with alertmanager.

# Public metrics

Our Prometheus instance is publicly available at
[metrics.sr.ht](https://metrics.sr.ht).
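Anything visible in the web UI can also be queried from a shell via the
standard Prometheus HTTP API, which is handy for one-off checks and scripts. A
small example, reusing the `last_backup` gauge pushed by the backup script in
[backups](/ops/backups.md):

```
# Ask the public Prometheus instance for the age of each borg backup, in hours
curl -s 'https://metrics.sr.ht/api/v1/query' \
    --data-urlencode 'query=(time() - last_backup) / 60 / 60'
```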
## Areas for improvement

1. We should make dashboards. It would be pretty to look at and could be a
   useful tool for root cause analysis. Note that some users who have their own
   Grafana instance have pointed it at our public Prometheus data and made some
   simple dashboards; I would be open to having community ownership over this.

# Pushgateway

A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
connections from [our subnet](/ops/topology.md).

# Aggregation gateway

[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
from [our subnet](/ops/topology.md).

# Alertmanager

We use alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
sinks.

- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and the IRC channel
- **urgent** alerts page Drew's phone and are sent to the mailing list and the
  IRC channel

Some security-related alarms are sent directly to Drew and are not made public.

# Areas for improvement

1. It would be nice to have centralized logging. There is sensitive information
   in some of our logs, so this probably can't be made public.
2. Several of our physical hosts are not being monitored. This will be resolved
   during [planned maintenance](https://status.sr.ht/issues/2020-03-11-planned-outage/)
   on March 11th, 2020.

diff --git a/ops/provisioning.md b/ops/provisioning.md
new file mode 100644
index 0000000..0bbfba0
--- /dev/null
+++ b/ops/provisioning.md
@@ -0,0 +1,48 @@
---
title: "Server provisioning & allocation"
---

Standards for provisioning of VMs and physical hosts.

# Alpine Linux

Our standard loadout uses Alpine Linux for all hosts and guests.

- **TODO**: ns1 and ns2 are on Debian; they need to be reprovisioned (add an ns3
  while we're at it?)

# Physical hosts

## VM hosts

Our current VM hosts won't scale much further. We need to figure out a better
hardware loadout for these going forward.

## High-performance VM hosts

For performance-critical services (presently only git.sr.ht demands this), we do
have a standard loadout:

- AMD EPYC 7402 (24 cores, 48 threads)
- Micron 36ASF2G72PZ-2G6F1 RAM (4x, 64G total)
- 1x NVMe for host system, ext4; WD Black 1T
- 4x SSD on SATA, direct passthrough to guest

This is spec'd for CPU- and I/O-intensive workloads.

## Build hosts

builds.sr.ht uses dedicated build runners. Our current standard is:

- AS-1013S-MTR SuperMicro barebones
- AMD EPYC 7281 (16 cores, 32 threads)
- M393A4K40CB2-CTD RAM (4x, 128G total)
- 1x NVMe for root, ext4; WD Black 1T
- 3x HDD on SATA for /var, ZFS; 1T each, various vendors/models

This configuration supports up to 16 parallel build slots.

# Virtual machines

There is no standard loadout; tune the specifications to the task at hand. As a
general rule, 1 VM == 1 service.

diff --git a/ops/topology.md b/ops/topology.md
new file mode 100644
index 0000000..95b53b8
--- /dev/null
+++ b/ops/topology.md
@@ -0,0 +1,130 @@
---
title: SourceHut network topology
---

We don't use a NAT (yet). This is a list of where our IP addresses are
allocated, and of which host each VM runs on.

# Subnets

- 173.195.146.128/25 (255.255.255.128)
- 2604:BF00:710::/64

**Gateways**

- 173.195.146.129
- 2604:BF00:710::1

# Allocations

**Next IPs**

Virtual machines (grows upwards):

- 173.195.146.137 (reclaimed from mail.sr.ht v1)
- 173.195.146.140 (reclaimed from metrics.sr.ht v1)
- 173.195.146.154

Hosts (grows downwards):

- 173.195.146.243
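As a rough illustration of how these allocations get used, bringing up a new
guest on the next free VM address might look like the sketch below. The
interface name and the choice of address are hypothetical; persistent
configuration belongs in the guest's network configuration, not ad-hoc
commands.

```
# Hypothetical sketch: assign the next free VM address from the list above
# and point the default route at the subnet gateway. Interface name is
# illustrative.
ip link set eth0 up
ip addr add 173.195.146.154/25 dev eth0
ip route add default via 173.195.146.129
```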
## cirno1.sr.ht

Purpose: build slave

Host IP: 173.195.146.249, 2604:bf00:710:0:ae1f:6bff:fead:55a

## cirno2.sr.ht

Purpose: build slave

Host IP: 173.195.146.244, 2604:bf00:710:0:ae1f:6bff:fe79:a33e

## yukari.sr.ht

Purpose: file storage

Host IP: 173.195.146.250, 2604:bf00:710:0:230:48ff:fedf:3552

## remilia.sr.ht

Purpose: SQL primary

Host IP: 173.195.146.251, 2604:bf00:710:0:230:48ff:fe7d:599a

## alice1.sr.ht

Purpose: VM host

Host IP: 208.88.54.60 (**old subnet**), 2604:bf00:710:0:225:90ff:fe09:440

### Guests

- packages.knightos.org: 173.195.146.130
- ns2.sr.ht: 173.195.146.138, 2604:bf00:710:0:5054:ff:fe77:d4b4
- man.sr.ht: 173.195.146.146, 2604:bf00:710:0:5054:ff:feb7:8f16
- static1.sr.ht: 173.195.146.149, 2604:bf00:710:0:5054:ff:fe0c:b76e
- metrics.sr.ht: 173.195.146.153, 2604:bf00:710:0:5054:ff:fee3:409

## alice2.sr.ht

Purpose: VM host

Host IP: 208.88.54.61 (**old subnet**), 2604:bf00:710:0:230:48ff:fedc:ab06

### Guests

- drewdevault.com: 173.195.146.133
- todo.sr.ht: 173.195.146.145, 2604:bf00:710:0:5054:ff:fea1:a941
- dispatch.sr.ht: 173.195.146.147, 2604:bf00:710:0:5054:ff:fe44:efcb
- hg.sr.ht: 173.195.146.139, 2604:bf00:710:0:5054:ff:fe25:1aa6

## alice3.sr.ht

Purpose: VM host

Host IP: 208.88.54.62 (**old subnet**), 2604:bf00:710:0:230:48ff:fedb:c474

### Guests

- cmpwn.com: 173.195.146.132
- ns1.sr.ht: 173.195.146.135, 2604:bf00:710:0:5054:ff:fe65:e92f
- irc.sircmpwn.com: 173.195.146.141
- meta.sr.ht: 173.195.146.143, 2604:bf00:710:0:5054:ff:fe97:74b3
- lists.sr.ht: 173.195.146.144, 2604:bf00:710:0:5054:ff:fec4:6bfb
- builds.sr.ht: 173.195.146.148, 2604:bf00:710:0:5054:ff:fe50:b8e1
- paste.sr.ht: 173.195.146.150, 2604:bf00:710:0:5054:ff:fec4:3586
- mail-b.sr.ht: 173.195.146.151, 2604:bf00:710:0:5054:ff:fee5:c082

## alice4.sr.ht

Purpose: VM host (**out of service**)

Host IP: 173.195.146.248

**Note**: this machine has severe issues and needs to be diagnosed. Don't host
anything here.

## patchouli.sr.ht

Purpose: blob storage

Host IP: 173.195.146.247, 2604:bf00:710:0:ae1f:6bff:fec2:5502

## flandre.sr.ht

Purpose: ppc64le build host

Host IP: 173.195.146.246, 2604:bf00:710:0:2e09:4dff:fe00:992

## tenshi.sr.ht

Purpose: high-performance VM host

Host IP: 173.195.146.245

### Guests

- git.sr.ht: 173.195.146.142, 2604:bf00:710:0:5054:ff:fe36:ebc6
- legacy.sr.ht: 173.195.146.152, 2604:bf00:710:0:5054:ff:fe8f:b9de
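To cross-check the allocation lists on this page against what is actually live,
a quick ping sweep of the subnet is usually enough. A rough sketch, assuming it
is run from a host inside the subnet (BusyBox ping flags):

```
#!/bin/sh
# Rough sketch: ping every usable address in 173.195.146.128/25 and print
# the ones that respond, to compare against the allocations above.
for i in $(seq 129 254); do
    if ping -c 1 -W 1 "173.195.146.$i" > /dev/null 2>&1; then
        echo "173.195.146.$i is in use"
    fi
done
```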