commit 747994f68fa7eb21a68e2aa7e04085369b9c92ab
Author: Drew DeVault <sir@cmpwn.com>
Date:   2020-03-05 17:06:47 -0500

    Add operational documentation

 ops/backups.md | 121 +++
 1 file changed, 121 insertions(+), 0 deletions(-)
---
title: "SourceHut backups & redundancy"
---

The integrity of user data is of paramount importance to SourceHut. Most of our
data-critical systems are triple-redundant or better.

# Local redundancy

All of our data-critical systems use ZFS with at least 3 drives, which tolerates
a single drive failure without data loss. Our large storage systems use 5+
drives, which tolerate multiple drive failures.

Our standard hardware loadout calls for hard drives (or SSDs) sourced from a
variety of different vendors and drive models, to avoid using several hard
drives from the same production batch. This reduces the risk of cascading
failures during RAID recovery.

## Monitoring

We run an automatic scrub of all ZFS pools on the 1st of each month and forward
a report to the [ops mailing list][ops ml].

[ops ml]: https://lists.sr.ht/~sircmpwn/sr.ht-ops

## Areas for improvement

1. Automatic ZFS snapshots are only configured for off-site backup hosts. We
   should configure this on the primary as well. We also need monitoring to
   ensure that our snapshots are actually being taken.
2. Investigate something like [repospanner](https://github.com/repoSpanner/repoSpanner)
   to block git pushes until the data is known to be received and stored across
   multiple servers, which would make git backups real-time.

# Off-site backups

We have an off-site backup system in a separate datacenter (in a different city)
from our primary datacenter. We use borg backup to send backups to this server,
typically hourly.
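
The hourly cadence is typically driven by cron. A minimal sketch, assuming the
backup script is installed as `/usr/local/bin/git.sr.ht-backup` (a hypothetical
path, not the actual production location):

```
# Hypothetical crontab entry on the backed-up host; the path and the
# on-the-hour schedule are assumptions, not production configuration.
0 * * * *  /usr/local/bin/git.sr.ht-backup
```
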
The standard backup script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='CHANGE ME'

backup_start="$(date -u +'%s')"

echo "borg create"
borg create \
    ::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
    /var/lib/git \
    -e /var/lib/git/.ssh \
    -e /var/lib/git/.gnupg \
    -e /var/lib/git/.ash_history \
    -e /var/lib/git/.viminfo \
    -e '/var/lib/git/*/*/objects/incoming-*' \
    -e '*.keep' \
    --compression lz4 \
    --one-file-system \
    --info --stats "$@"

echo "borg prune"
borg prune \
    --keep-hourly 48 \
    --keep-daily 60 \
    --keep-weekly -1 \
    --info --stats

stats() {
    backup_end="$(date -u +'%s')"
    printf '# TYPE last_backup gauge\n'
    printf '# HELP last_backup Unix timestamp of last backup\n'
    printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
    printf '# TYPE backup_duration gauge\n'
    printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
    printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- 'https://push.metrics.sr.ht/metrics/job/CHANGE ME'
```

Our `check` script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='CHANGE ME'

check() {
	cat <<-EOF
	To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
	From: CHANGE ME backups <borg@git.sr.ht>
	Subject: CHANGE ME backups report $(date)

	EOF
	borg check --last 2 --info 2>&1
}

check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
```

## Monitoring

Each backup reports its timestamp and duration to our Prometheus Pushgateway
(see [monitoring](/ops/monitoring.md)). We have an alarm configured for when the
backup age exceeds 48 hours. The age of all borg backups may be [viewed
here][backup age] (in hours).

We also conduct a weekly `borg check` (on Sunday night, UTC) and forward the
results to the [ops mailing list][ops ml].
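
The 48-hour alarm reduces to a simple comparison against the `last_backup`
gauge pushed by the backup script. The same condition, sketched in shell (the
timestamp value is illustrative, not a real gauge reading):

```
#!/bin/sh -eu
# Evaluate the condition the Prometheus alert expresses:
# fire when (now - last_backup) exceeds 48 hours.
last_backup=1583445600          # example gauge value (Unix timestamp)
now="$(date -u +'%s')"
age_hours=$(( (now - last_backup) / 3600 ))
if [ "$age_hours" -gt 48 ]; then
    echo "ALERT: last git.sr.ht backup is ${age_hours}h old"
fi
```
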

[backup age]: https://metrics.sr.ht/graph?g0.range_input=1h&g0.expr=(time()%20-%20last_backup)%20%2F%2060%20%2F%2060&g0.tab=1

## Areas for improvement

1. Our PostgreSQL replication strategy is somewhat poor, due to several
   different approaches being experimented with on the same server, and lack of
   monitoring. This needs to be rethought. Related to
   [high availability](/ops/availability.md).
2. It would be nice if we could find a way to encapsulate our borg scripts in an
   installable Alpine package.
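
For item 2, an APKBUILD along these lines could package the scripts (APKBUILDs
are shell). Every name, version, and path here is hypothetical, a sketch of the
shape such a package might take rather than an actual package definition:

```
# Hypothetical APKBUILD sketch for shipping the borg scripts as a package.
pkgname=srht-backup-scripts
pkgver=0.1.0
pkgrel=0
pkgdesc="borg backup and check scripts for sr.ht hosts"
url="https://sr.ht"
arch="noarch"
license="BSD-3-Clause"
depends="borgbackup curl"
source="backup.sh check.sh"

package() {
	# install the scripts under assumed names in /usr/bin
	install -Dm755 "$srcdir"/backup.sh "$pkgdir"/usr/bin/srht-backup
	install -Dm755 "$srcdir"/check.sh "$pkgdir"/usr/bin/srht-backup-check
}
```
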