commit 747994f68fa7eb21a68e2aa7e04085369b9c92ab
Author: Drew DeVault <sir@cmpwn.com>
Date:   2020-03-05 17:06:47 -0500

    Add operational documentation

 ops/backups.md | 121 +++
 1 file changed, 121 insertions(+), 0 deletions(-)
---
title: "SourceHut backups & redundancy"
---

The integrity of user data is of paramount importance to SourceHut. Most of our
data-critical systems are triple-redundant or better.

# Local redundancy

All of our data-critical systems use ZFS with at least 3 drives, which tolerates
a single drive failure without data loss. Our large storage systems use 5+
drives, which tolerate multiple drive failures.

Our standard hardware loadout calls for hard drives (or SSDs) sourced from a
variety of different vendors and drive models, to avoid using several hard
drives from the same production batch. This reduces the risk of cascading
failures during RAID recovery.

## Monitoring

We run an automatic scrub of all ZFS pools on the 1st of each month and forward
a report to the [ops mailing list][ops ml].

[ops ml]: https://lists.sr.ht/~sircmpwn/sr.ht-ops

## Areas for improvement

1. Automatic ZFS snapshots are only configured for off-site backup hosts. We
   should configure this on the primary as well. We also need monitoring to
   ensure that our snapshots are actually being taken.
2. Investigate something like [repospanner](https://github.com/repoSpanner/repoSpanner)
   to block git pushes until the data is known to be received and stored across
   multiple servers, which would make git backups real-time.

# Off-site backups

We have an off-site backup system in a separate datacenter (in a different city)
from our primary datacenter. We use borg backup to send backups to this server,
typically hourly.
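
The hourly cadence is typically driven by cron. A minimal sketch, assuming the
backup script is installed as `/usr/local/bin/git.sr.ht-backup` (a hypothetical
path, not the actual production location):

```
# Hypothetical crontab entry on the backed-up host; the path and the
# on-the-hour schedule are assumptions, not production configuration.
0 * * * *  /usr/local/bin/git.sr.ht-backup
```
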
The standard backup script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='CHANGE ME'

backup_start="$(date -u +'%s')"

echo "borg create"
borg create \
    ::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
    /var/lib/git \
    -e /var/lib/git/.ssh \
    -e /var/lib/git/.gnupg \
    -e /var/lib/git/.ash_history \
    -e /var/lib/git/.viminfo \
    -e '/var/lib/git/*/*/objects/incoming-*' \
    -e '*.keep' \
    --compression lz4 \
    --one-file-system \
    --info --stats "$@"

echo "borg prune"
borg prune \
    --keep-hourly 48 \
    --keep-daily 60 \
    --keep-weekly -1 \
    --info --stats

stats() {
    backup_end="$(date -u +'%s')"
    printf '# TYPE last_backup gauge\n'
    printf '# HELP last_backup Unix timestamp of last backup\n'
    printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
    printf '# TYPE backup_duration gauge\n'
    printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
    printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- 'https://push.metrics.sr.ht/metrics/job/CHANGE ME'
```

Our `check` script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='CHANGE ME'

check() {
	cat <<-EOF
	To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
	From: CHANGE ME backups <borg@git.sr.ht>
	Subject: CHANGE ME backups report $(date)

	EOF
	borg check --last 2 --info 2>&1
}

check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
```

## Monitoring

Each backup reports its timestamp and duration to our Prometheus Pushgateway
(see [monitoring](/ops/monitoring.md)). We have an alarm configured for when the
backup age exceeds 48 hours. The age of all borg backups may be [viewed
here][backup age] (in hours).

We also conduct a weekly `borg check` (on Sunday night, UTC) and forward the
results to the [ops mailing list][ops ml].
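
The 48-hour alarm reduces to a simple comparison against the `last_backup`
gauge pushed by the backup script. The same condition, sketched in shell (the
timestamp value is illustrative, not a real gauge reading):

```
#!/bin/sh -eu
# Evaluate the condition the Prometheus alert expresses:
# fire when (now - last_backup) exceeds 48 hours.
last_backup=1583445600          # example gauge value (Unix timestamp)
now="$(date -u +'%s')"
age_hours=$(( (now - last_backup) / 3600 ))
if [ "$age_hours" -gt 48 ]; then
    echo "ALERT: last git.sr.ht backup is ${age_hours}h old"
fi
```
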

[backup age]: https://metrics.sr.ht/graph?g0.range_input=1h&g0.expr=(time()%20-%20last_backup)%20%2F%2060%20%2F%2060&g0.tab=1

## Areas for improvement

1. Our PostgreSQL replication strategy is somewhat poor, due to several
   different approaches being experimented with on the same server, and lack of
   monitoring. This needs to be rethought. Related to
   [high availability](/ops/availability.md).
2. It would be nice if we could find a way to encapsulate our borg scripts in an
   installable Alpine package.
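
For item 2, an APKBUILD along these lines could package the scripts (APKBUILDs
are shell). Every name, version, and path here is hypothetical, a sketch of the
shape such a package might take rather than an actual package definition:

```
# Hypothetical APKBUILD sketch for shipping the borg scripts as a package.
pkgname=srht-backup-scripts
pkgver=0.1.0
pkgrel=0
pkgdesc="borg backup and check scripts for sr.ht hosts"
url="https://sr.ht"
arch="noarch"
license="BSD-3-Clause"
depends="borgbackup curl"
source="backup.sh check.sh"

package() {
	# install the scripts under assumed names in /usr/bin
	install -Dm755 "$srcdir"/backup.sh "$pkgdir"/usr/bin/srht-backup
	install -Dm755 "$srcdir"/check.sh "$pkgdir"/usr/bin/srht-backup-check
}
```
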