author     Drew DeVault <sir@cmpwn.com>  2020-03-05 17:06:47 -0500
committer  Drew DeVault <sir@cmpwn.com>  2020-03-05 17:06:47 -0500
commit     747994f68fa7eb21a68e2aa7e04085369b9c92ab (patch)
tree       ef93615cd60cd126714c150833b0fafbac933af7 /ops/backups.md
parent     8a8161fd6aec91dae49b57bb3337c0f5dafb1590 (diff)
download   sr.ht-docs-747994f68fa7eb21a68e2aa7e04085369b9c92ab.tar.gz
Add operational documentation
Diffstat (limited to 'ops/backups.md')
-rw-r--r--  ops/backups.md  121
1 files changed, 121 insertions, 0 deletions
diff --git a/ops/backups.md b/ops/backups.md
new file mode 100644
index 0000000..74de1ec
--- /dev/null
+++ b/ops/backups.md
@@ -0,0 +1,121 @@
+---
+title: "SourceHut backups & redundancy"
+---
+
+The integrity of user data is of paramount importance to SourceHut. Most of our
+data-critical systems are triple-redundant or better.
+
+# Local redundancy
+
+All of our data-critical systems use ZFS with at least 3 drives, which can
+tolerate the failure of one drive without data loss. Our large storage systems
+use 5+ drives and can tolerate the failure of several drives.
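+
+A minimal sketch of the two pool layouts, assuming raidz vdevs (device names
+and the exact topology are hypothetical):
+
+```
+# 3-drive raidz1: survives the loss of any one drive
+zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc
+
+# 6-drive raidz2: survives the loss of any two drives
+zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
+```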
+
+Our standard hardware loadout calls for hard drives (or SSDs) sourced from a
+variety of vendors and drive models, to avoid using several drives from the
+same production batch. This reduces the risk of cascading failures during RAID
+recovery.
+
+## Monitoring
+
+We do an automatic scrub of all ZFS pools on the 1st of each month and forward a
+report to the [ops mailing list][ops ml].
+
+[ops ml]: https://lists.sr.ht/~sircmpwn/sr.ht-ops
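+
+A sketch of how this can be wired up, assuming a cron-driven script (pool
+discovery, the sender address, and the `-w` flag, which needs a recent OpenZFS,
+are assumptions; our actual job may differ):
+
+```
+#!/bin/sh -eu
+# Hypothetical monthly scrub job: scrub every imported pool, then mail the
+# resulting pool status to the ops mailing list.
+for pool in $(zpool list -H -o name); do
+    zpool scrub -w "$pool"  # -w: wait for the scrub to complete
+done
+
+report() {
+    printf 'To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>\n'
+    printf 'From: zfs scrub <scrub@git.sr.ht>\n'
+    printf 'Subject: zfs scrub report %s\n\n' "$(date)"
+    zpool status
+}
+
+report | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
+```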
+
+## Areas for improvement
+
+1. Automatic ZFS snapshots are only configured for off-site backup hosts. We
+   should configure them on the primary as well (a sketch follows this list).
+   We also need monitoring to ensure that our snapshots are actually being
+   taken.
+2. Investigate something like [repospanner](https://github.com/repoSpanner/repoSpanner)
+   to block git pushes until the data is known to be received and stored across
+   multiple servers; this would make git backups real-time.
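+
+A sketch of the kind of snapshot job item 1 calls for, assuming an hourly cron
+script on the primary (dataset name, metric name, and pushgateway job are
+hypothetical):
+
+```
+#!/bin/sh -eu
+# Take a recursive, timestamped snapshot of the pool
+zfs snapshot -r "tank@$(date -u +'%Y-%m-%d_%H:%M')"
+
+# Report the snapshot time to the pushgateway so its age can be alarmed on,
+# mirroring the borg backup script below
+printf 'last_snapshot{instance="git.sr.ht"} %d\n' "$(date -u +'%s')" \
+    | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/zfs-snapshot
+```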
+
+# Off-site backups
+
+We have an off-site backup system in a separate datacenter (in a different city)
+from our primary datacenter. We use borg backup to send backups to this server,
+typically hourly. The standard backup script is:
+
+```
+#!/bin/sh -eu
+export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
+export BORG_PASSPHRASE='CHANGE ME'
+
+backup_start="$(date -u +'%s')"
+
+echo "borg create"
+borg create \
+ ::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
+ /var/lib/git \
+ -e /var/lib/git/.ssh \
+ -e /var/lib/git/.gnupg \
+ -e /var/lib/git/.ash_history \
+ -e /var/lib/git/.viminfo \
+ -e '/var/lib/git/*/*/objects/incoming-*' \
+ -e '*.keep' \
+ --compression lz4 \
+ --one-file-system \
+ --info --stats "$@"
+
+echo "borg prune"
+borg prune \
+ --keep-hourly 48 \
+ --keep-daily 60 \
+ --keep-weekly -1 \
+ --info --stats
+
+stats() {
+ backup_end="$(date -u +'%s')"
+ printf '# TYPE last_backup gauge\n'
+ printf '# HELP last_backup Unix timestamp of last backup\n'
+ printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
+ printf '# TYPE backup_duration gauge\n'
+ printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
+ printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
+}
+
+stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/CHANGE ME
+```
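+
+The script is run from cron, typically hourly; a sketch with a hypothetical
+install path:
+
+```
+# root's crontab on git.sr.ht: run the borg backup at the top of every hour
+0 * * * * /usr/local/bin/borg-backup-git.sr.ht
+```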
+
+Our `check` script is:
+
+```
+#!/bin/sh -eu
+export BORG_REPO='ssh://CHANGE ME@konpaku.sr.ht/~/backup'
+export BORG_PASSPHRASE='CHANGE ME'
+
+check() {
+ cat <<-EOF
+ To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
+ From: CHANGE ME backups <borg@git.sr.ht>
+ Subject: CHANGE ME backups report $(date)
+
+ EOF
+ borg check --last 2 --info 2>&1
+}
+
+check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
+```
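+
+The check runs weekly from cron (Sunday night, UTC, per the monitoring section
+below); a sketch with a hypothetical install path:
+
+```
+# root's crontab: run the borg check late Sunday night (UTC)
+0 23 * * 0 /usr/local/bin/borg-check-git.sr.ht
+```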
+
+## Monitoring
+
+Each backup reports its timestamp and duration to our Prometheus Pushgateway
+(see [monitoring](/ops/monitoring.md)). An alarm is configured to fire when the
+age of the most recent backup exceeds 48 hours. The age of all borg backups may
+be [viewed here][backup age] (in hours).
+
+We also conduct a weekly `borg check` (on Sunday night, UTC) and forward the
+results to the [ops mailing list][ops ml].
+
+[backup age]: https://metrics.sr.ht/graph?g0.range_input=1h&g0.expr=(time()%20-%20last_backup)%20%2F%2060%20%2F%2060&g0.tab=1
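+
+The same expression can also be queried ad hoc against the standard Prometheus
+HTTP API (assuming it is exposed at this host):
+
+```
+# Current age of each borg backup, in hours
+curl -s 'https://metrics.sr.ht/api/v1/query' \
+    --data-urlencode 'query=(time() - last_backup) / 60 / 60'
+```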
+
+## Areas for improvement
+
+1. Our PostgreSQL replication strategy is somewhat poor: several different
+   approaches have been experimented with on the same server, and monitoring is
+   lacking. This needs to be rethought. Related to
+   [high availability](/ops/availability.md).
+2. It would be nice if we could find a way to encapsulate our borg scripts in an
+ installable Alpine package.