---
title: "SourceHut backups & redundancy"
---

The integrity of user data is of paramount importance to SourceHut. Most of our data-critical systems are triple-redundant or better.

# Local redundancy

All of our data-critical systems use ZFS with at least 3 drives, which allows up to one drive to fail without data loss. Our large storage systems use 5+ drives, which allows several drives to fail.

Our standard hardware loadout calls for hard drives (or SSDs) sourced from a variety of different vendors and drive models, to avoid using several hard drives from the same production batch. This reduces the risk of cascading failures during RAID recovery.

## Monitoring

We do an automatic scrub of all ZFS pools on the 1st of each month and forward a report to the [ops mailing list][ops ml].

[ops ml]: https://lists.sr.ht/~sircmpwn/sr.ht-ops

## Areas for improvement

1. Automatic ZFS snapshots are only configured for off-site backup hosts. We should configure this on the primary as well. We also need monitoring to ensure that our snapshots are actually being taken.
2. Investigate something like [repospanner](https://github.com/repoSpanner/repoSpanner) to block git pushes until the data is known to be received and stored across multiple servers. This would make git backups real-time.

# Off-site backups

We have an off-site backup system in a separate datacenter (in a different city) from our primary datacenter. We use borg backup to send backups to this server, typically hourly.
The standard backup script looks something like this, but is tweaked for each service:

```
#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

backup_start="$(date -u +'%s')"

echo "borg create"
borg create \
	::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
	/var/lib/git \
	-e /var/lib/git/.ssh \
	-e /var/lib/git/.gnupg \
	-e /var/lib/git/.ash_history \
	-e /var/lib/git/.viminfo \
	-e '/var/lib/git/*/*/objects/incoming-*' \
	-e '*.keep' \
	--compression lz4 \
	--one-file-system \
	--info --stats "$@"

echo "borg prune"
borg prune \
	--keep-hourly 48 \
	--keep-daily 60 \
	--keep-weekly -1 \
	--info --stats

stats() {
	backup_end="$(date -u +'%s')"
	printf '# TYPE last_backup gauge\n'
	printf '# HELP last_backup Unix timestamp of last backup\n'
	printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
	printf '# TYPE backup_duration gauge\n'
	printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
	printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/git.sr.ht
```

Our `check` script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

check() {
	cat <<-EOF
	To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
	From: git.sr.ht backups
	Subject: git.sr.ht backups report $(date)

	EOF
	borg check --last 2 --info 2>&1
}

check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
```

## Monitoring

Each backup reports its timestamp and duration to our Prometheus Pushgateway (see [monitoring](/ops/monitoring.md)). We have an alarm configured for when the backup age exceeds 48 hours. The age of all borg backups may be [viewed here][backup age] (in hours).

We also conduct a weekly `borg check` (on Sunday night, UTC) and forward the results to the [ops mailing list][ops ml].

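
The backup-age graph uses the expression `(time() - last_backup) / 60 / 60`, and the alarm fires once that age passes 48 hours. As a rough illustration only, the same arithmetic can be sketched in shell. The function names `backup_age_hours` and `backup_is_stale` are hypothetical and not part of our scripts; the real alarm is evaluated by the monitoring stack, not by a shell script.

```sh
# Hypothetical helper: print the age of a backup in whole hours.
# $1 = Unix timestamp of the last backup (the last_backup metric)
# $2 = current Unix timestamp (optional, defaults to now)
backup_age_hours() {
	last_backup="$1"
	now="${2:-$(date -u +'%s')}"
	echo "$(( (now - last_backup) / 60 / 60 ))"
}

# Hypothetical helper: succeed (exit 0) when the backup is older than
# the threshold. Compares in seconds so partial hours are not rounded away.
# $3 = threshold in hours (optional, defaults to 48 as in the alarm above)
backup_is_stale() {
	last_backup="$1"
	now="${2:-$(date -u +'%s')}"
	max_age_hours="${3:-48}"
	[ "$((now - last_backup))" -gt "$((max_age_hours * 60 * 60))" ]
}
```

For example, `backup_is_stale "$last_backup" && echo "backup too old"` would print a warning only when the timestamp in `$last_backup` is more than 48 hours in the past.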

[backup age]: https://metrics.sr.ht/graph?g0.range_input=1h&g0.expr=(time()%20-%20last_backup)%20%2F%2060%20%2F%2060&g0.tab=1

## Areas for improvement

1. Our PostgreSQL replication strategy is somewhat poor, due to several different approaches being experimented with on the same server, and lack of monitoring. This needs to be rethought. Related to [high availability](/ops/availability.md).
2. It would be nice if we could find a way to encapsulate our borg scripts in an installable Alpine package.