---
title: "SourceHut backups & redundancy"
---

The integrity of user data is of paramount importance to SourceHut. Most of our
data-critical systems are triple-redundant or better.

# Local redundancy

All of our data-critical systems use ZFS with at least three drives, which
allows one drive to fail without data loss. Our large storage systems use five
or more drives, which allows several drives to fail without data loss.
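
For illustration, a pool layout matching this description might look something
like the following; the exact vdev configuration and device names are
assumptions, not a record of our actual pools:

```
# Hypothetical layouts only. A 3-drive raidz1 pool tolerates one drive failure:
zpool create data raidz1 /dev/sda /dev/sdb /dev/sdc

# A larger pool, assuming raidz2, tolerates two simultaneous drive failures:
zpool create storage raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
```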

Our standard hardware loadout calls for hard drives (or SSDs) sourced from a
variety of vendors and drive models, to avoid using several drives from the
same production batch. This reduces the risk of cascading failures during RAID
recovery.

## Monitoring

We do an automatic scrub of all ZFS pools on the 1st of each month and forward a
report to the [ops mailing list][ops ml].

[ops ml]: https://lists.sr.ht/~sircmpwn/sr.ht-ops
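
The scrub and reporting scripts themselves are not reproduced here, but a
minimal sketch of the approach (the pool name and From address are
illustrative), using the same sendmail pattern as our borg check script below,
would be:

```
#!/bin/sh -eu
# Hypothetical sketch, run from cron on the 1st of each month
zpool scrub -w data # -w waits for the scrub to finish (OpenZFS 0.8+)
{
	printf 'To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>\n'
	printf 'From: zfs scrub <root@example.sr.ht>\n'
	printf 'Subject: monthly scrub report %s\n\n' "$(date)"
	zpool status -v data
} | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
```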

## Areas for improvement

1. Automatic ZFS snapshots are only configured for off-site backup hosts. We
   should configure this on the primary as well (see the sketch after this
   list). We also need monitoring to ensure that our snapshots are actually
   being taken.
2. Investigate something like [repospanner](https://github.com/repoSpanner/repoSpanner)
   to block git pushes until the data is known to have been received and stored
   across multiple servers, which would make git backups effectively real-time.
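
A sketch of what item 1 might look like with a simple cron-driven approach
follows; the pool name and metric name are hypothetical, and a dedicated tool
such as sanoid or zfs-auto-snapshot would be a reasonable alternative:

```
#!/bin/sh -eu
# Hypothetical hourly snapshot job for the primary; not our actual config
zfs snapshot -r "data@auto-$(date -u +'%Y-%m-%d_%H:%M')"

# Push a timestamp so we can alert if snapshots stop being taken, following
# the same pattern as the borg backup scripts below
printf '# TYPE last_snapshot gauge\nlast_snapshot{instance="git.sr.ht"} %d\n' \
	"$(date -u +'%s')" \
	| curl --data-binary @- https://push.metrics.sr.ht/metrics/job/git.sr.ht
```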

# Off-site backups

We have an off-site backup system in a separate datacenter (in a different city)
from our primary datacenter. We use borg backup to send backups to this server,
typically hourly. The standard backup script looks something like this, but is
tweaked for each service:

```
#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

backup_start="$(date -u +'%s')"

echo "borg create"
borg create \
	::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
	/var/lib/git \
	-e /var/lib/git/.ssh \
	-e /var/lib/git/.gnupg \
	-e /var/lib/git/.ash_history \
	-e /var/lib/git/.viminfo \
	-e '/var/lib/git/*/*/objects/incoming-*' \
	-e '*.keep' \
	--compression lz4 \
	--one-file-system \
	--info --stats "$@"

echo "borg prune"
borg prune \
	--keep-hourly 48 \
	--keep-daily 60 \
	--keep-weekly -1 \
	--info --stats

stats() {
	backup_end="$(date -u +'%s')"
	printf '# TYPE last_backup gauge\n'
	printf '# HELP last_backup Unix timestamp of last backup\n'
	printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
	printf '# TYPE backup_duration gauge\n'
	printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
	printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/git.sr.ht
```

Our `check` script is:

```
#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

# Verify the consistency of the two most recent archives and format the
# output as an email report
check() {
	cat <<-EOF
	To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
	From: git.sr.ht backups <borg@git.sr.ht>
	Subject: git.sr.ht backups report $(date)

	EOF
	borg check --last 2 --info 2>&1
}

check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
```

## Monitoring

Each backup reports its timestamp and duration to our Prometheus Pushgateway
(see [monitoring](/ops/monitoring.md)). We have an alarm configured to fire when
the age of the most recent backup exceeds 48 hours. The age of all borg backups
may be [viewed here][backup age] (in hours).
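
Based on the linked query and the 48-hour threshold, the alert expression is
presumably something like the following, i.e. the age in hours of the
`last_backup` timestamp pushed by the backup script above:

```
(time() - last_backup) / 60 / 60 > 48
```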

We also conduct a weekly `borg check` (on Sunday night, UTC) and forward the
results to the [ops mailing list][ops ml].

[backup age]: https://metrics.sr.ht/graph?g0.range_input=1h&g0.expr=(time()%20-%20last_backup)%20%2F%2060%20%2F%2060&g0.tab=1

## Areas for improvement

1. Our PostgreSQL replication strategy is somewhat poor: several different
   approaches have been experimented with on the same server, and we lack
   monitoring for them. This needs to be rethought. Related to
   [high availability](/ops/availability.md).
2. It would be nice if we could find a way to encapsulate our borg scripts in an
   installable Alpine package.