PostgreSQL is a critical part of the SourceHut infrastructure in terms of
availability, performance, and data integrity.

# Streaming replication

**Goals**

1. Real-time off-site backup of the database
1. Ability to access the database as it appeared in the past
1. Ability to failover to the standby in the event of an outage on the
   primary, with minimal application downtime
1. Ability to failover to the standby during planned maintenance of the
   primary, with minimal application downtime

**Prerequisites**

- ✓ PostgreSQL 12.x on primary
- ✓ PostgreSQL 12.x on standby

**Nice-to-have**

- ✗ Alpine 3.12 on primary
- ✓ Alpine 3.12 on standby

If we have to restart the primary PostgreSQL daemon for any reason, we should
take the opportunity to upgrade the host to Alpine 3.12.

## Preparing the hot standby

First we'll have to prepare a [warm standby][warm standby] (which can be
promoted to primary at short notice, but does not accept queries), then we can
reconfigure it as a [hot standby][hot standby] (which will accept queries).

1. Set up a user on the primary for the standby to connect as for streaming
   replication. It must have the REPLICATION and LOGIN privileges and
   [pg_hba.conf][hba changes] must be updated.
1. Take a [base backup][base backup] of the primary and ship it to the
   standby. Be advised: we have `full_page_writes` disabled as a ZFS
   optimization; this may cause performance degradation during the basebackup
   process. `pg_basebackup -PzFt -D basebackup`. Place the `basebackup`
   directory in `/var/lib/postgresql` on the standby. (A consolidated command
   sketch follows this list.)
1. Nuke the postgres data directory's contents on the standby, su to postgres,
   then run these commands: `tar -C ~/12/data -xf ~/basebackup/base.tar.gz`,
   `tar -C ~/12/data/pg_wal -xf ~/basebackup/pg_wal.tar.gz`,
   `touch ~/12/data/standby.signal`
1. Fill out `primary_conninfo` on the standby with the connection details for
   the account created for replication. `host=... port=5432 user=... password=...`
1. Configure a replication slot on the primary for the standby:
    
        SELECT * FROM pg_create_physical_replication_slot('konpaku');

    Then set `primary_slot_name` to `konpaku` on the standby.
1. Set `hot_standby` to "on" on the standby.
1. Start the database, monitor logs for restore completion.
1. Sanity check the standby database.
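
A consolidated sketch of the commands above; the replication role name
`replicator`, its password, and the host addresses are placeholders:

```sh
# On the primary: create the replication role and the slot for the standby
psql -U postgres -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '...';"
psql -U postgres -c "SELECT * FROM pg_create_physical_replication_slot('konpaku');"
# ...then add a line like "host replication replicator <standby>/32 md5" to
# pg_hba.conf and reload the primary's configuration.

# On the primary: take the base backup and ship it to the standby
pg_basebackup -PzFt -D basebackup
scp -r basebackup <standby>:/var/lib/postgresql/

# On the standby, as the postgres user: restore the base backup
rm -rf ~/12/data/*
tar -C ~/12/data -xf ~/basebackup/base.tar.gz
tar -C ~/12/data/pg_wal -xf ~/basebackup/pg_wal.tar.gz
touch ~/12/data/standby.signal

# In the standby's postgresql.conf:
#   primary_conninfo  = 'host=<primary> port=5432 user=replicator password=...'
#   primary_slot_name = 'konpaku'
#   hot_standby       = on
```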

Note: some of this procedure is adapted from the
[restore the base backup][restore base backup] docs, but modified because those
docs are unclear as written.

[warm standby]: https://www.postgresql.org/docs/current/warm-standby.html
[hot standby]: https://www.postgresql.org/docs/current/hot-standby.html
[base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-BASE-BACKUP
[restore base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-PITR-RECOVERY
[hba changes]: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION

✓ Process tested

✓ Completed in production

# Deploying pgbouncer

pgbouncer will allow us to quickly change which database server the fleet is
using at any given moment. It does not handle load balancing directly; if we
want that, we would need a secondary approach (the pgbouncer docs suggest DNS
round-robin or haproxy).

A configuration similar to the following will be deployed fleet-wide:

```
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 16432
pool_mode = transaction

auth_type = trust
auth_file = /etc/pgbouncer/userlist.txt

max_client_conn = 100
default_pool_size = 20

[databases]
xxx.sr.ht=host=remilia.sr.ht user=xxx.sr.ht password=...
```

Each service's config.ini will point at localhost through pgbouncer rather than
directly at the database servers. The OpenRC init scripts will need to be
updated to `use pgbouncer` so that services start up in the correct order (see
the sketch below).
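
A sketch of that OpenRC dependency change, using a hypothetical meta.sr.ht
service script as the example; the rest of the `depend()` contents vary per
service:

```sh
# /etc/init.d/metasrht (excerpt; service name is hypothetical)
depend() {
    need net
    use pgbouncer  # if pgbouncer is in the runlevel, start it before this service
}
```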

✓ Process tested

# Application-level updates

For web applications, we should abort requests which would write to the
database and instead show a page explaining the circumstances. The user can
retry the same action later.

For the SSH pipelines:

- Repository auto-creation attempts should abort and tell the user
- Repository mtime updates should be silently dropped
- If unknown SSH keys can be fetched from meta.sr.ht, use them, but don't store them
- Webhook delivery attempts should be dropped and a warning printed

For builds.sr.ht, if any database update is met with a read-only error, retry
with exponential backoff, blocking until the write succeeds. If a deadline is
exceeded, abort the build.
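
A sketch of that retry policy, using `psql` as a stand-in for the worker's
actual database write; the table, delays, and deadline are illustrative:

```sh
#!/bin/sh
# Retry a write with exponential backoff until it succeeds or a deadline passes.
deadline=$(( $(date +%s) + 600 ))  # give up after ~10 minutes
delay=1
until psql -c "UPDATE job SET status = 'running' WHERE id = 1234;"
do
    # On a hot standby this fails with "cannot execute UPDATE in a
    # read-only transaction"; back off and try again.
    if [ "$(date +%s)" -gt "$deadline" ]; then
        echo "database still read-only; aborting build" >&2
        exit 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
done
```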

For the webhook processing celery workers, it should be acceptable for them to
simply fail. Webhooks are only generated when the database is updated, so if the
database can't be updated, ¯\\\_(ツ)\_/¯!

There may be some room for future improvements in all of these areas, but this
is sufficient for the MVP.

✓ core.sr.ht patch to (somewhat) gracefully handle a read-only database

✗ read-only support for celery jobs

✗ read-only support for git.sr.ht, hg.sr.ht SSH pipeline

✗ read-only support for ongoing builds.sr.ht jobs

# Failover procedure (read-only mode)

*If possible, this is to be announced in advance. It will cause a partial outage
of all services.*

First, prepare all services for failover by updating their pgbouncer
configurations to point to the standby. Do not reload any of the pgbouncer
instances until every config has been updated; then use a script like the
following to reload them all at once.

```sh
#!/bin/sh
hosts="meta.sr.ht
  git.sr.ht
  hg.sr.ht
  builds.sr.ht
  lists.sr.ht
  todo.sr.ht
  man.sr.ht
  paste.sr.ht
  hub.sr.ht
  cirno1.sr.ht
  cirno2.sr.ht
  legacy.sr.ht
  mail-b.sr.ht"

for host in $hosts
do
    ssh $host doas service pgbouncer reload
done
```
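
The config change itself can be scripted the same way; a sketch, assuming the
standby is reachable as `konpaku.sr.ht` and each host keeps its `[databases]`
entries in `/etc/pgbouncer/pgbouncer.ini`:

```sh
for host in $hosts
do
    # Point every pgbouncer [databases] entry at the standby
    ssh $host "doas sed -i 's/host=remilia.sr.ht/host=konpaku.sr.ht/' \
        /etc/pgbouncer/pgbouncer.ini"
done
```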

✓ Process tested

# Standby promotion procedure (read/write mode)

1. Prerequisite: complete the read-only failover procedure
1. Sanity check the application connection to the standby
1. Shut off the master
1. Run `pg_ctl promote` on the standby (see the sketch below)
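
A sketch of the last two steps, run as the postgres user and assuming the data
directory layout described above (`~/12/data`):

```sh
# On the standby: confirm it is in recovery and caught up with the primary
psql -c "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

# On the primary: stop PostgreSQL
doas service postgresql stop

# On the standby: promote it to accept writes
pg_ctl promote -D ~/12/data
```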

✓ Process tested

# Master demotion procedure

In this procedure, the server being demoted is referred to as the "old master"
and the new master is referred to as such.

1. Prerequisite: complete the standby promotion procedure
1. Create a replication slot for the old master:

        SELECT * FROM pg_create_physical_replication_slot('name...');
1. Create a user for the old master with the LOGIN and REPLICATION privileges
1. Fill out `primary_slot_name` and `primary_conninfo` in the old master's
   config, set `hot_standby` to on, and create `standby.signal` in the data
   directory, as when preparing the hot standby (see the sketch below)
1. Confirm that the old master has replication access in the new master's
   `pg_hba.conf`
1. Start the old master and monitor the logs; it should catch up
1. Sanity check the database
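
A sketch of the old master's standby configuration before it is started back
up, assuming a replication role `replicator` was created on the new master;
`standby.signal` is what makes the old master start in recovery:

```sh
# On the old master, as the postgres user:
touch ~/12/data/standby.signal

# In the old master's postgresql.conf:
#   primary_conninfo  = 'host=<new master> port=5432 user=replicator password=...'
#   primary_slot_name = '<slot created above>'
#   hot_standby       = on
```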

The old master is now the standby, and the old standby is now the master.

✓ Process tested

# Database upgrade procedure

## Minor version upgrades

This procedure will incur up to a few minutes of downtime, though ideally the
outage will only last a few seconds.

1. Complete the standby promotion procedure
1. Upgrade the master (see the package upgrade sketch below), then complete the
   master demotion procedure

At this point, the master has become the standby and is now up-to-date. Repeat
the process to upgrade the new standby and return the master/standby
assignments to normal.
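
The upgrade step itself is just a package upgrade on the demoted host; a
sketch, assuming the stock Alpine packages:

```sh
# On the demoted host (its PostgreSQL daemon is already stopped at this point)
doas apk update
doas apk upgrade postgresql postgresql-client
# The daemon is started again as part of the master demotion procedure.
```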

✗ TODO: Test this process

## Major version upgrades

This process will incur a longer service degradation: the services will be in
read-only mode for up to 30 minutes.

1. Complete the failover procedure
1. Pause WAL replay on the standby: `SELECT pg_wal_replay_pause();`
1. Perform the upgrade (see the sketch below) and reboot the host
1. Sanity check the database
1. Reverse the failover procedure
1. Complete the "Preparing the hot standby" procedure to rebuild the standby
   from scratch; do system updates and reboot
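
A sketch of the upgrade step using `pg_upgrade`, with placeholder paths; the
actual binary and data directories depend on how the new major version is
installed on Alpine:

```sh
# As the postgres user, with both major versions installed, the old cluster
# stopped, and the new cluster already initialized with initdb at ~/13/data:
pg_upgrade --check \
    -b <old bindir> -B <new bindir> \
    -d ~/12/data -D ~/13/data
# If the check passes, run the same command again without --check.
```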

✗ TODO: Test this process

# Accessing past data

The easiest way to do this is likely to be cloning an older snapshot of the ZFS
dataset on either the master or standby server and running a secondary
PostgreSQL instance from it.
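
A sketch of that approach, assuming the data directory lives on a dataset
called `tank/postgres` with periodic snapshots; dataset, snapshot, and
mountpoint names are placeholders:

```sh
# Find a snapshot from the point in time of interest and clone it (cheap, writable)
zfs list -r -t snapshot tank/postgres
doas zfs clone -o mountpoint=/mnt/pg-past tank/postgres@<snapshot> tank/pg-past

# Run a throwaway instance against the clone on an alternate port
# (if the clone came from the standby, remove standby.signal so it starts standalone)
doas -u postgres pg_ctl -D /mnt/pg-past/12/data -o "-p 5433" start
psql -p 5433 -c "SELECT now();"  # sanity check, then query as needed

# When finished, stop the instance and destroy the clone
doas -u postgres pg_ctl -D /mnt/pg-past/12/data stop
doas zfs destroy tank/pg-past
```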

✗ TODO: Test this process

# Changes to make on the next primary reboot

- Set `wal_level` to replica; we are not going to use logical replication. TODO:
  Will this create WAL files which are incompatible with one another?
- Drop the `archive_command` and set `archive_mode` to off; we're going to use
  streaming replication instead of log shipping (see the excerpt below).
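
The intended end state in postgresql.conf (a sketch):

```
wal_level = replica
archive_mode = off
# archive_command dropped; the standby is fed by streaming replication
```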

# Open questions

- What is the plan of execution for replacing the server hardware?
- How often are WALs recycled, and how far into the past can we peek?
- Can we/should we detect failure conditions and automate the failover
  procedure?
- What kinds of monitoring and alarms can we/should we set up?
- How can we maintain an ongoing high level of confidence in our system?

It may be possible to *always* service SELECT statements from either the primary
or standby. If we eventually require load balancing, these queries could be
eligible. However, queries (or transactions?) which mutate the database may only
be executed on the primary. Therefore:

- If we require more capacity for writes, is vertical scaling our only option?
- In the event of a primary outage, are we able to transparently direct reads
  to the standby?
- Can application-level changes be implemented which would allow for reduced
  service availability (read-only) in the event of a primary outage?
- If applications cannot write to the database, but can write to other data
  stores (filesystem, redis, etc), can inconsistent states be produced?
- Can we simulate typical and excessive levels of application traffic against
  the database for mock scenarios?

# Mock contingency scenarios

We need to set up a test network with a primary and standby and induce failure
scenarios to learn how our applications behave under those conditions and
practice the failure resolution steps. Such a test network is already partially
provisioned on Drew's home network.

This will require at least the following:

- A couple of sr.ht applications (meta.sr.ht and git.sr.ht would be good)
- A primary and standby PostgreSQL server

We should play out the following scenarios:

- Promoting the standby to the primary
- Shutting down the primary or standby without preparation
- Accessing the database state at an arbitrary point in the past
- Bringing the standby back up to date after a primary outage
- Performing major and minor version upgrades (pg 11 to 12)
- Performing host OS upgrades (Alpine 3.11 to 3.12 or edge)
- Causing a ZFS resilver on the primary and monitoring database performance:
  does this trigger a "soft outage" where performance is awful but the database
  is ostensibly online?

At each stage of these processes, observe application behavior and record these
observations.

# Verification process

The whole setup needs to be performed on a secondary deployment and each service
carefully tested under normal and failure mode conditions. Of particular
interest:

- Do git.sr.ht's more sophisticated pg commands from the SSH pipeline work
  correctly?
- Do lists.sr.ht's prepared statements get along well with pgbouncer?
- Do all services work reasonably well in a read-only mode?
- Do all services work correctly with pgbouncer's transaction mode?