author     Drew DeVault <sir@cmpwn.com>   2020-06-30 14:42:29 -0400
committer  Drew DeVault <sir@cmpwn.com>   2020-06-30 14:42:29 -0400
commit     598c64ac1df4d3c8a02768419a544096db1865b4
tree       e6cfdbe532535e85fbfc6d49abd4b3669ddaab1e /ops
parent     72fe32c37fd3cf615c368f0344edad310e425031
download   sr.ht-docs-598c64ac1df4d3c8a02768419a544096db1865b4.tar.gz
Updates to PSQL planning document
Diffstat (limited to 'ops')

 ops/robust-psql.md | 161
 1 file changed, 129 insertions, 32 deletions
diff --git a/ops/robust-psql.md b/ops/robust-psql.md
index 69ad2ec..8a157f5 100644
--- a/ops/robust-psql.md
+++ b/ops/robust-psql.md
@@ -7,22 +7,20 @@ availability, performance, and data integrity.
 
 1. Real-time off-site backup of the database
 1. Ability to access the database as it appeared in the past
-1. Ability to failover to the secondary in the event of an outage on the
+1. Ability to failover to the standby in the event of an outage on the
    primary, with minimal application downtime
-1. Ability to failover to the secondary during planned maintenance of the
+1. Ability to failover to the standby during planned maintenance of the
    primary, with minimal application downtime
 
 **Prerequisites**
 
-✓ PostgreSQL 12.x on primary
-
-✓ PostgreSQL 12.x on secondary
+- ✓ PostgreSQL 12.x on primary
+- ✓ PostgreSQL 12.x on standby
 
 **Nice-to-have**
 
-✗ Alpine 3.12 on primary
-
-✓ Alpine 3.12 on secondary
+- ✗ Alpine 3.12 on primary
+- ✓ Alpine 3.12 on standby
 
 If we have to restart the primary PostgreSQL daemon for any reason, we should
 take the opportunity to upgrade the host to Alpine 3.12.
@@ -33,28 +31,28 @@ First we'll have to prepare a [warm standby][warm standby] (which can be
 promoted to primary at short notice, but does not accept queries), then we can
 reconfigure it as a [hot standby][hot standby] (which will accept queries).
 
-1. Set up a user on the primary for the secondary to connect as for streaming
+1. Set up a user on the primary for the standby to connect as for streaming
    replication. It must have the REPLICATION and LOGIN privileges and
    [pg_hba.conf][hba changes] must be updated.
 1. Take a [base backup][base backup] of the primary and ship it to the
-   secondary. Be advised: we have `full_page_writes` disabled as a ZFS
+   standby. Be advised: we have `full_page_writes` disabled as a ZFS
    optimization; this may cause performance degradation during the basebackup
    process. `pg_basebackup -PzFt -D basebackup`. Place 'basebackup' in
    `/var/lib/postgresql`.
-1. Nuke the postgres data directory's contents on the secondary, su to postgres,
+1. Nuke the postgres data directory's contents on the standby, su to postgres,
    then run these commands: `tar -C ~/12/data -xf ~/basebackup/base.tar.gz`,
    `tar -C ~/12/data/pg_wal -xf ~/basebackup/pg_wal.tar.gz`,
    `touch ~/12/data/standby.signal`
-1. Fill out `primary_conninfo` on the secondary with the connection details for
+1. Fill out `primary_conninfo` on the standby with the connection details for
    the account created for replication. `host=... port=5432 user=... password=...`
-1. Configure a replication slot on the primary for the secondary:
+1. Configure a replication slot on the primary for the standby:
 
        SELECT * FROM pg_create_physical_replication_slot('konpaku');
 
-   Then set `primary_slot_name` to `konpaku` on the secondary.
-1. Set `hot_standby` to "on" on the secondary.
+   Then set `primary_slot_name` to `konpaku` on the standby.
+1. Set `hot_standby` to "on" on the standby.
 1. Start the database, monitor logs for restore completion.
-1. Sanity check the secondary database.
+1. Sanity check the standby database.
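Condensed into commands, the bootstrap procedure described above looks roughly
like the sketch below. The `replicator` role name and all connection details are
placeholders, and the data directory layout (`~/12/data` for the postgres user)
is taken from the steps above; this is an illustration, not a tested runbook.

```
# On the primary: replication role, replication slot, and base backup.
# pg_hba.conf must allow 'replicator' to connect for replication.
psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '...';"
psql -c "SELECT * FROM pg_create_physical_replication_slot('konpaku');"
pg_basebackup -PzFt -D basebackup   # then ship 'basebackup' to the standby

# On the standby, as postgres, with the backup placed in /var/lib/postgresql:
rm -rf ~/12/data/*
tar -C ~/12/data -xf ~/basebackup/base.tar.gz
tar -C ~/12/data/pg_wal -xf ~/basebackup/pg_wal.tar.gz
touch ~/12/data/standby.signal

# Standby settings (postgresql.conf), then start the database and watch the logs:
#   primary_conninfo  = 'host=... port=5432 user=replicator password=...'
#   primary_slot_name = 'konpaku'
#   hot_standby       = on
```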
 
 Note: some of this procedure is adapted from the
 [restore the base backup][restore base backup] docs, but modified cause those
@@ -66,26 +64,117 @@ docs are shit.
 [restore base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-PITR-RECOVERY
 [hba changes]: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION
 
-# Failover procedure (primary & secondary role swap)
+✓ Process tested
 
-TBD
+# Deploying pgbouncer
+
+pgbouncer will allow us to quickly change which database server the fleet is
+using at any given moment. It does not handle load balancing directly; we would
+need to use a secondary approach for this if desirable (DNS round robin and
+haproxy are suggested by the docs).
+
+A configuration similar to the following will be deployed fleet-wide:
+
+```
+[pgbouncer]
+listen_addr = 127.0.0.1
+listen_port = 16432
+pool_mode = transaction
+
+auth_type = trust
+auth_file = /etc/pgbouncer/userlist.txt
+
+max_client_conn = 100
+default_pool_size = 20
+
+[databases]
+xxx.sr.ht=host=remilia.sr.ht user=xxx.sr.ht password=...
+```
+
+config.ini's will point at localhost through pgbouncer rather than directly at
+the database servers. The initd scripts will need to be updated to `use
+pgbouncer` so that services are started up correctly.
+
+✓ Process tested
+
+# Application-level updates
+
+For web applications, we should abort requests which would write to the database
+and instead show a page explaining the circumstances. The user can attempt a
+refresh later to re-attempt the same action.
+
+For the SSH pipelines:
+
+- Repository auto-creation attempts should abort and tell the user
+- Repository mtime updates should be silently dropped
+- If unknown SSH keys can be fetched from meta.sr.ht, use them, but don't store them
+- Webhook delivery attempts should be dropped and a warning printed
+
+For builds.sr.ht, if any database update is met with a read-only error,
+retry with an exponential backoff and block until it succeeds. If it exceeds a
+deadline, then abort the build.
+
+For webhook processing celery workers, it should be acceptable for them to simply
+fail. Webhooks are only generated when the database is updated, so if the
+database can't be updated, ¯\\\_(ツ)\_/¯!
+
+There may be some room for future improvements in all of these areas, but this
+is sufficient for the MVP.
+
+✓ core.sr.ht patch to (somewhat) gracefully handle a read-only database
+
+✗ read-only support for celery jobs
+
+✗ read-only support for git.sr.ht, hg.sr.ht SSH pipeline
+
+✗ read-only support for ongoing builds.sr.ht jobs
+
+# Failover procedure (read-only mode)
+
+*Ideally, this is to be announced in advance, as it will cause a partial outage
+across all services.*
+
+First, prepare all services for failover by updating their pgbouncer
+configurations to point to the standby. Do not reload pgbouncer until all
+service configs have been updated; then use a script to reload them all at once.
+
+TODO: Write that script
+
+✓ Process tested
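The "reload them all at once" script does not exist yet; a minimal sketch of what
it might look like is below. The host list, the pgbouncer config path, and the
standby's address are placeholders rather than real fleet details, and it assumes
each host's `[databases]` entry currently points at remilia.sr.ht as in the
example config above.

```
#!/bin/sh
# Sketch only: stage the new database host everywhere, then reload every
# pgbouncer as close to simultaneously as possible.
set -e

STANDBY="standby.example.org"      # placeholder address of the standby
HOSTS="host1 host2 host3"          # placeholder list of hosts running pgbouncer

# Stage the config change without reloading anything yet
for h in $HOSTS; do
	ssh "$h" "sed -i 's/host=remilia\.sr\.ht/host=$STANDBY/' /etc/pgbouncer/pgbouncer.ini"
done

# Now reload them all; pgbouncer re-reads its config on SIGHUP
for h in $HOSTS; do
	ssh "$h" "pkill -HUP pgbouncer" &
done
wait
```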
+
+# Standby promotion procedure (read/write mode)
+
+1. Prerequisite: complete the read-only failover procedure
+1. Shut off the master
+1. Run `pg_ctl promote` on the standby
+1. Convert the master to a standby
+
+✗ TODO: Test this process
 
 # Database upgrade procedure
 
 TBD
 
+✗ TODO: Test this process
+
 ## Minor version upgrades
 
 TBD
 
+✗ TODO: Test this process
+
 ## Major version upgrades
 
 TBD
 
+✗ TODO: Test this process
+
 # Accessing past data
 
 TBD
 
+✗ TODO: Test this process
+
 # Changes to make on the next primary reboot
 
 - Set `wal_level` to replica; we are not going to use logical replication.
 
 TODO:
@@ -95,10 +184,6 @@ TBD
 
 # Open questions
 
-- Would pgbouncer/pgpool be valuable here? It could be used to quickly redirect
-  traffic from some or all nodes to one server or another. If so, we should
-  probably put it on a dedicated host so that it's robust to host-level failures
-  in the primary or secondary servers.
 - What is the plan of execution for replacing the server hardware?
 - How often are WALs recycled, how far into the past can we peek?
 - Can we/should we detect failure conditions and automate the failover
@@ -107,13 +192,13 @@ TBD
 - How can we maintain an ongoing high level of confidence in our system?
 
 It may be possible to *always* service SELECT statements from either the primary
-or secondary. If we eventually require load balancing, these queries could be
+or standby. If we eventually require load balancing, these queries could be
 eligible. However, queries (or transactions?) which mutate the database may only
 be executed on the primary. Therefore:
 
 - If we require more capacity for writes, is vertical scaling our only option?
 - In the event of a primary outage, are we able to transparently direct reads
-  to the secondary?
+  to the standby?
 - Can application-level changes be implemented which would allow for reduced
   service availability (read-only) in the event of a primary outage?
 - If applications cannot write to the database, but can write to other data
@@ -123,22 +208,22 @@ be executed on the primary. Therefore:
 
 # Mock contingency scenarios
 
-We need to set up a test network with a primary and secondary and induce failure
+We need to set up a test network with a primary and standby and induce failure
 scenarios to learn how our applications behave under those conditions and
-practice the failure resolution steps.
+practice the failure resolution steps. Such a test network is already partially
+provisioned on Drew's home network.
 
 This will require at least the following:
 
 - A couple of sr.ht applications (meta.sr.ht and git.sr.ht would be good)
-- A primary and secondary PostgreSQL server
+- A primary and standby PostgreSQL server
 
-At least three separate VMs are required here, four if we set up a standalone
-pgbouncer or pgpool. Then we should play out the following scenarios:
+We should play out the following scenarios:
 
-- Promoting the secondary to the primary
-- Shut down the primary or secondary without pre-preparation
+- Promoting the standby to the primary
+- Shut down the primary or standby without pre-preparation
 - Accessing the database state at an arbitrary point in the past
-- Bringing the secondary back up to date after a primary outage
+- Bringing the standby back up to date after a primary outage
 - Major and minor version upgrades (pg 11 to 12)
 - Host OS upgrades (Alpine 3.11 to 3.12 or edge)
 - Cause a ZFS resilver on the primary and monitor database performance: does
@@ -147,3 +232,15 @@ pgbouncer or pgpool. Then we should play out the following scenarios:
 
 At each stage of these processes, observe application behavior and record these
 observations.
+
+# Verification process
+
+The whole setup needs to be performed on a secondary deployment and each service
+carefully tested under normal and failure mode conditions. Of particular
+interest are:
+
+- Do git.sr.ht's more sophisticated pg commands from the SSH pipeline work
+  correctly?
+- Do lists.sr.ht's prepared statements get along well with pgbouncer?
+- Do all services work reasonably well in a read-only mode?
+- Do all services work correctly with pgbouncer's transaction mode?
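For the standby promotion procedure in the patch above, the PostgreSQL-side
commands might look like the following. The pg_rewind-based conversion of the
old primary is only one option and requires that `wal_log_hints` (or data
checksums) was enabled and that the old primary shut down cleanly; otherwise the
old primary has to be re-seeded from a fresh base backup. Hostnames and
credentials are placeholders.

```
# On the standby, once the old primary has been shut off:
pg_ctl promote -D ~/12/data          # or: psql -c "SELECT pg_promote();"

# Converting the old primary into a standby (pg_rewind variant):
pg_ctl stop -D ~/12/data
pg_rewind --target-pgdata ~/12/data \
    --source-server 'host=new-primary.example.org port=5432 user=... password=...'
touch ~/12/data/standby.signal
# ...then point primary_conninfo / primary_slot_name at the new primary and
# start the old primary back up as a standby.
```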
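While working through these scenarios (and when sanity checking the standby in
general), replication health can be spot-checked with PostgreSQL's built-in
views and functions, for example:

```
# On the primary: is the standby attached, and how far behind is it?
psql -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# On the standby: still in recovery, and what was last replayed?
psql -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"
```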