PostgreSQL is a critical part of the SourceHut infrastructure in terms of availability, performance, and data integrity.

# Streaming replication

**Goals**

1. Real-time off-site backup of the database
1. Ability to access the database as it appeared in the past
1. Ability to fail over to the standby in the event of an outage on the primary, with minimal application downtime
1. Ability to fail over to the standby during planned maintenance of the primary, with minimal application downtime

**Prerequisites**

- ✓ PostgreSQL 12.x on primary
- ✓ PostgreSQL 12.x on standby

**Nice-to-have**

- ✗ Alpine 3.12 on primary
- ✓ Alpine 3.12 on standby

If we have to restart the primary PostgreSQL daemon for any reason, we should take the opportunity to upgrade the host to Alpine 3.12.

## Preparing the hot standby

First we'll have to prepare a [warm standby][warm standby] (which can be promoted to primary at short notice, but does not accept queries), then we can reconfigure it as a [hot standby][hot standby] (which will accept queries).

1. Set up a user on the primary for the standby to connect as for streaming replication. It must have the REPLICATION and LOGIN privileges, and [pg_hba.conf][hba changes] must be updated.
1. Take a [base backup][base backup] of the primary with `pg_basebackup -PzFt -D basebackup` and ship it to the standby, placing `basebackup` in `/var/lib/postgresql`. Be advised: we have `full_page_writes` disabled as a ZFS optimization; this may cause performance degradation during the base backup process.
1. Nuke the contents of the postgres data directory on the standby, su to postgres, then run these commands:

       tar -C ~/12/data -xf ~/basebackup/base.tar.gz
       tar -C ~/12/data/pg_wal -xf ~/basebackup/pg_wal.tar.gz
       touch ~/12/data/standby.signal

1. Fill out `primary_conninfo` on the standby with the connection details for the account created for replication: `host=... port=5432 user=... password=...`
1. Configure a replication slot on the primary for the standby:

       SELECT * FROM pg_create_physical_replication_slot('konpaku');

   Then set `primary_slot_name` to `konpaku` on the standby.
1. Set `hot_standby` to `on` on the standby.
1. Start the database and monitor the logs for restore completion.
1. Sanity check the standby database.

Note: some of this procedure is adapted from the [restore the base backup][restore base backup] docs, but modified because those docs leave a lot to be desired.

[warm standby]: https://www.postgresql.org/docs/current/warm-standby.html
[hot standby]: https://www.postgresql.org/docs/current/hot-standby.html
[base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-BASE-BACKUP
[restore base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-PITR-RECOVERY
[hba changes]: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION

✓ Process tested

✓ Completed in production

# Deploying pgbouncer

pgbouncer will allow us to quickly change which database server the fleet is using at any given moment. It does not handle load balancing directly; if we want that, we would need a secondary approach (the pgbouncer docs suggest DNS round-robin or haproxy). A configuration similar to the following will be deployed fleet-wide:

```
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 16432
pool_mode = transaction
auth_type = trust
auth_file = /etc/pgbouncer/userlist.txt
max_client_conn = 100
default_pool_size = 20

[databases]
xxx.sr.ht=host=remilia.sr.ht user=xxx.sr.ht password=...
```
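With `auth_type = trust`, pgbouncer skips password checks, but the connecting username must still be listed in the `auth_file`. As a rough sketch (using the same `xxx.sr.ht` placeholder as above), `/etc/pgbouncer/userlist.txt` holds one quoted username/password pair per line, with the password field left empty since it is ignored under trust:

```
"xxx.sr.ht" ""
```

Connectivity through the bouncer can then be spot-checked with something like `psql -h 127.0.0.1 -p 16432 -U xxx.sr.ht xxx.sr.ht`.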
config.ini files will point at localhost through pgbouncer rather than directly at the database servers. The init scripts will need to be updated to `use pgbouncer` so that services are started up correctly.

✓ Process tested

# Application-level updates

For web applications, we should abort requests which would write to the database and instead show a page explaining the circumstances. The user can refresh later to re-attempt the same action.

For the SSH pipelines:

- Repository auto-creation attempts should abort and tell the user
- Repository mtime updates should be silently dropped
- If unknown SSH keys can be fetched from meta.sr.ht, use them, but don't store them
- Webhook delivery attempts should be dropped and a warning printed

For builds.sr.ht, if any database update is met with a read-only error, retry with an exponential backoff and block until it succeeds. If it exceeds a deadline, then abort the build.

For the webhook processing celery workers, it should be acceptable for them to simply fail. Webhooks are only generated when the database is updated, so if the database can't be updated, ¯\\\_(ツ)\_/¯!

There may be some room for future improvements in all of these areas, but this is sufficient for the MVP.

✓ core.sr.ht patch to (somewhat) gracefully handle a read-only database

✗ read-only support for celery jobs

✗ read-only support for the git.sr.ht and hg.sr.ht SSH pipelines

✗ read-only support for ongoing builds.sr.ht jobs

# Failover procedure (read-only mode)

*If possible, this is to be announced in advance. It will cause a partial outage of all services.*

First, prepare all services for failover by updating their pgbouncer configurations to point to the standby. Do not reload any of the pgbouncer instances until all service configs have been updated, then use a script to reload them all at once:

```sh
#!/bin/sh
hosts="meta.sr.ht
git.sr.ht
hg.sr.ht
builds.sr.ht
lists.sr.ht
todo.sr.ht
man.sr.ht
paste.sr.ht
hub.sr.ht
cirno1.sr.ht
cirno2.sr.ht
legacy.sr.ht
mail-b.sr.ht"

for host in $hosts
do
	ssh "$host" doas service pgbouncer reload
done
```

✓ Process tested

# Standby promotion procedure (read/write mode)

1. Prerequisite: complete the read-only failover procedure
1. Sanity check the application connection to the standby
1. Shut off the master
1. Run `pg_ctl promote` on the standby

✓ Process tested

# Master demotion procedure

In this procedure, the server being demoted is referred to as the "old master" and the new master is referred to as such.

1. Prerequisite: complete the standby promotion procedure
1. Create a replication slot for the old master:

       SELECT * FROM pg_create_physical_replication_slot('name...');

1. Create a user for the old master with the LOGIN and REPLICATION privileges
1. Fill out `primary_slot_name` and `primary_conninfo` in the old master's config, and set `hot_standby` to `on`
1. Confirm that the old master has replication access in the new master's `pg_hba.conf`
1. Start the old master and monitor the logs; it should catch up
1. Sanity check the database

The old master is now the standby, and the old standby is now the master.

✓ Process tested

# Database upgrade procedure

## Minor version upgrades

This procedure will incur up to a few minutes of downtime, though ideally the outage will last only a few seconds.

1. Complete the standby promotion procedure
1. Upgrade the master, then complete the master demotion procedure

At this point, the master has become the standby and is now up-to-date.
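Before repeating the procedure in the other direction, it's worth confirming which node is actually in recovery and that the new standby has caught up. A quick sanity check, run as the postgres user (these are stock PostgreSQL functions and views, nothing specific to our setup):

```sh
# "f" on the node serving writes, "t" on the standby
psql -c "SELECT pg_is_in_recovery();"

# On the master: current WAL position and what each standby has replayed
psql -c "SELECT pg_current_wal_lsn();"
psql -c "SELECT client_addr, state, replay_lsn FROM pg_stat_replication;"

# On the standby: the last WAL position replayed locally
psql -c "SELECT pg_last_wal_replay_lsn();"
```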
Repeat the process again to upgrade the new standby and return the master/standby assignments to normal.

✗ TODO: Test this process

## Major version upgrades

This process will incur a longer period of service degradation: the services will be in read-only mode for up to 30 minutes.

1. Complete the failover procedure
1. Stop replication on the standby: `SELECT pg_wal_replay_pause();`
1. Perform the upgrade and reboot the host
1. Sanity check the database
1. Reverse the failover procedure
1. Complete the "Preparing the hot standby" procedure to rebuild the standby from scratch; do system updates and reboot

✗ TODO: Test this process

# Accessing past data

The easiest way to do this is likely to be mounting an old version of the ZFS dataset on either the master or standby server, and running a secondary PostgreSQL instance from it.

✗ TODO: Test this process

# Changes to make on the next primary reboot

- Set `wal_level` to `replica`; we are not going to use logical replication. TODO: will this create WAL files which are incompatible with one another?
- Drop the `archive_command` and set `archive_mode` to off; we're going to use streaming replication instead of log shipping.

# Open questions

- What is the plan of execution for replacing the server hardware?
- How often are WALs recycled, and how far into the past can we peek?
- Can we/should we detect failure conditions and automate the failover procedure?
- What kinds of monitoring and alarms can we/should we set up?
- How can we maintain an ongoing high level of confidence in our system?

It may be possible to *always* service SELECT statements from either the primary or standby. If we eventually require load balancing, these queries could be eligible. However, queries (or transactions?) which mutate the database may only be executed on the primary. Therefore:

- If we require more capacity for writes, is vertical scaling our only option?
- In the event of a primary outage, are we able to transparently direct reads to the standby?
- Can application-level changes be implemented which would allow for reduced service availability (read-only) in the event of a primary outage?
- If applications cannot write to the database, but can write to other data stores (filesystem, redis, etc.), can inconsistent states be produced?
- Can we simulate typical and excessive levels of application traffic against the database for mock scenarios?

# Mock contingency scenarios

We need to set up a test network with a primary and a standby, then induce failure scenarios to learn how our applications behave under those conditions and to practice the resolution steps. Such a test network is already partially provisioned on Drew's home network. This will require at least the following:

- A couple of sr.ht applications (meta.sr.ht and git.sr.ht would be good)
- A primary and standby PostgreSQL server

We should play out the following scenarios:

- Promoting the standby to the primary
- Shutting down the primary or standby without prior preparation
- Accessing the database state at an arbitrary point in the past
- Bringing the standby back up to date after a primary outage
- Major and minor version upgrades (pg 11 to 12)
- Host OS upgrades (Alpine 3.11 to 3.12 or edge)
- Causing a ZFS resilver on the primary and monitoring database performance: does this trigger a "soft outage" where performance is awful but the database is ostensibly online?

At each stage of these processes, observe application behavior and record these observations.
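For the "accessing past data" scenario, a rough sketch of what this might look like, assuming a hypothetical layout where the dataset `data/postgres` is mounted at `/var/lib/postgresql` and snapshotted regularly (the dataset, snapshot, and mountpoint names here are illustrative only):

```sh
# Clone a snapshot of the database dataset and mount it somewhere harmless
zfs clone data/postgres@<snapshot> data/postgres-pitr
zfs set mountpoint=/var/lib/postgresql-pitr data/postgres-pitr

# The clone is a crash-consistent copy; clear out runtime state so a second
# instance starts cleanly and doesn't try to act as a standby
rm -f /var/lib/postgresql-pitr/12/data/postmaster.pid
rm -f /var/lib/postgresql-pitr/12/data/standby.signal

# Run a throwaway instance on another port, then tear everything down afterwards
su postgres -c 'pg_ctl -D /var/lib/postgresql-pitr/12/data -o "-p 5433" start'
su postgres -c 'pg_ctl -D /var/lib/postgresql-pitr/12/data stop'
zfs destroy data/postgres-pitr
```

Depending on which host the snapshot came from, it may also be worth disabling `archive_mode` and any `primary_*` settings in the clone's `postgresql.conf` before starting it, so the throwaway instance can't interfere with the real WAL archive or replication.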
# Verification process

The whole setup needs to be performed on a secondary deployment and each service carefully tested under normal and failure-mode conditions. Of particular interest:

- Do git.sr.ht's more sophisticated pg commands from the SSH pipeline work correctly?
- Do lists.sr.ht's prepared statements get along well with pgbouncer?
- Do all services work reasonably well in a read-only mode?
- Do all services work correctly with pgbouncer's transaction mode?
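For the read-only checks in particular, a full standby isn't strictly necessary on the test network; the primary can be flipped into read-only mode and back, which approximates (but is not identical to) the errors a hot standby returns. A sketch, run as the postgres user:

```sh
# New transactions become read-only by default; writes fail with
# "cannot execute ... in a read-only transaction", as on a hot standby
psql -c "ALTER SYSTEM SET default_transaction_read_only = on;"
psql -c "SELECT pg_reload_conf();"

# ...exercise the services...

psql -c "ALTER SYSTEM RESET default_transaction_read_only;"
psql -c "SELECT pg_reload_conf();"
```

Note that applications can override this setting per session, so it's a testing convenience rather than a safety mechanism.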