| | | |
|---|---|---|
| author | Drew DeVault <sir@cmpwn.com> | 2020-06-30 10:47:56 -0400 |
| committer | Drew DeVault <sir@cmpwn.com> | 2020-06-30 10:48:20 -0400 |
| commit | 134328b0ed45995ec33928d0f6738bcdd071fe0a (patch) | |
| tree | e4aa302038351a41f642673fc45ae368dd745d6c /ops | |
| parent | 6355d0aefc1bb120d956076b84ca41936b36f3d8 (diff) | |
Add robust PSQL planning document
Diffstat (limited to 'ops'):

| | | |
|---|---|---|
| -rw-r--r-- | ops/index.md | 1 |
| -rw-r--r-- | ops/robust-psql.md | 143 |

2 files changed, 144 insertions, 0 deletions

    diff --git a/ops/index.md b/ops/index.md
    index dc9c75e..6aaeb95 100644
    --- a/ops/index.md
    +++ b/ops/index.md
    @@ -17,6 +17,7 @@ Additional resources:
     - [Monitoring & alarms](/ops/monitoring.md)
     - [Network topology](/ops/topology.md)
     - [Provisioning & allocation](/ops/provisioning.md)
    +- [PostgreSQL robustness planning](/ops/robust-psql.md)
     
     # Operational Resources

    diff --git a/ops/robust-psql.md b/ops/robust-psql.md
    new file mode 100644
    index 0000000..bd24387
    --- /dev/null
    +++ b/ops/robust-psql.md
    @@ -0,0 +1,143 @@

PostgreSQL is a critical part of the SourceHut infrastructure in terms of
availability, performance, and data integrity.

# Streaming replication

**Goals**

1. Real-time off-site backup of the database
1. Ability to access the database as it appeared in the past
1. Ability to fail over to the secondary in the event of an outage on the
   primary, with minimal application downtime
1. Ability to fail over to the secondary during planned maintenance of the
   primary, with minimal application downtime

**Prerequisites**

✓ PostgreSQL 12.x on primary

✓ PostgreSQL 12.x on secondary

**Nice-to-have**

✗ Alpine 3.12 on primary

✓ Alpine 3.12 on secondary

If we have to restart the primary PostgreSQL daemon for any reason, we should
take the opportunity to upgrade the host to Alpine 3.12.

## Preparing the hot standby

First we'll have to prepare a [warm standby][warm standby] (which can be
promoted to primary at short notice, but does not accept queries), then we can
reconfigure it as a [hot standby][hot standby] (which will accept queries).
Sketches of the primary- and secondary-side configuration follow this list.

1. Set up a user on the primary for the secondary to connect as for streaming
   replication. It must have the REPLICATION and LOGIN privileges, and
   [pg_hba.conf][hba changes] must be updated.
1. Take a [base backup][base backup] of the primary and ship it to the
   secondary. Be advised: we have full_page_writes disabled as a ZFS
   optimization; this may cause performance degradation during the base backup
   process.
1. Fill out `primary_conninfo` on the secondary with the connection details for
   the account created for replication.
1. Configure a replication slot on the primary for the secondary:

        SELECT * FROM pg_create_physical_replication_slot('konpaku');

   Then set `primary_slot_name` to `konpaku` on the secondary.
1. Set `hot_standby` to true on the secondary.
1. [Restore the base backup][restore base backup] on the secondary. It will
   automatically sync with the primary when complete; monitor the logs. TODO:
   What should our `restore_command` be? We want to figure it out in advance.
1. Sanity check the secondary database.
1. Roll back the `restore_command` change so that we aren't surprised by it
   later. Drop the base backup files as well.
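
Here is a minimal sketch of the primary-side half of these steps (1 and 4).
The `replicator` role name, password, and standby address are placeholders,
not decided values; adjust them and the authentication method to match the
real deployment:

    -- On the primary: create a role the secondary can connect as for
    -- streaming replication. Name and password are placeholders.
    CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'changeme';

And the corresponding `pg_hba.conf` entry on the primary:

    # pg_hba.conf on the primary: allow the replication role to connect to
    # the special "replication" pseudo-database from the standby's address
    # (address and auth method are placeholders).
    # Reload the configuration after editing, e.g. SELECT pg_reload_conf();
    host    replication    replicator    203.0.113.10/32    scram-sha-256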
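
And a sketch of the secondary-side steps (2, 3, and 5), assuming the
PostgreSQL 12 file layout; host names, paths, and the password are
placeholders, and the `restore_command` shown is only an illustration since
the TODO above is still open. One option for the base backup is to run
`pg_basebackup` on the secondary, pulling it over the replication connection
rather than shipping an archive by hand:

    # On the secondary, with PostgreSQL stopped and an empty data directory
    # (the path is a guess at the Alpine layout; adjust as needed):
    pg_basebackup -h primary.example.org -U replicator \
        -D /var/lib/postgresql/12/data -X stream --progress

    # PostgreSQL 12 starts in standby mode when this file exists:
    touch /var/lib/postgresql/12/data/standby.signal

Then the standby-specific settings in `postgresql.conf` on the secondary:

    primary_conninfo  = 'host=primary.example.org user=replicator password=changeme'
    primary_slot_name = 'konpaku'
    hot_standby       = on

    # Only relevant if we also replay WAL from an archive; the real command
    # is still an open question above.
    #restore_command = 'cp /var/lib/postgresql/wal-archive/%f "%p"'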
[warm standby]: https://www.postgresql.org/docs/current/warm-standby.html
[hot standby]: https://www.postgresql.org/docs/current/hot-standby.html
[base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-BASE-BACKUP
[restore base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-PITR-RECOVERY
[hba changes]: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION

# Failover procedure (primary & secondary role swap)

TBD

# Database upgrade procedure

TBD

## Minor version upgrades

TBD

## Major version upgrades

TBD

# Accessing past data

TBD

# Changes to make on the next primary reboot

- Set `wal_level` to replica; we are not going to use logical replication. TODO:
  Will this create WAL files which are incompatible with one another?
- Drop the `archive_command` and set `archive_mode` to off; we're going to use
  streaming replication instead of log shipping.

# Open questions

- Would pgbouncer/pgpool be valuable here? It could be used to quickly redirect
  traffic from some or all nodes to one server or another. If so, we should
  probably put it on a dedicated host so that it's robust to host-level failures
  in the primary or secondary servers.
- What is the plan of execution for replacing the server hardware?
- How often are WALs recycled, and how far into the past can we peek?
- Can we/should we detect failure conditions and automate the failover
  procedure?
- What kinds of monitoring and alarms can we/should we set up?
- How can we maintain an ongoing high level of confidence in our system?

It may be possible to *always* service SELECT statements from either the primary
or the secondary. If we eventually require load balancing, these queries could be
eligible. However, queries (or transactions?) which mutate the database may only
be executed on the primary. Therefore:

- If we require more capacity for writes, is vertical scaling our only option?
- In the event of a primary outage, are we able to transparently direct reads
  to the secondary?
- Can application-level changes be implemented which would allow for reduced
  service availability (read-only) in the event of a primary outage?
- If applications cannot write to the database, but can write to other data
  stores (filesystem, redis, etc.), can inconsistent states be produced?
- Can we simulate typical and excessive levels of application traffic against
  the database for mock scenarios?

# Mock contingency scenarios

We need to set up a test network with a primary and a secondary and induce
failure scenarios to learn how our applications behave under those conditions
and to practice the failure resolution steps.

This will require at least the following:

- A couple of sr.ht applications (meta.sr.ht and git.sr.ht would be good)
- A primary and secondary PostgreSQL server

At least three separate VMs are required here, four if we set up a standalone
pgbouncer or pgpool. Then we should play out the following scenarios:

- Promoting the secondary to the primary (see the sketch after this list)
- Shutting down the primary or secondary without preparation
- Accessing the database state at an arbitrary point in the past
- Major and minor version upgrades (pg 11 to 12)
- Host OS upgrades (Alpine 3.11 to 3.12 or edge)
- Causing a ZFS resilver on the primary and monitoring database performance:
  does this trigger a "soft outage" where performance is awful but the database
  is ostensibly online?
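
A rough sketch of the promotion scenario, assuming PostgreSQL 12; the exact
runbook is what these exercises are meant to produce:

    -- On the primary (while it is still up): confirm the standby is attached
    -- and check how far it lags behind.
    SELECT application_name, state, sent_lsn, replay_lsn,
           pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
    FROM pg_stat_replication;

    -- On the secondary: end recovery and promote it to a primary
    -- (pg_promote() is available since PostgreSQL 12; pg_ctl promote also works).
    SELECT pg_promote();

Note that after a promotion the old primary must not accept writes again until
it has been reconfigured as a standby of the new primary (or re-synced),
otherwise the two timelines diverge.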
At each stage of these processes, observe application behavior and record these
observations.
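
For the open question about simulating application traffic, a minimal sketch
using `pgbench` against a scratch database; connection details, scale factor,
and client counts are placeholders, and synthetic pgbench load is only a
stand-in for real application traffic:

    # Initialize a throwaway benchmark database on the primary
    # (scale factor 50 is an arbitrary starting point):
    createdb bench
    pgbench -i -s 50 bench

    # Sustained read/write load while a scenario is played out
    # (16 clients, 4 worker threads, 10 minutes):
    pgbench -c 16 -j 4 -T 600 bench

    # Read-only load (-S), which should also be serviceable by the hot standby:
    pgbench -c 16 -j 4 -T 600 -S -h secondary.example.org bench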