Diffstat (limited to 'ops/availability.md')
-rw-r--r--  ops/availability.md  68
1 files changed, 68 insertions, 0 deletions
diff --git a/ops/availability.md b/ops/availability.md
new file mode 100644
index 0000000..324df99
--- /dev/null
+++ b/ops/availability.md
@@ -0,0 +1,68 @@
+---
+title: High availability
+---
+
+High availability has not been a priority for SourceHut during early alpha
+development, but it is becoming more important heading into the beta. This page
+describes our plans more than our current implementation.
+
+The priorities are, in order:
+
+1. Highly available web services
+2. Highly available database
+3. Highly available mail system
+
+# Web services
+
+The web services are already mostly designed to avoid keeping local state
+around, with this eventual goal in mind. We should investigate load balancing
+with haproxy (or something similar) so that we can bring nodes into and out of
+service without downtime, and make zero-downtime deployments the norm.
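+
+As a rough sketch of what this could look like, a haproxy configuration along
+these lines would balance requests across two web nodes and let us drain one
+for maintenance; the backend name, addresses, and ports are placeholders, not
+our real topology:
+
+```
+# Hypothetical haproxy fragment; names, addresses, and ports are placeholders.
+frontend www
+    bind *:443 ssl crt /etc/ssl/private/srht.pem
+    default_backend srht_web
+
+backend srht_web
+    balance roundrobin
+    option httpchk GET /
+    # "check" health-checks each node; a node can be taken out of rotation
+    # with "set server srht_web/web1 state maint" on the admin socket.
+    server web1 10.0.10.1:5000 check
+    server web2 10.0.10.2:5000 check
+```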
+
+## Special considerations for deployments
+
+- SQL migrations should be designed so that both the old and new versions of a
+  service work correctly against both the old and new schemas. This will often
+  require splitting migrations over several releases (see the sketch below).
+
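+As an illustration of that pattern, a column rename could be split across two
+releases roughly like this (the table and column names are made up):
+
+```
+-- Hypothetical example; "job", "worker_name", and "runner_name" are made up.
+-- Release N: add the new column. Old code ignores it; new code writes both
+-- columns and prefers the new one when reading.
+ALTER TABLE job ADD COLUMN runner_name varchar;
+UPDATE job SET runner_name = worker_name WHERE runner_name IS NULL;
+
+-- Release N+1: once no deployed version still reads the old column, drop it.
+ALTER TABLE job DROP COLUMN worker_name;
+```
+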
+## Special considerations for git.sr.ht and hg.sr.ht
+
+We need to use something like [repospanner](https://github.com/repoSpanner/repoSpanner)
+to distribute git pushes among several nodes.
+
+Can we do something similar for Mercurial?
+
+Related to [backups](/ops/backups.md).
+
+## Special considerations for builds.sr.ht
+
+The builds.sr.ht worker needs to be updated so that we can reboot it without
+terminating anyone's jobs. One idea would be to move the job supervisor into a
+separate process. The difficulty with this is having the new work scheduler
+adopt the orphaned job processes after a restart, and avoiding taking on new
+work from Celery until those resources are freed up.
+
+A possible workaround is to stop accepting new jobs, let the running jobs drain
+while the other build hosts pick up the slack, then reboot and start accepting
+new jobs once more.
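+
+A rough sketch of that drain procedure, assuming the worker consumes from a
+Celery queue named "builds" and that the Celery app can be imported as shown
+(both are assumptions, not the real builds.sr.ht layout):
+
+```
+# Hypothetical drain script for a single build host.
+import socket
+import time
+
+from buildsrht.worker import app  # assumed location of the Celery app
+
+node = "celery@" + socket.gethostname()
+
+# Stop pulling new jobs from the queue on this host only; the other build
+# hosts keep consuming and pick up the slack.
+app.control.cancel_consumer("builds", destination=[node])
+
+# Wait for the jobs that are already running here to finish.
+while True:
+    active = app.control.inspect(destination=[node]).active() or {}
+    if not any(active.values()):
+        break
+    time.sleep(30)
+
+print("drained; safe to reboot")
+```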
+
+# Database
+
+We do not have a concrete plan here yet.
+
+[pgbouncer](https://www.pgbouncer.org/) will probably be of some use. I suspect
+that we will find it difficult to achieve zero-downtime failovers. Ideally, we
+would be able to do PostgreSQL major version upgrades with minimal downtime.
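+
+As a minimal sketch, putting pgbouncer between the services and PostgreSQL
+would make a failover a pgbouncer configuration change rather than a
+reconfiguration of every service; the database name and host below are
+placeholders:
+
+```
+; Hypothetical pgbouncer.ini fragment.
+[databases]
+; Services connect to pgbouncer; only this line changes on failover.
+metasrht = host=db-primary.example.org port=5432 dbname=metasrht
+
+[pgbouncer]
+listen_addr = 127.0.0.1
+listen_port = 6432
+pool_mode = transaction
+```
+
+On the pgbouncer admin console, PAUSE and RESUME can hold client connections
+open while the backend is switched, so a clean failover should show up as a
+brief stall rather than as connection errors. Note that transaction pooling is
+incompatible with session-level features such as LISTEN/NOTIFY, which would
+need to be checked against our services.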
+
+Care will need to be taken to avoid silently dropping writes.
+
+We need to set up a test network for experimenting with these ideas, and then
+make a plan.
+
+# Highly available mail system
+
+This should be fairly trivial. We need to move the work distribution Redis
+server from the mail host to the lists host (duh), and then set up multiple
+MX records. Zero-downtime migrations can be accomplished by removing an MX
+record, letting the queued mail flush to the remaining host, and then doing
+whatever maintenance is necessary.
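+
+For illustration, the zone would carry an MX record for each mail host (the
+host names below are placeholders):
+
+```
+; Hypothetical zone fragment.
+sr.ht.    86400    IN    MX    10 mail-a.sr.ht.
+sr.ht.    86400    IN    MX    10 mail-b.sr.ht.
+```
+
+Sending MTAs retry temporary failures, typically for several days, so even an
+abrupt outage of one host delays mail rather than losing it.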