Diffstat (limited to 'ops/availability.md')
-rw-r--r-- | ops/availability.md | 68 |
1 files changed, 68 insertions, 0 deletions
diff --git a/ops/availability.md b/ops/availability.md
new file mode 100644
index 0000000..324df99
--- /dev/null
+++ b/ops/availability.md
@@ -0,0 +1,68 @@

---
title: High availability
---

High availability has not been a priority for SourceHut during early alpha
development, but it is becoming more important as we head into the beta. This
page describes our plans more than our current implementation.

The priorities are, in order:

1. Highly available web services
2. Highly available database
3. Highly available mail system

# Web services

The web services are already mostly designed to avoid keeping local state
around, with this eventual goal in mind. We should investigate load balancing,
likely with haproxy, so that we can bring nodes into and out of service without
downtime; a sketch of a drain-aware health check is included at the end of this
page. Zero-downtime deployments should also become the norm.

## Special considerations for deployments

- SQL migrations should be designed so that both the old and the new code work
  correctly against both the old and the new schema. This will often require
  splitting a migration over several releases; see the two-phase migration
  sketch at the end of this page.

## Special considerations for git.sr.ht, hg.sr.ht

We need to use something like
[repospanner](https://github.com/repoSpanner/repoSpanner) to distribute git
pushes among several nodes.

Can we do something similar for Mercurial?

Related to [backups](/ops/backups.md).

## Special considerations for builds.sr.ht

The builds.sr.ht worker needs to be updated so that we can reboot it without
terminating anyone's jobs. One idea is to move the job supervisor into a
separate process. The difficulty with this is having the new work scheduler
adopt the running job processes after a restart, and avoid taking on new work
from Celery until those resources are freed up.

A possible workaround is to stop accepting new jobs, let the running jobs drain
while other build hosts pick up the slack, then reboot and accept new jobs once
more; see the drain sketch at the end of this page.

# Database

????

[pgbouncer](https://www.pgbouncer.org/) will probably be of some use. I suspect
that we will find it difficult to reach zero-downtime failovers. Ideally, we
would be able to do PostgreSQL major version upgrades with minimal downtime.

Care will need to be taken to avoid silently dropping writes.

We need to set up an experimental test network for trying out these ideas, and
then make a plan.

# Highly available mail system

This should be fairly trivial. We need to move the work distribution Redis
server from the mail host to the lists host (duh), and then set up multiple MX
records. Zero-downtime maintenance can be accomplished by removing an MX
record, letting the queued mail flush, and then doing whatever work is
necessary.
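# Sketches

None of the following is implemented; these are rough sketches of the ideas
above, under stated assumptions, to make the plans more concrete.

## Draining a web node behind haproxy

A minimal sketch of the drain-aware health check mentioned in the web services
section, assuming a Flask service and an haproxy HTTP health check. The
`/health` route and the drain file path are made up for illustration; no sr.ht
service exposes them today.

```python
# Hypothetical health-check endpoint for an sr.ht Flask service. haproxy is
# assumed to poll /health; touching the (made-up) drain file takes the node
# out of rotation so it can be upgraded without dropping requests.
import os

from flask import Flask

app = Flask(__name__)
DRAIN_FILE = "/var/run/srht-drain"  # assumption: operator touches this to drain


@app.route("/health")
def health():
    if os.path.exists(DRAIN_FILE):
        # haproxy marks the backend down after a few failed checks, but
        # requests already in flight still complete.
        return "draining", 503
    return "ok", 200
```

haproxy would point at this with something like `option httpchk GET /health`
on the backend, so creating the drain file removes the node from rotation
after a few failed checks.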
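## Two-phase schema migrations

A sketch of the kind of split migration the deployments section calls for,
written against Alembic, which the sr.ht services already use for schema
migrations. The `display_name`/`username` columns are invented for
illustration. In release N we only add and backfill the new column, so old
code keeps working; only in release N+1, once no old code is deployed, does a
second migration drop the old column.

```python
"""Release N: add the new column alongside the old one (hypothetical example)."""
import sqlalchemy as sa
from alembic import op

# Revision identifiers; placeholders for illustration.
revision = "aaaaaaaaaaaa"
down_revision = None


def upgrade():
    # Nullable, so old code that never writes the column keeps working.
    op.add_column("user",
            sa.Column("display_name", sa.Unicode(256), nullable=True))
    # Backfill so new code can rely on the column being populated.
    op.execute('UPDATE "user" SET display_name = username '
            'WHERE display_name IS NULL')


def downgrade():
    op.drop_column("user", "display_name")
```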
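## Draining a builds.sr.ht worker

A sketch of the "stop accepting new jobs and let them drain" workaround from
the builds.sr.ht section, assuming jobs are dispatched over Celery. The broker
URL, queue name, and worker nodename are placeholders.

```python
# Hypothetical drain script for a builds.sr.ht worker.
import time

from celery import Celery

app = Celery("buildsrht", broker="redis://localhost:6379/0")  # placeholder broker
WORKER = "celery@build-host-1"  # placeholder worker nodename
QUEUE = "builds.sr.ht"          # placeholder queue name

# Stop consuming new jobs on this worker; jobs that are already running are
# unaffected, and other build hosts pick up the slack.
app.control.cancel_consumer(QUEUE, destination=[WORKER])

# Wait for the active jobs to finish before rebooting the host.
while True:
    active = app.control.inspect([WORKER]).active() or {}
    if not any(active.values()):
        break
    time.sleep(30)
```

After the reboot, restarting the worker (or calling
`app.control.add_consumer(QUEUE, destination=[WORKER])`) brings it back into
rotation.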
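## Refusing to write to a standby

Whatever failover tooling we end up with for the database, one cheap safeguard
against silently dropping writes is to refuse to treat a standby as the
primary. A minimal sketch using psycopg2 and PostgreSQL's
`pg_is_in_recovery()`; the DSN is a placeholder.

```python
import psycopg2


def connect_primary(dsn="dbname=metasrht"):  # placeholder DSN
    """Connect and fail loudly if the server turns out to be a standby."""
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        # pg_is_in_recovery() is true on a streaming-replication standby.
        cur.execute("SELECT pg_is_in_recovery()")
        if cur.fetchone()[0]:
            conn.close()
            raise RuntimeError("connected to a standby; refusing to take writes")
    return conn
```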
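## Checking MX records before mail maintenance

A small sanity check for the mail plan, assuming dnspython: before removing an
MX record and taking a host down, confirm that more than one MX record is
still published, so the remaining host keeps accepting mail. The domain is a
placeholder.

```python
import dns.resolver


def published_mx(domain="lists.sr.ht"):  # placeholder domain
    answers = dns.resolver.resolve(domain, "MX")
    return sorted((r.preference, str(r.exchange)) for r in answers)


if __name__ == "__main__":
    mx = published_mx()
    assert len(mx) >= 2, "only one MX left; do not take the remaining host down"
    print(mx)
```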