---
title: High availability
---

High availability has not been a priority for SourceHut during early alpha
development, but it is becoming more important as we head into the beta. This
page describes our plans more than our current implementation.

The priorities are, in order:

1. Highly available web services
2. Highly available database
3. Highly available mail system

# Web services

The web services are already mostly designed to avoid keeping local state
around, with this eventual goal in mind. We should investigate load balancing
(likely with haproxy) so that we can bring nodes into and out of service
without downtime, and make this the norm for deployments.
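A minimal sketch of what this could look like with haproxy (hostnames, ports,
and certificate paths below are illustrative, not our actual topology):

```
# haproxy.cfg sketch -- two web nodes behind one frontend
global
    # admin socket lets us drain nodes at runtime
    stats socket /run/haproxy/admin.sock mode 660 level admin

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:443 ssl crt /etc/ssl/srht.pem
    default_backend websrht

backend websrht
    balance roundrobin
    option httpchk GET /
    server node1 10.0.0.1:8000 check
    server node2 10.0.0.2:8000 check
```

To take a node out of service without dropping in-flight requests, haproxy's
runtime API supports draining, e.g.
`echo "set server websrht/node1 state drain" | socat stdio /run/haproxy/admin.sock`,
then deploy to that node and set its state back to `ready`.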

## Special considerations for deployments

- SQL migrations should be designed so that both the old and new systems work
  correctly on both the old and new schemas. This will often require splitting
  migrations over several releases.
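As a sketch of what splitting a migration looks like, consider renaming a
column (table and column names here are hypothetical): the first release adds
the new column alongside the old one, and only a later release, once no nodes
are running the old code, drops the original.

```sql
-- Release N: add the new column and backfill it. Old code keeps
-- reading/writing user_id; new code writes both columns.
ALTER TABLE job ADD COLUMN owner_id integer;
UPDATE job SET owner_id = user_id;

-- Release N+1: every node now runs code that only uses owner_id,
-- so the old column can be dropped safely.
ALTER TABLE job DROP COLUMN user_id;
```

Both schema versions work with both code versions at every step, so nodes can
be upgraded one at a time.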

## Special considerations for git.sr.ht, hg.sr.ht

We need to use something like [repospanner](https://github.com/repoSpanner/repoSpanner)
to distribute git pushes among several nodes.

Can we do something similar for Mercurial?

Related to [backups](/ops/backups.md).

## Special considerations for builds.sr.ht

The builds.sr.ht worker needs to be updated so that we can reboot it without
terminating anyone's jobs. One idea would be to move the job supervisor into a
separate process. The difficulty with this is having the new work scheduler
adopt running job processes after a restart, and avoiding taking on new work
from Celery until the resources are freed up.

A possible workaround is to stop accepting new jobs, let the running jobs
drain while other build hosts pick up the slack, then reboot and accept new
jobs once more.
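With the current Celery-based worker, that drain sequence could be sketched
with Celery's remote control commands (the app module, queue name, and worker
name below are assumptions for illustration):

```
# Stop this worker from taking new jobs off the queue
celery -A buildsrht.worker control cancel_consumer builds \
    --destination worker1@builds

# Poll until the list of active jobs on this worker is empty
celery -A buildsrht.worker inspect active --destination worker1@builds

# ...reboot, then resume consuming from the queue
celery -A buildsrht.worker control add_consumer builds \
    --destination worker1@builds
```

`cancel_consumer`/`add_consumer` only affect queue consumption, so jobs already
running on the host are left alone while it drains.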

# Database

We do not have a concrete plan here yet.

[pgbouncer](https://www.pgbouncer.org/) will probably be of some use. I suspect
that we will find it difficult to reach zero-downtime failovers. Ideally, we
would be able to do PostgreSQL major version upgrades with minimal downtime.

Care will need to be taken to avoid silently dropping writes.
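pgbouncer helps with this part: its admin console can hold client connections
open during a switchover rather than erroring them out. A sketch, with
illustrative host and database names (this alone does not make failover
zero-downtime, but it avoids dropping writes during a brief planned switch):

```
; pgbouncer.ini sketch
[databases]
srht = host=10.0.0.10 dbname=srht

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = postgres
```

During a planned switchover: issue `PAUSE;` on the admin console (waits for
in-flight transactions to finish and queues new clients), edit the
`[databases]` entry to point at the new primary, then `RELOAD;` and `RESUME;`.
Clients see a stall rather than an error.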

We need to set up an experimental test network for testing out these ideas, and
make a plan.

# Mail system

This should be fairly trivial. We need to move the work distribution Redis
server from the mail host to the lists host (duh), and then just set up multiple
MX records. Zero-downtime migrations can be accomplished by removing an MX
record, letting the mail flush, and then doing whatever maintenance is
necessary.
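The MX side of this might look like the following zone fragment (hostnames and
TTLs are illustrative):

```
; equal-preference MX records; senders distribute across both hosts
sr.ht.  3600  IN  MX  10 mx1.sr.ht.
sr.ht.  3600  IN  MX  10 mx2.sr.ht.
```

For maintenance, remove one record, wait out the TTL, and let that host's
queue flush. Even during an unplanned outage of one host, well-behaved senders
retry the remaining MX (or queue and retry later), so mail is delayed rather
than lost.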