From ac753cefd20b269497ffaa8ffe0ca45efd35b3c8 Mon Sep 17 00:00:00 2001
From: Drew DeVault
Date: Fri, 18 Sep 2020 12:06:21 -0400
Subject: Add ops/scale.md

---
 ops/index.md |  1 +
 ops/scale.md | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)
 create mode 100644 ops/scale.md

diff --git a/ops/index.md b/ops/index.md
index 6bcc8b1..dcb9be2 100644
--- a/ops/index.md
+++ b/ops/index.md
@@ -19,6 +19,7 @@ Additional resources:
 - [Network topology](/ops/topology.md)
 - [Provisioning & allocation](/ops/provisioning.md)
 - [PostgreSQL robustness planning](/ops/robust-psql.md)
+- [SourceHut scalability plans](/ops/scale.md)
 
 # Operational Resources

diff --git a/ops/scale.md b/ops/scale.md
new file mode 100644
index 0000000..badfccf
--- /dev/null
+++ b/ops/scale.md
@@ -0,0 +1,89 @@
---
title: SourceHut scalability plans
---

Planning ahead for how we can keep up with increasing scale. The primary
near-to-mid-term bottlenecks will be:

- PostgreSQL
- git.sr.ht
- builds.sr.ht
- Network bandwidth

# General scaling considerations

## GraphQL

The throughput of our GraphQL backends is almost entirely constrained by the
maximum number of SQL connections. Nearly all blocked requests are spending
their time waiting for a connection to free up in the pool. CPU and RAM usage
are negligible.

All authentication is routed through meta.sr.ht for token revocation checks,
using Redis as the source of truth. This may become a bottleneck in the
future.

## Python

The future Python backend design will be fairly thin, and mostly constrained
by (1) simultaneous connections and (2) GraphQL throughput.

We'll know more about how to address this after we decide whether we're
keeping Python around in the first place.

## Network

Our internet link is bargain-bin quality. This is easy to fix, but it will be
expensive. Defer it until we need it; the pricing adjustment for the beta
should take this into consideration.

## Backups

Storage utilization is fine, and easily tuned if necessary. The larger problem
is that borg causes a lot of CPU consumption on the hosts being backed up.
This is manageable now, but it is a good candidate for future research.

# Domain-specific concerns

## PostgreSQL

Storage is not really an issue, and load average is consistently below 1 even
during usage spikes. The main constraint is RAM; right now we're on 64GiB and
using about half of it.

We can tackle availability and load balancing in one fell swoop. When we need
to scale up further, we should provision two additional PostgreSQL servers to
serve as read-only hot standbys. We can use pgbouncer to direct writable
transactions to the master and load-balance read-only transactions across all
of the nodes. If we need to scale writes up, we can take the read-only load
entirely off the master server and spin up a third standby. The GraphQL
backends are already transaction-oriented and use a read-only transaction when
appropriate, so this would be fairly easy (see the sketch below).

Note: right now we have one hot standby, but it serves as a failover and
off-site backup, and is not typically load-bearing. Latency to the backup
datacenter would likely make bringing it into normal service a non-starter.
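As a rough illustration of the read/write split described above (this is not
the actual sr.ht code), here is a minimal sketch in Go using the standard
`database/sql` package. The DSNs, the choice of the `lib/pq` driver, and the
helper names are assumptions for the example: one pool points at the primary
through pgbouncer, the other at a pgbouncer endpoint that balances across the
read-only standbys.

```go
package database

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver; the real backends may use another
)

// DB holds two pools: rw points at the primary (via pgbouncer), ro points at
// a pgbouncer endpoint load-balanced across the hot standbys.
type DB struct {
	rw *sql.DB
	ro *sql.DB
}

// Open connects both pools. The DSNs are placeholders.
func Open(rwDSN, roDSN string) (*DB, error) {
	rw, err := sql.Open("postgres", rwDSN)
	if err != nil {
		return nil, err
	}
	ro, err := sql.Open("postgres", roDSN)
	if err != nil {
		rw.Close()
		return nil, err
	}
	return &DB{rw: rw, ro: ro}, nil
}

// WithTx runs fn in a transaction on the appropriate pool: read-only
// transactions go to the standbys, writable ones to the primary.
func (db *DB) WithTx(ctx context.Context, readOnly bool, fn func(*sql.Tx) error) error {
	pool := db.rw
	if readOnly {
		pool = db.ro
	}
	tx, err := pool.BeginTx(ctx, &sql.TxOptions{ReadOnly: readOnly})
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}
```

With pgbouncer in front, adding a standby would then only mean adding it to
the read-only pool; the application's only job is to pick the right pool for
each transaction.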
## git.sr.ht

[RepoSpanner](https://github.com/repoSpanner/repoSpanner) may help with git
storage distribution and availability. A bespoke solution would probably also
be fairly straightforward.

Disk utilization is currently growing at about [50G/week][0]. At present this
represents 5% of the provisioned capacity per week, which is too fast. See
[this thread][1] for planning the reprovisioning.

[0]: https://metrics.sr.ht/graph?g0.range_input=4w&g0.expr=((node_filesystem_size_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D%20-%20node_filesystem_avail_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D)%20%2F%20node_filesystem_size_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D)%20*%20100&g0.tab=0
[1]: https://lists.sr.ht/~sircmpwn/sr.ht-dev/%3CC5QM8KFLQUHN.2796RCC83HBHA%40homura%3E

## hg.sr.ht

Mercurial's performance is poor. The per-user load on hg.sr.ht is about 10x
the per-user load on git.sr.ht, but it sees about 1/10th the usage, so it more
or less balances out. I would like to see some upstream improvements from the
Mercurial team to make hosting hg.sr.ht less expensive.

Generating clonebundles is a unique concern: it periodically requires a large
amount of CPU (see the sketch below).

Storage utilization is growing at a manageable pace, about 0.1%-0.2%/week.
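To make the clonebundle concern above more concrete, here is a hypothetical
sketch (in Go, to match the example above) of a periodic regeneration job.
This is not the hg.sr.ht implementation: the paths, bundle type, manifest
details, and public URL are assumptions, and the job simply shells out to
`hg bundle`.

```go
package clonebundle

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// Regenerate produces a full bundle for one repository and points its
// .hg/clonebundles.manifest at the result. repoPath, bundleDir, and
// publicURL are placeholders.
func Regenerate(repoPath, bundleDir, publicURL string) error {
	name := filepath.Base(repoPath) + ".hg"
	bundle := filepath.Join(bundleDir, name)

	// `hg bundle --all` repacks the entire repository; this is the
	// periodic CPU-heavy step mentioned above.
	cmd := exec.Command("hg", "--repository", repoPath,
		"bundle", "--all", "--type", "gzip-v2", bundle)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return err
	}

	// Advertise the bundle to clients via the clonebundles manifest.
	manifest := fmt.Sprintf("%s/%s BUNDLESPEC=gzip-v2\n", publicURL, name)
	return os.WriteFile(
		filepath.Join(repoPath, ".hg", "clonebundles.manifest"),
		[]byte(manifest), 0644)
}
```

Because the bundles are static files, the expensive step could be scheduled
off-peak or run on a machine that is not serving clones, which would keep the
periodic CPU spikes away from the hg.sr.ht host.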