From ac753cefd20b269497ffaa8ffe0ca45efd35b3c8 Mon Sep 17 00:00:00 2001
From: Drew DeVault
Date: Fri, 18 Sep 2020 12:06:21 -0400
Subject: Add ops/scale.md

---
 ops/index.md |  1 +
 ops/scale.md | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)
 create mode 100644 ops/scale.md

diff --git a/ops/index.md b/ops/index.md
index 6bcc8b1..dcb9be2 100644
--- a/ops/index.md
+++ b/ops/index.md
@@ -19,6 +19,7 @@ Additional resources:
 - [Network topology](/ops/topology.md)
 - [Provisioning & allocation](/ops/provisioning.md)
 - [PostgreSQL robustness planning](/ops/robust-psql.md)
+- [SourceHut scalability plans](/ops/scale.md)
 
 # Operational Resources

diff --git a/ops/scale.md b/ops/scale.md
new file mode 100644
index 0000000..badfccf
--- /dev/null
+++ b/ops/scale.md
@@ -0,0 +1,89 @@
---
title: SourceHut scalability plans
---

Planning ahead for how we can keep up with increasing scale. The primary
near-to-mid-term bottlenecks will be:

- PostgreSQL
- git.sr.ht
- builds.sr.ht
- Network bandwidth

# General scaling considerations

## GraphQL

The throughput of our GraphQL backends is almost entirely constrained by the
maximum number of SQL connections. Nearly all blocked requests are spending
their time waiting for a connection to free up in the pool. CPU and RAM usage
are negligible.

All authentication is routed through meta.sr.ht for token revocation checks,
using Redis as the source of truth. This may become a bottleneck in the
future.

## Python

The future Python backend design will be fairly thin, and mostly constrained
by (1) simultaneous connections and (2) GraphQL throughput.

We'll know more about how to address this after we decide whether we're
keeping Python around in the first place.

## Network

Our internet link is bargain-bin quality. This is easy to fix, but it will be
expensive. Defer it until we need it; the pricing adjustment for the beta
should take this into consideration.

## Backups

Storage utilization is fine, and easily tuned if necessary. The larger problem
is that borg causes a lot of CPU consumption on the hosts being backed up.
This is manageable now, but it is a good candidate for future research.

# Domain-specific concerns

## PostgreSQL

Storage is not really an issue, and load average is consistently below 1 even
during usage spikes. The main constraint is RAM; right now we're on 64GiB and
using about half of it.

We can tackle availability and load balancing in one fell swoop. When we need
to scale up further, we should provision two additional PostgreSQL servers to
serve as read-only hot standbys. We can use pgbouncer to direct writable
transactions to the master and load-balance read-only transactions across all
of the nodes. If we need to scale writes up, we can take the read-only load
entirely off the master server and spin up a third standby. The GraphQL
backends are already transaction-oriented and use a read-only transaction when
appropriate, so this would be fairly easy (see the sketch below).

Note: right now we have one hot standby, but it serves as a failover and
off-site backup, and is not typically load-bearing. Latency to the backup
datacenter would likely make bringing it into normal service a non-starter.
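As a rough illustration of the read/write split described above (this is not
the actual sr.ht code), here is a minimal sketch in Go using the standard
`database/sql` package. The DSNs, the choice of the `lib/pq` driver, and the
helper names are assumptions for the example: one pool points at the primary
through pgbouncer, the other at a pgbouncer endpoint that balances across the
read-only standbys.

```go
package database

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver; the real backends may use another
)

// DB holds two pools: rw points at the primary (via pgbouncer), ro points at
// a pgbouncer endpoint load-balanced across the hot standbys.
type DB struct {
	rw *sql.DB
	ro *sql.DB
}

// Open connects both pools. The DSNs are placeholders.
func Open(rwDSN, roDSN string) (*DB, error) {
	rw, err := sql.Open("postgres", rwDSN)
	if err != nil {
		return nil, err
	}
	ro, err := sql.Open("postgres", roDSN)
	if err != nil {
		rw.Close()
		return nil, err
	}
	return &DB{rw: rw, ro: ro}, nil
}

// WithTx runs fn in a transaction on the appropriate pool: read-only
// transactions go to the standbys, writable ones to the primary.
func (db *DB) WithTx(ctx context.Context, readOnly bool, fn func(*sql.Tx) error) error {
	pool := db.rw
	if readOnly {
		pool = db.ro
	}
	tx, err := pool.BeginTx(ctx, &sql.TxOptions{ReadOnly: readOnly})
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}
```

With pgbouncer in front, adding a standby would then only mean adding it to
the read-only pool; the application's only job is to pick the right pool for
each transaction.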
## git.sr.ht

[RepoSpanner](https://github.com/repoSpanner/repoSpanner) may help with git
storage distribution and availability. A bespoke solution would probably also
be fairly straightforward.

Disk utilization is currently growing at about [50G/week][0]. At present this
represents 5% of the provisioned capacity per week, which is too fast. See
[this thread][1] for planning the reprovisioning.

[0]: https://metrics.sr.ht/graph?g0.range_input=4w&g0.expr=((node_filesystem_size_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D%20-%20node_filesystem_avail_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D)%20%2F%20node_filesystem_size_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D)%20*%20100&g0.tab=0
[1]: https://lists.sr.ht/~sircmpwn/sr.ht-dev/%3CC5QM8KFLQUHN.2796RCC83HBHA%40homura%3E

## hg.sr.ht

Mercurial's performance is poor. The per-user load on hg.sr.ht is about 10x
the per-user load on git.sr.ht, but it sees about 1/10th the usage, so it more
or less balances out. I would like to see some upstream improvements from the
Mercurial team to make hosting hg.sr.ht less expensive.

Generating clonebundles is a unique concern: it periodically requires a large
amount of CPU (see the sketch below).

Storage utilization is growing at a manageable pace, about 0.1%-0.2%/week.
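To make the clonebundle concern above more concrete, here is a hypothetical
sketch (in Go, to match the example above) of a periodic regeneration job.
This is not the hg.sr.ht implementation: the paths, bundle type, manifest
details, and public URL are assumptions, and the job simply shells out to
`hg bundle`.

```go
package clonebundle

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// Regenerate produces a full bundle for one repository and points its
// .hg/clonebundles.manifest at the result. repoPath, bundleDir, and
// publicURL are placeholders.
func Regenerate(repoPath, bundleDir, publicURL string) error {
	name := filepath.Base(repoPath) + ".hg"
	bundle := filepath.Join(bundleDir, name)

	// `hg bundle --all` repacks the entire repository; this is the
	// periodic CPU-heavy step mentioned above.
	cmd := exec.Command("hg", "--repository", repoPath,
		"bundle", "--all", "--type", "gzip-v2", bundle)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return err
	}

	// Advertise the bundle to clients via the clonebundles manifest.
	manifest := fmt.Sprintf("%s/%s BUNDLESPEC=gzip-v2\n", publicURL, name)
	return os.WriteFile(
		filepath.Join(repoPath, ".hg", "clonebundles.manifest"),
		[]byte(manifest), 0644)
}
```

Because the bundles are static files, the expensive step could be scheduled
off-peak or run on a machine that is not serving clones, which would keep the
periodic CPU spikes away from the hg.sr.ht host.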