---
title: SourceHut scalability plans
---

Planning ahead for how we can keep up with increasing scale.

The primary near-to-mid-term bottlenecks will be:

- PostgreSQL
- git.sr.ht
- builds.sr.ht
- Network bandwidth

Anything not mentioned here is thought to be operating well within performance
thresholds and scaling at a negligible pace.

# General scaling considerations

## GraphQL

The throughput of our GraphQL backends is almost entirely constrained by the
maximum number of SQL connections. Nearly all blocking requests spend their
time waiting for a connection to free up from the pool; CPU and RAM usage are
negligible.
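
For illustration, here's roughly what that knob looks like with Go's
`database/sql` pool; the driver, DSN, and numbers below are placeholders for
the sake of the sketch, not what the backends actually use:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed driver for this sketch
)

func main() {
	// Hypothetical DSN, not a real connection string.
	db, err := sql.Open("postgres", "postgres://gqlsrht@localhost/gitsrht?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// The pool ceiling is the throughput knob: once MaxOpenConns connections
	// are checked out, further requests block in BeginTx/Conn until one is
	// returned to the pool.
	db.SetMaxOpenConns(16)                  // illustrative value
	db.SetMaxIdleConns(16)                  // keep idle conns to avoid reconnect churn
	db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}
```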

All authentication is routed through meta.sr.ht for token revocation checks,
which uses Redis as the source of truth. This may become a bottleneck in the
future.
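
As a rough sketch of what a Redis-backed revocation check can look like (the
`revocations:<token hash>` key scheme and the go-redis client here are
assumptions for illustration, not meta.sr.ht's actual implementation):

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8" // assumed client library
)

// TokenRevoked checks whether a token hash appears in a hypothetical
// revocation set in Redis. The key scheme is illustrative only.
func TokenRevoked(ctx context.Context, rdb *redis.Client, tokenHash string) (bool, error) {
	n, err := rdb.Exists(ctx, "revocations:"+tokenHash).Result()
	if err != nil {
		return false, err
	}
	return n > 0, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	revoked, err := TokenRevoked(context.Background(), rdb, "deadbeef")
	if err != nil {
		panic(err)
	}
	fmt.Println("revoked:", revoked)
}
```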

## Python

The future Python backend design is going to be pretty thin and mostly
constrained by (1) simultaneous connections and (2) GraphQL throughput.

We'll know more about how to address this after we decide if we're keeping
Python around in the first place.

## Network

Our internet link is fairly cheap bargain shit. This is easy to fix, but it
will be expensive. Defer until we need it; the pricing adjustment for the beta
should take this into consideration.

## Backups

Storage utilization is fine, and easily tuned if necessary. The larger problem
is that borg consumes a lot of CPU on the hosts being backed up. This is
manageable now, but it's a good candidate for future research.

## Web load balancing

We're already designed with load balancing in mind. Balancing HTTP requests
across any number of web servers ought to be trivial. However, horizontal
scaling of web appliances is an expensive optimization, and for the most part
this is being considered with a low number of nodes (i.e. 3) for the purposes of
availability more so than scaling. We should look into other scaling options
before reaching for web load balancing.
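
For the record, the request-balancing part really is trivial; in practice we'd
use an existing load balancer rather than hand-rolled code, but something like
the following (hostnames made up) is all it amounts to:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Hypothetical backend web nodes; names are illustrative only.
	var backends []*url.URL
	for _, raw := range []string{
		"http://web1.internal:5000",
		"http://web2.internal:5000",
		"http://web3.internal:5000",
	} {
		u, err := url.Parse(raw)
		if err != nil {
			log.Fatal(err)
		}
		backends = append(backends, u)
	}

	var next uint64
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			// Plain round-robin; no health checks or sticky sessions.
			target := backends[atomic.AddUint64(&next, 1)%uint64(len(backends))]
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```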

# Domain-specific concerns

## PostgreSQL

Storage is not really an issue, and load avg is consistently <1 even during
usage spikes. The main constraint is RAM; right now we're on 64GiB and using
about half of it.

We can tackle availability and load balancing in one fell swoop. When we need
to scale up further, we should provision two additional PostgreSQL servers to
serve as read-only hot standbys. We can use pgbouncer to direct writable
transactions to the master and load balance read-only transactions across all
of the nodes. If we need to scale writes up, we can take the read-only load
entirely off of the master server and spin up a third standby. The GraphQL
backends are already transaction-oriented and use a read-only transaction where
appropriate, so this would be fairly easy.
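
To make that concrete, here's a minimal sketch of what read/write splitting
could look like at the application layer, building on the existing read-only
transactions; the pool names and DSNs are made up, and this is independent of
whatever pgbouncer ends up handling:

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed driver for this sketch
)

// Router holds one pool pointed at the writable primary and one pointed at a
// read-only standby (or a load-balanced set of them). Names are illustrative.
type Router struct {
	Primary *sql.DB
	Replica *sql.DB
}

// Begin starts the transaction on whichever pool matches its access mode:
// read-only transactions go to the standby, everything else to the primary.
func (r *Router) Begin(ctx context.Context, readOnly bool) (*sql.Tx, error) {
	opts := &sql.TxOptions{ReadOnly: readOnly}
	if readOnly {
		return r.Replica.BeginTx(ctx, opts)
	}
	return r.Primary.BeginTx(ctx, opts)
}

func main() {
	// Hypothetical DSNs.
	primary, err := sql.Open("postgres", "host=primary.example dbname=gitsrht")
	if err != nil {
		log.Fatal(err)
	}
	replica, err := sql.Open("postgres", "host=standby.example dbname=gitsrht")
	if err != nil {
		log.Fatal(err)
	}
	router := &Router{Primary: primary, Replica: replica}

	// A read-only transaction lands on the standby pool.
	tx, err := router.Begin(context.Background(), true)
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()
}
```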

If we need to scale writes horizontally, sharding should be in the cards. I
don't expect us to need that for a long time.

Note: right now we have one hot standby but it serves as a failover and off-site
backup, and is not typically load-bearing. Latency issues to the backup
datacenter would likely make bringing it into normal service a non-starter.

## git.sr.ht

[RepoSpanner](https://github.com/repoSpanner/repoSpanner) may help with git
storage distribution and availability. A bespoke solution would probably also be
pretty straightforward.

Disk utilization is currently growing at about [50G/week][0]. Presently this
represents 5% of the provisioned capacity per week (i.e. roughly 1T provisioned
in total), which is too fast. [Thread here][1] for planning the reprovisioning.

![](https://metrics.sr.ht/chart.svg?title=git.sr.ht%20disk%20utilization%20over%20the%20past%20two%20weeks&query=%28node_filesystem_size_bytes{instance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22}%20-%20node_filesystem_avail_bytes{instance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22}%29%20%2F%20node_filesystem_size_bytes{instance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22}&since=336h&stacked&step=10000&height=3&width=10&max=1)

[0]: https://metrics.sr.ht/graph?g0.range_input=4w&g0.expr=((node_filesystem_size_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D%20-%20node_filesystem_avail_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D)%20%2F%20node_filesystem_size_bytes%7Binstance%3D%22node.git.sr.ht%3A80%22%2Cdevice%3D%22varz%22%7D)%20*%20100&g0.tab=0
[1]: https://lists.sr.ht/~sircmpwn/sr.ht-dev/%3CC5QM8KFLQUHN.2796RCC83HBHA%40homura%3E

## hg.sr.ht

Mercurial has really bad performance. The per-user load of hg.sr.ht is about
10x the per-user load of git.sr.ht, but it sees about 1/10th the usage, so it
more or less balances out. I would like to see some upstream improvements from
the Mercurial team to make hosting hg.sr.ht less expensive.

Generating clonebundles is a unique concern which periodically requires a lot
of CPU.

Storage utilization is growing at a manageable pace, about 0.1%-0.2% per week.

## builds.sr.ht

Watch this chart of build worker load average:

![](https://metrics.sr.ht/chart.svg?title=Build%20worker%20load%20average&query=avg_over_time%28node_load15%7Binstance%3D~%22cirno%5B0-9%5D%2B.sr.ht%3A80%22%7D%5B1h%5D%29&max=64&since=336h&stacked&step=10000&height=3&width=10)