---
title: "Monitoring & alarms"
---

We monitor everything with Prometheus and configure alarms with alertmanager.

# Public metrics

Our Prometheus instance is publicly available at
[metrics.sr.ht](https://metrics.sr.ht).
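
Anyone can query it with standard Prometheus tooling, for example through the
HTTP query API (a minimal sketch using the built-in `up` metric):

```sh
# Ask the public instance which scrape targets are currently up.
curl -s 'https://metrics.sr.ht/api/v1/query?query=up'
```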

## Areas for improvement

1. We should make dashboards. They would be pretty to look at and could be a
   useful tool for root cause analysis. Note that some users who run their own
   Grafana instances have pointed them at our public Prometheus data and made
   some simple dashboards; I would be open to community ownership over this.

# Pushgateway

A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
connections from [our subnet](/ops/topology.md).
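
Services inside the subnet can push metrics to it with a plain HTTP request.
For example (a sketch using the standard Pushgateway API; the job name and
metric are hypothetical):

```sh
# Record the completion time of a hypothetical backup job. The
# Pushgateway retains the metric until it is replaced or deleted.
cat <<EOF | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/backup
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds $(date +%s)
EOF
```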

# Aggregation gateway

[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
from [our subnet](/ops/topology.md).
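
Unlike the Pushgateway, which keeps only the most recent value per metric, the
aggregation gateway sums what it receives, which suits counters pushed from
many short-lived processes. A sketch (the push path is an assumption and may
vary between versions of prom-aggregation-gateway):

```sh
# Two pushes of the same counter are summed; a scrape of the
# gateway then reports api_requests_total 3.
echo 'api_requests_total 1' | curl --data-binary @- https://aggr.metrics.sr.ht/metrics/job/api
echo 'api_requests_total 2' | curl --data-binary @- https://aggr.metrics.sr.ht/metrics/job/api
```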

# Alertmanager

We use alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
sinks.

- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and to the IRC channel
- **urgent** alerts page Drew's phone and are also sent to the mailing list and
  the IRC channel

Some security-related alarms are sent directly to Drew and are not made public.

Our alerts are configured here:

https://git.sr.ht/~sircmpwn/metrics.sr.ht
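
For illustration, a severity-based route in the alertmanager configuration
might look something like this (a sketch only; the receiver names, list
address, and IRC webhook bridge are hypothetical, and global SMTP settings are
omitted):

```yaml
route:
  receiver: irc                 # interesting: IRC only, by default
  routes:
    - match:
        severity: important
      receiver: ops-irc         # mailing list + IRC
    - match:
        severity: urgent
      receiver: page-ops-irc    # pager + mailing list + IRC

receivers:
  - name: irc
    webhook_configs:
      - url: http://localhost:8000/notify   # hypothetical IRC bridge
  - name: ops-irc
    email_configs:
      - to: sr.ht-ops@lists.sr.ht           # hypothetical list address
    webhook_configs:
      - url: http://localhost:8000/notify
  - name: page-ops-irc
    # would additionally include a paging integration,
    # e.g. pagerduty_configs
    email_configs:
      - to: sr.ht-ops@lists.sr.ht
    webhook_configs:
      - url: http://localhost:8000/notify
```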

## Configuring good alarms

Alarm urgency levels correspond to the appropriate response times during an
[incident](/ops/incident.md); configure them accordingly. Alarms should not be
too noisy: ideally, every alarm that fires requires attention, which reduces
the risk of [alarm fatigue](https://en.wikipedia.org/wiki/Alarm_fatigue).

Generally, we should aim to set up alarms which predict problems before they
occur. How far in advance is determined by the lead time on a solution. For
example, the lead time on securing new hard drives is a few weeks, so "drive
full" alarms are tuned, based on the expected growth rate of the filesystem, to
fire a few weeks before the drives actually fill (with a generous margin for
error). Similarly, chat.sr.ht has alarms which fire when the number of users on
an IRC network approaches the number of connection slots that network has
allocated to us, with enough advance notice to coordinate an increase in our
allotment.
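
For instance, a predictive "drive full" rule can use PromQL's `predict_linear`
over recent growth (a sketch assuming node_exporter's
`node_filesystem_avail_bytes` and a hypothetical three-week lead time):

```yaml
groups:
  - name: storage
    rules:
      - alert: FilesystemFillingUp
        # Extrapolate the last day of growth; fire if the filesystem
        # is predicted to be full within three weeks.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[1d], 21 * 86400) < 0
        for: 1h
        labels:
          severity: important
        annotations:
          summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} predicted full within 3 weeks"
```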

# Areas for improvement

1. It would be nice to have centralized logging. There is sensitive information
   in some of our logs, so it probably can't be made public.