---
title: "Monitoring & alarms"
---
We monitor everything with Prometheus, and configure alarms with alertmanager.
# Public metrics
Our Prometheus instance is publicly available at
[metrics.sr.ht](https://metrics.sr.ht).
## Areas for improvement
1. We should make dashboards. They would be pretty to look at and could be a
useful tool for root cause analysis. Note that some users who have their own
Grafana instance have pointed it at our public Prometheus data and made some
simple dashboards. I would be open to having community ownership over this.
# Pushgateway
A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
connections from [our subnet](/ops/topology.md).
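
As a rough illustration of how a batch job might report to the pushgateway, here is a minimal sketch using only the Python standard library. It builds a request in the Prometheus text exposition format against the Pushgateway's `/metrics/job/<job>` endpoint. The function name, the job name, the metric name, and the `http://` scheme in the usage comment are assumptions made for illustration; they are not taken from our actual jobs.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(base_url, job, gauges):
    """Build (but do not send) a Pushgateway request for a dict of gauges.

    base_url: e.g. "http://push.metrics.sr.ht" (scheme is an assumption)
    job:      hypothetical job name used as the grouping key
    gauges:   mapping of metric name -> numeric value
    """
    lines = []
    for name, value in sorted(gauges.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The text exposition format requires a trailing newline.
    body = ("\n".join(lines) + "\n").encode("utf-8")
    url = f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    # PUT replaces all metrics previously pushed under this grouping key.
    return urllib.request.Request(url, data=body, method="PUT")


# Usage (only works from inside the firewalled subnet):
#   req = build_push_request("http://push.metrics.sr.ht", "example_job",
#                            {"example_last_success_timestamp_seconds": 1700000000})
#   urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it keeps the payload easy to inspect without network access.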
# Aggregation gateway
[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
from [our subnet](/ops/topology.md).
# Alertmanager
We use alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
sinks.
- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and the IRC channel
- **urgent** alerts page Drew's phone and are sent to the mailing list and the
IRC channel
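
The escalation above can be read as a simple lookup from urgency level to sinks, where each level adds a sink to the one below it. The sketch below restates that mapping in Python. The sink labels (`irc`, `mailing-list`, `pager`) and the function name are hypothetical and are not the identifiers used in the real alertmanager configuration.

```python
# Hypothetical sink labels; each urgency level adds a sink to the level below it.
SINKS_BY_URGENCY = {
    "interesting": ["irc"],
    "important": ["mailing-list", "irc"],
    "urgent": ["pager", "mailing-list", "irc"],
}


def sinks_for(urgency):
    """Return the list of sinks an alert of the given urgency is sent to."""
    try:
        return SINKS_BY_URGENCY[urgency]
    except KeyError:
        raise ValueError(f"unknown urgency level: {urgency!r}") from None
```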
Some security-related alarms are sent directly to Drew and are not made public.
Our alerts are configured here:
https://git.sr.ht/~sircmpwn/metrics.sr.ht
## Configuring good alarms
Alarm urgency levels correspond to the appropriate response times during an
[incident](/ops/incident.md); configure them accordingly. Alarms should not be
too noisy: ideally, every alarm that fires should require attention, which
reduces the risk of [alarm fatigue](https://en.wikipedia.org/wiki/Alarm_fatigue).
Generally, we should aim to set up alarms that predict problems before they
occur. How far in advance is determined by the lead time on a solution. For
example, the lead time on securing new hard drives is a few weeks, so "drive
full" alarms are planned, based on the expected growth rate of the filesystem,
to fire a few weeks before the drive will be full (with a generous margin for
error). Similarly, chat.sr.ht has alarms for when the number of users on an IRC
network approaches the number of slots allocated to us by that network, with
sufficient advance notice to coordinate an increase to our allotment.
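
The lead-time reasoning can be sketched as a small calculation: estimate how many days remain before the filesystem fills at its current growth rate, and fire once that estimate drops below the lead time multiplied by a safety margin. This is a sketch under assumptions, not our actual rule. The function names, the linear growth model, the 21-day lead time, and the 1.5x margin are all illustrative.

```python
def days_until_full(capacity_bytes, used_bytes, growth_bytes_per_day):
    """Estimate days until the filesystem is full, assuming linear growth."""
    if growth_bytes_per_day <= 0:
        # Not growing (or shrinking): it will never fill at this rate.
        return float("inf")
    free = max(capacity_bytes - used_bytes, 0)
    return free / growth_bytes_per_day


def should_alarm(capacity_bytes, used_bytes, growth_bytes_per_day,
                 lead_time_days=21, margin=1.5):
    """Fire when the time remaining is within lead time times a safety margin.

    lead_time_days and margin are assumed values for illustration only.
    """
    remaining = days_until_full(capacity_bytes, used_bytes, growth_bytes_per_day)
    return remaining <= lead_time_days * margin
```

With these illustrative defaults the alarm fires once roughly 31 days of headroom remain. Prometheus's `predict_linear` function performs a comparable linear extrapolation inside a query; whether our real rules use it is not stated here.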
# Areas for improvement
1. It would be nice to have centralized logging. There is sensitive information
in some of our logs, so this probably can't be made public.