author: Drew DeVault <sir@cmpwn.com> 2022-08-23 10:01:04 +0200
committer: Drew DeVault <sir@cmpwn.com> 2022-08-23 10:01:04 +0200
commit: 179beebd4b8b0337e380361cef562d72e6e4c23c (patch)
tree: b0f91d29f5a46c262ffff57383b24ce55a2e95b2
parent: 5f77aa260acf3b6082fb6dbac02e9e26b65a5b09 (diff)
ops/monitoring.md: advice on good alarms
-rw-r--r-- ops/monitoring.md 16
1 files changed, 16 insertions, 0 deletions
diff --git a/ops/monitoring.md b/ops/monitoring.md
index ab6ec90..3baba70 100644
--- a/ops/monitoring.md
+++ b/ops/monitoring.md
@@ -43,6 +43,22 @@ Our alerts are configured here:
https://git.sr.ht/~sircmpwn/metrics.sr.ht
+## Configuring good alarms
+
+Alarm urgency levels correspond to the appropriate response times during an
+[incident](/ops/incident.md); configure them accordingly. Alarms should not be
+too noisy; ideally, every alarm that fires should require attention. This
+reduces the risk of [alarm fatigue](https://en.wikipedia.org/wiki/Alarm_fatigue).
+
+Generally, we should aim to set up alarms that predict problems before they
+occur. How far in advance should be determined by the lead time on a solution.
+For example, the lead time on securing new hard drives is a few weeks, so "drive
+full" alarms are planned, based on the expected growth rate of the filesystem,
+to fire a few weeks before the drives will be full (with a generous margin for
+error). chat.sr.ht has alarms for when our number of users on an IRC network
+is on track to exceed the number of slots allocated to us by that network, with
+sufficient advance notice to coordinate an increase to our allotment.
+
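The "drive full" reasoning in the added section can be sketched in a few lines of Python. This is only an illustration, not how the alarms are actually configured: the `Filesystem` type, the function names, the linear growth assumption, and the 1.5x margin are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Filesystem:
    """Hypothetical snapshot of a filesystem; all fields are illustrative."""
    capacity_gib: float
    used_gib: float
    growth_gib_per_day: float  # assumed to be roughly linear


def days_until_full(fs: Filesystem) -> float:
    """Estimate the days remaining before the filesystem fills up."""
    if fs.growth_gib_per_day <= 0:
        # Not growing (or shrinking): it never fills under this model.
        return float("inf")
    free_gib = max(fs.capacity_gib - fs.used_gib, 0.0)
    return free_gib / fs.growth_gib_per_day


def should_alarm(fs: Filesystem, lead_time_days: float, margin: float = 1.5) -> bool:
    """Fire once the remaining time drops below the lead time plus a margin.

    lead_time_days is how long the fix takes (e.g. securing new drives);
    margin is an assumed safety factor for errors in the growth estimate.
    """
    return days_until_full(fs) <= lead_time_days * margin
```

For instance, with a three-week lead time, a 1000 GiB filesystem growing by 5 GiB per day would not alarm at 700 GiB used (60 days left) but would at 850 GiB used (30 days left, inside the 31.5-day window).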
# Areas for improvement
1. Would be nice to have centralized logging. There is sensitive information in