From 179beebd4b8b0337e380361cef562d72e6e4c23c Mon Sep 17 00:00:00 2001
From: Drew DeVault
Date: Tue, 23 Aug 2022 10:01:04 +0200
Subject: ops/monitoring.md: advice on good alarms

---
 ops/monitoring.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/ops/monitoring.md b/ops/monitoring.md
index ab6ec90..3baba70 100644
--- a/ops/monitoring.md
+++ b/ops/monitoring.md
@@ -43,6 +43,22 @@ Our alerts are configured here:
 
 https://git.sr.ht/~sircmpwn/metrics.sr.ht
 
+## Configuring good alarms
+
+Alarm urgency levels correspond to the appropriate response times during an
+[incident](/ops/incident.md); configure them accordingly. Alarms should not be
+too noisy. Ideally, every alarm that fires requires attention, which reduces
+the risk of [alarm fatigue](https://en.wikipedia.org/wiki/Alarm_fatigue).
+
+Generally we should aim to set up alarms to predict problems before they occur.
+How far in advance should be determined by the lead time on a solution. For
+example, the lead time on securing new hard drives is a few weeks, so "drive
+full" alarms are set, based on the expected growth rate of the filesystem, to
+fire a few weeks before the drives fill up (with a generous margin for error).
+chat.sr.ht has alarms that warn us before the number of users on an IRC
+network exceeds the number of slots allocated to us by that network, with
+sufficient advance notice to coordinate an increase to our allotment.
+
 # Areas for improvement
 
 1. Would be nice to have centralized logging. There is sensitive information in
-- 
cgit
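
The "drive full" reasoning in the patch can be sketched as a small calculation. This is only an illustration, not the actual metrics.sr.ht alert configuration: the names `Filesystem`, `days_until_full`, and `should_alarm`, the linear-growth assumption, and the default margin of 1.5 are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Filesystem:
    size_bytes: float      # total capacity
    used_bytes: float      # current usage
    growth_per_day: float  # assumed linear growth rate, in bytes per day


def days_until_full(fs: Filesystem) -> float:
    """Estimate the days left before the filesystem fills (linear growth)."""
    if fs.growth_per_day <= 0:
        return float("inf")  # not growing, so it never fills
    free = max(fs.size_bytes - fs.used_bytes, 0)
    return free / fs.growth_per_day


def should_alarm(fs: Filesystem, lead_time_days: float, margin: float = 1.5) -> bool:
    """Fire once the time remaining drops to the lead time times a margin."""
    return days_until_full(fs) <= lead_time_days * margin


# Hypothetical numbers: 300 units free, growing by 10 units per day,
# with a three-week lead time on new drives.
fs = Filesystem(size_bytes=1000, used_bytes=700, growth_per_day=10)
print(days_until_full(fs))                   # 30.0
print(should_alarm(fs, lead_time_days=21))   # True, since 30 <= 31.5
```

The margin multiplies the lead time rather than adding a fixed number of days, so a longer procurement delay automatically gets a proportionally larger buffer. A real alert would also need a growth estimate taken from recent history rather than a constant.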