---
title: "Monitoring & alarms"
---

We monitor everything with Prometheus, and configure alarms with alertmanager.

# Public metrics

Our Prometheus instance is publicly available at [metrics.sr.ht](https://metrics.sr.ht).

## Areas for improvement

1. We should make dashboards. They would be pretty to look at and could be a useful tool for root cause analysis. Note that some users who have their own Grafana instance have pointed it at our public Prometheus data and made some simple dashboards; I would be open to having community ownership over this.

# Pushgateway

A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept connections from [our subnet](/ops/topology.md).

# Aggregation gateway

[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway) is running at aggr.metrics.sr.ht. It's firewalled to only accept connections from [our subnet](/ops/topology.md).

# Alertmanager

We use alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various sinks.

- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and the IRC channel
- **urgent** alerts page Drew's phone and are sent to the mailing list and the IRC channel

Some security-related alarms are sent directly to Drew and are not made public.

Our alerts are configured here:

https://git.sr.ht/~sircmpwn/metrics.sr.ht

## Configuring good alarms

Alarm urgency levels correspond to the appropriate response times during an [incident](/ops/incident.md); configure them accordingly. Alarms should not be too noisy. Ideally, every alarm should require attention, which reduces the risk of [alarm fatigue](https://en.wikipedia.org/wiki/Alarm_fatigue).

Generally we should aim to set up alarms to predict problems before they occur. How far in advance should be determined by the lead time on a solution.
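
One way to reason about the lead-time rule, as a rough Python sketch rather than our actual alert configuration: estimate when a resource runs out at its current growth rate, and fire once that point falls inside the lead time plus a safety margin. The function names, the linear-growth assumption, and the numbers are all hypothetical.

```python
import math


def days_until_exhausted(used: float, capacity: float, growth_per_day: float) -> float:
    """Days until `used` reaches `capacity`, assuming linear growth (an assumption)."""
    if used >= capacity:
        return 0.0
    if growth_per_day <= 0:
        return math.inf  # not growing, so it is never projected to run out
    return (capacity - used) / growth_per_day


def should_alarm(used, capacity, growth_per_day, lead_time_days, margin_days=0.0):
    """Fire once the projected exhaustion falls within lead time plus margin."""
    remaining = days_until_exhausted(used, capacity, growth_per_day)
    return remaining <= lead_time_days + margin_days


# Hypothetical numbers: an 8000 GB volume growing 20 GB/day,
# with a 21-day lead time and a 14-day margin for error.
print(should_alarm(7000, 8000, 20, lead_time_days=21, margin_days=14))  # 50 days left
print(should_alarm(7400, 8000, 20, lead_time_days=21, margin_days=14))  # 30 days left
```
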
For example, the lead time on securing new hard drives is a few weeks, so "drive full" alarms are planned out based on the expected growth rate of the filesystem, firing a few weeks before the drive will be full (with a generous margin for error). chat.sr.ht has alarms that warn when the number of users on an IRC network is on track to exceed the number of slots allocated to us by that network, with sufficient advance notice to coordinate an increase to our allotment.

# Areas for improvement

1. It would be nice to have centralized logging. There is sensitive information in some of our logs, so this probably can't be made public.