---
title: "Monitoring & alarms"
---

We monitor everything with Prometheus, and configure alarms with Alertmanager.

# Public metrics

Our Prometheus instance is publicly available at
[metrics.sr.ht](https://metrics.sr.ht).

## Areas for improvement

1. We should make dashboards. They would be pretty to look at and could be a
   useful tool for root cause analysis. Note that some users who have their own
   Grafana instance have pointed it at our public Prometheus data and made some
   simple dashboards - I would be open to having community ownership over this.

# Pushgateway

A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
connections from [our subnet](/ops/topology.md).

# Aggregation gateway

[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
from [our subnet](/ops/topology.md).

# Alertmanager

We use Alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
sinks.

- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and the IRC channel
- **urgent** alerts page Drew's phone, and are sent to the mailing list and the
  IRC channel

Some security-related alarms are sent directly to Drew and are not made public.

# Areas for improvement

1. It would be nice to have centralized logging. There is sensitive information
   in some of our logs, so this probably can't be made public.
2. Several of our physical hosts are not being monitored. This will be resolved
   during [planned maintenance](https://status.sr.ht/issues/2020-03-11-planned-outage/)
   on March 11th, 2020.
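
For reference, the severity tiers listed in the Alertmanager section can be read as a simple lookup from severity to sinks. The following is only a rough Python sketch of that mapping, not our actual Alertmanager configuration: the sink names (`IRC`, `MAILING_LIST`, `PAGER`) and the `sinks_for` helper are hypothetical and exist purely for illustration. The privately routed security alarms are not modelled here.

```python
from __future__ import annotations

# Hypothetical sink identifiers (assumptions for illustration only);
# the real Alertmanager receiver names are not documented on this page.
IRC = "irc:#sr.ht.ops"
MAILING_LIST = "ops-mailing-list"
PAGER = "pager"

# Severity -> sinks, following the three tiers described above.
ROUTES = {
    "interesting": [IRC],
    "important": [MAILING_LIST, IRC],
    "urgent": [PAGER, MAILING_LIST, IRC],
}


def sinks_for(severity: str) -> list[str]:
    """Return the sinks an alert of the given severity is forwarded to."""
    try:
        # Return a copy so callers cannot mutate the routing table.
        return list(ROUTES[severity])
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

For example, `sinks_for("important")` returns the mailing list and the IRC channel, while an unrecognised severity raises `ValueError` rather than being silently dropped.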