---
title: "Monitoring & alarms"
---
We monitor everything with Prometheus and configure alarms with Alertmanager.
# Public metrics
Our Prometheus instance is publicly available at
[metrics.sr.ht](https://metrics.sr.ht).
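
Because the instance is public, it can be queried with the standard Prometheus HTTP API. The following is a minimal sketch, assuming the instance exposes the usual `/api/v1/query` endpoint at its root; the helper names are hypothetical and the `up` query is only an example.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://metrics.sr.ht"  # public instance named above


def instant_query_url(promql: str, base_url: str = BASE_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict:
    """Map each series' sorted label pairs to its current value.

    Only handles instant-vector results; anything else raises ValueError.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    return {
        tuple(sorted(series["metric"].items())): float(series["value"][1])
        for series in data["result"]
    }
```

To run a real query, fetch the URL with `urllib.request.urlopen(instant_query_url("up"))` and pass the decoded response body to `parse_vector`.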
## Areas for improvement
1. We should make dashboards. It would be pretty to look at and could be a
useful tool for root cause analysis. Note that some users who have their own
Grafana instance have pointed it at our public Prometheus data and made some
simple dashboards - I would be open to having community ownership over this.
# Pushgateway
A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
connections from [our subnet](/ops/topology.md).
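
A push to a Pushgateway is an HTTP request whose body is in the Prometheus text exposition format. The sketch below only prepares such a request, without sending it. The URL scheme, the job and instance names, and the metric name are assumptions made for illustration.

```python
from urllib.parse import quote
from urllib.request import Request

# Host from the text above; the https scheme and default port are assumptions.
PUSHGATEWAY = "https://push.metrics.sr.ht"


def build_push(job: str, metrics: dict, instance: str = "") -> Request:
    """Prepare (but do not send) a push of untyped samples for one job.

    Assumes the job and instance names contain no "/" characters.
    """
    path = f"/metrics/job/{quote(job, safe='')}"
    if instance:
        path += f"/instance/{quote(instance, safe='')}"
    # Text exposition format: one "name value" sample per line,
    # with a trailing newline after the last sample.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    # PUT replaces every metric in this job/instance group.
    return Request(PUSHGATEWAY + path, data=body.encode(), method="PUT")
```

From a host inside the subnet, the prepared request could be sent with `urllib.request.urlopen(build_push("nightly_backup", {"backup_last_success_timestamp_seconds": 1583000000}))`.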
# Aggregation gateway
[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
from [our subnet](/ops/topology.md).
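
Unlike the Pushgateway, which keeps the last pushed value per group, the aggregation gateway adds up samples that share a metric name and label set. The sketch below illustrates that idea only; it is not the gateway's code, and the data layout (a dict keyed by name and label pairs) is an assumption chosen for brevity.

```python
from collections import Counter


def aggregate(pushes):
    """Sum samples that share a metric name and label set across pushes.

    Each push is a dict mapping (name, labels) to a numeric value, where
    labels is a tuple of (key, value) pairs.
    """
    totals = Counter()
    for push in pushes:
        for key, value in push.items():
            totals[key] += value
    return dict(totals)
```

For example, two short-lived workers that each push `requests_total` with the same labels would be reported as a single series holding the sum of both pushes.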
# Alertmanager
We use Alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
sinks.
- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and the IRC channel
- **urgent** alerts page Drew's phone and are sent to the mailing list and the
IRC channel
Some security-related alarms are sent directly to Drew and are not made public.
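
The severity tiers above can be modelled as a lookup from severity to sinks. This is a sketch of the behaviour only: in practice the mapping would live in Alertmanager's routing configuration, and the `severity` label name, the sink identifiers, and the handling of unknown severities are all assumptions.

```python
# Hypothetical sink identifiers; the three severities come from the list above.
SINKS_BY_SEVERITY = {
    "interesting": ["irc"],
    "important": ["mailing-list", "irc"],
    "urgent": ["pager", "mailing-list", "irc"],
}


def sinks_for(alert_labels: dict) -> list:
    """Return the sinks an alert is forwarded to, based on its severity label.

    Alerts with a missing or unknown severity go nowhere in this sketch.
    """
    severity = alert_labels.get("severity")
    return list(SINKS_BY_SEVERITY.get(severity, []))
```

Each tier includes the sinks of the tier below it, so an urgent alert is also visible on the mailing list and in IRC.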
# Areas for improvement
1. It would be nice to have centralized logging. There is sensitive information in
some of our logs, so this probably can't be made public.
2. Several of our physical hosts are not being monitored. This will be resolved
during [planned maintenance](https://status.sr.ht/issues/2020-03-11-planned-outage/)
on March 11th, 2020.