---
title: "Monitoring & alarms"
---

We monitor everything with Prometheus and configure alarms with Alertmanager.

# Public metrics

Our Prometheus instance is publicly available at
[metrics.sr.ht](https://metrics.sr.ht).
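
Because the instance is public, it can be queried with the standard Prometheus HTTP API. The following is a minimal sketch, assuming the API is served at the usual `/api/v1/query` path on this host; the helper names are illustrative and not part of any sr.ht tooling.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from this page; /api/v1/query is Prometheus's standard
# instant-query endpoint (assumed to be exposed here).
BASE_URL = "https://metrics.sr.ht"


def build_query_url(expr, base_url=BASE_URL):
    """Return the instant-query URL for a PromQL expression."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def instant_query(expr, base_url=BASE_URL, timeout=10):
    """Run an instant query and return the list of result samples."""
    url = build_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %r" % payload.get("error"))
    return payload["data"]["result"]


if __name__ == "__main__":
    # `up` is the series Prometheus records for every scrape target; the
    # labels printed depend on the actual scrape configuration.
    for sample in instant_query("up"):
        print(sample["metric"], sample["value"])
```

A Grafana instance pointed at the same base URL uses this API under the hood.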

## Areas for improvement

1. We should make dashboards. They would be nice to look at and could be a
   useful tool for root cause analysis. Some users who run their own Grafana
   instance have already pointed it at our public Prometheus data and made
   simple dashboards. I would be open to community ownership of this.

# Pushgateway

A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
connections from [our subnet](/ops/topology.md).
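
As a rough illustration of how a batch job inside the subnet might push a metric, here is a sketch using the Pushgateway's documented `/metrics/job/<job>` endpoint and the Prometheus text exposition format. The `https` scheme, the job name, and the metric name are assumptions made for the example.

```python
import urllib.parse
import urllib.request

# Hostname from this page; the https scheme is an assumption.
PUSHGATEWAY = "https://push.metrics.sr.ht"


def push_url(job, base_url=PUSHGATEWAY):
    """Return the Pushgateway URL for the given job's metric group."""
    return base_url + "/metrics/job/" + urllib.parse.quote(job, safe="")


def render_gauge(name, value, help_text):
    """Render one gauge in the Prometheus text exposition format."""
    return "# HELP %s %s\n# TYPE %s gauge\n%s %s\n" % (
        name, help_text, name, name, value)


def push(job, body, base_url=PUSHGATEWAY, timeout=10):
    """PUT the metrics, replacing anything previously pushed for this job."""
    req = urllib.request.Request(
        push_url(job, base_url), data=body.encode("utf-8"), method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical job and metric names, for illustration only.
    body = render_gauge(
        "backup_last_success_timestamp_seconds",
        1700000000,
        "Time of the last successful backup.",
    )
    print(push("nightly_backup", body))
```

From outside the subnet the request is expected to fail at the firewall rather than reach the gateway.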

# Aggregation gateway

[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
from [our subnet](/ops/topology.md).
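
The difference from the Pushgateway is that pushed values are added up rather than replaced: per the project's README, counters whose labels all match are summed. The toy function below only illustrates that behavior for counters; it is not the gateway's code, and the metric and label names are hypothetical.

```python
from collections import defaultdict


def aggregate_counters(pushes):
    """Sum counter samples whose metric name and full label set match.

    `pushes` is an iterable of (name, labels_dict, value) tuples.
    """
    totals = defaultdict(float)
    for name, labels, value in pushes:
        # Sort the labels so that dict ordering does not affect the key.
        key = (name, tuple(sorted(labels.items())))
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Two workers report the same counter; a third reports a different label.
    pushes = [
        ("jobs_total", {"status": "ok"}, 2),
        ("jobs_total", {"status": "ok"}, 3),
        ("jobs_total", {"status": "failed"}, 1),
    ]
    for key, total in aggregate_counters(pushes).items():
        print(key, total)
```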

# Alertmanager

We use Alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
sinks.

- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and the IRC channel
- **urgent** alerts page Drew's phone and are sent to the mailing list and the
  IRC channel
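
The tiers above amount to a lookup from severity to a set of sinks. The sketch below models that mapping in Python for illustration only; the real routing lives in the Alertmanager configuration, and the `severity` label name and sink identifiers used here are hypothetical.

```python
# Sinks per severity, following the list above. The sink identifiers and the
# "severity" label name are hypothetical, not taken from the real config.
ROUTES = {
    "interesting": ["irc"],
    "important": ["mailing-list", "irc"],
    "urgent": ["pager", "mailing-list", "irc"],
}


def sinks_for(alert_labels):
    """Return the sinks an alert would reach, based on its severity label."""
    return ROUTES.get(alert_labels.get("severity"), [])


if __name__ == "__main__":
    print(sinks_for({"alertname": "DiskFull", "severity": "urgent"}))
```

Each tier includes every sink of the tier below it, so raising an alert's severity only ever adds recipients.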

Some security-related alarms are sent directly to Drew and are not made public.

Our alerts are configured here:

https://git.sr.ht/~sircmpwn/metrics.sr.ht

# Areas for improvement

1. It would be nice to have centralized logging. Some of our logs contain
   sensitive information, so this probably can't be made public.