---
title: "Monitoring & alarms"
---

We monitor everything with Prometheus and configure alarms with Alertmanager.

# Public metrics

Our Prometheus instance is publicly available at
[metrics.sr.ht](https://metrics.sr.ht).

## Areas for improvement

1. We should make dashboards. It would be pretty to look at and could be a
   useful tool for root cause analysis. Note that some users who have their own
   Grafana instance have pointed it at our public Prometheus data and made some
   simple dashboards - I would be open to having community ownership over this.

# Pushgateway

A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept
connections from [our subnet](/ops/topology.md).
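
As a rough illustration of how a batch job inside the subnet might report to
the pushgateway, here is a minimal sketch using only the Python standard
library. It relies on the Pushgateway convention of sending metrics in the
text exposition format to `/metrics/job/<job>`. The `https` scheme, the job
name, the label, and the metric name are all assumptions made for the example,
not details of our setup.

```python
import time
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the gateway is reachable over HTTPS on the default port.
# This document only gives the hostname.
PUSHGATEWAY = "https://push.metrics.sr.ht"


def push_url(base, job, labels=None):
    """Build the grouping-key URL: <base>/metrics/job/<job>/<label>/<value>..."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in (labels or {}).items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


def render_gauge(name, value, help_text):
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    ).encode()


def push(url, body, timeout=10.0):
    """PUT the metrics, replacing the whole group; returns the HTTP status."""
    req = Request(url, data=body, method="PUT")
    with urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical job, label, and metric names, for illustration only.
    url = push_url(PUSHGATEWAY, "example_backup", {"instance": "example-host"})
    body = render_gauge(
        "example_backup_last_success_timestamp_seconds",
        time.time(),
        "Unix time of the last successful backup.",
    )
    print(push(url, body))
```

Because of the firewall rule, this only works from a host inside the subnet;
from anywhere else the connection should fail before any metrics are sent.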

# Aggregation gateway

[prom-aggregation-gateway](https://github.com/weaveworks/prom-aggregation-gateway)
is running at aggr.metrics.sr.ht. It's firewalled to only accept connections
from [our subnet](/ops/topology.md).

# Alertmanager

We use alertmanager to forward [alerts](https://metrics.sr.ht/alerts) to various
sinks.

- **interesting** alerts are forwarded to the IRC channel, #sr.ht.ops
- **important** alerts are sent to the ops mailing list and the IRC channel
- **urgent** alerts page Drew's phone and are sent to the mailing list and the
  IRC channel

Some security-related alarms are sent directly to Drew and are not made public.
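
The three public tiers above amount to a small lookup from severity to sinks.
The sketch below models that in Python purely as an illustration; the real
routing lives in the Alertmanager configuration, and the `severity` label name
and the sink identifiers used here are assumptions.

```python
# Hypothetical sink identifiers; the real receiver names are not given here.
IRC = "irc:#sr.ht.ops"
MAIL = "ops-mailing-list"
PAGER = "pager:drew"

# Each tier notifies every sink of the tier below it, plus one more.
ROUTES = {
    "interesting": [IRC],
    "important": [MAIL, IRC],
    "urgent": [PAGER, MAIL, IRC],
}


def sinks_for(labels):
    """Return the sinks for an alert's labels; unknown severities get none."""
    # Assumption: the tier is carried in a label called "severity".
    return ROUTES.get(labels.get("severity", ""), [])


if __name__ == "__main__":
    for severity in ("interesting", "important", "urgent"):
        print(severity, "->", ", ".join(sinks_for({"severity": severity})))
```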

# Areas for improvement

1. Would be nice to have centralized logging. There is sensitive information in
   some of our logs, so this probably can't be made public.
2. Several of our physical hosts are not being monitored. This will be resolved
   during [planned maintenance](https://status.sr.ht/issues/2020-03-11-planned-outage/)
   on March 11th, 2020.