---
title: Emergency planning
---

On several occasions, we have simulated outages and walked through the
procedures for resolving them. This is useful for:

1. Testing that our systems can tolerate or recover from such failures
2. Familiarizing operators with the resolution procedures

So far these exercises have been conducted informally. We should give them more
structure and schedule them regularly.

Ideas:

- Simulate disk failures (yank out a hard drive!)
- Simulate outages for redundant services
  (see [availability](/ops/availability.md))
- Kill Celery workers and see how well they catch up on the backlog afterwards
- Restore systems from backup, then put the restored system into normal service
  and tear down the original
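
As a starting point for the Celery idea above, here is a minimal sketch of a
drill helper in Python. It is illustrative only, not an existing tool in our
setup. The function names (`find_worker_pids`, `kill_random_worker`) are
hypothetical, and the matching rule assumes that worker command lines contain
both `celery` and a standalone `worker` argument, which may not match how the
workers are actually launched.

```python
import os
import random
import signal
import subprocess
from typing import List, Optional


def find_worker_pids(ps_output: str) -> List[int]:
    """Return PIDs of processes that look like Celery workers.

    `ps_output` is expected to hold one process per line: the PID first,
    then the full command line (as printed by `ps -e -o pid= -o args=`).
    Assumption: a worker's command line mentions "celery" and has a
    standalone "worker" argument.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        pid, cmd = parts
        if "celery" in cmd and "worker" in cmd.split():
            pids.append(int(pid))
    return pids


def kill_random_worker(dry_run: bool = True) -> Optional[int]:
    """Pick one matching worker at random and SIGKILL it.

    With dry_run=True (the default) nothing is killed; the chosen PID is
    only returned so the operator can review it first.
    """
    ps_output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    pids = find_worker_pids(ps_output)
    if not pids:
        return None
    victim = random.choice(pids)
    if not dry_run:
        os.kill(victim, signal.SIGKILL)
    return victim
```

Calling `kill_random_worker()` with the default `dry_run=True` only reports
which PID would be killed. Defaulting to a dry run is a deliberate choice for a
drill script, so that nothing is killed without an explicit `dry_run=False`.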