---
title: Emergency planning
---

On several occasions, we have simulated outages and gone through the motions of resolving them. This is useful for:

1. Testing that our systems can tolerate or recover from such failures
2. Familiarizing operators with the resolution procedures

So far this has been done informally. We should give these exercises more structure and schedule them regularly.

Ideas:

- Simulate disk failures (yank out a hard drive!)
- Simulate outages for redundant services (see [availability](/ops/availability.md))
- Kill celery workers and see how they cope with catching up again
- Restore systems from backup, then put the restored system into normal service and tear down the original
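
One way to plan these events regularly would be to rotate through the ideas above on a fixed cadence. The following is a minimal sketch using only the Python standard library. The `plan_drills` helper, the `DRILLS` list and the 30-day default interval are hypothetical and shown for illustration only; they do not describe any existing tooling.

```python
from datetime import date, timedelta
from itertools import cycle, islice

# Drill names paraphrase the ideas listed above (illustrative only).
DRILLS = [
    "Simulate a disk failure",
    "Simulate an outage of a redundant service",
    "Kill celery workers and watch them catch up",
    "Restore a system from backup and put it into service",
]


def plan_drills(start: date, count: int, every_days: int = 30) -> list[tuple[date, str]]:
    """Return `count` (date, drill) pairs, rotating through DRILLS.

    `every_days` is an assumed cadence; adjust it to whatever the team agrees on.
    """
    if count < 0 or every_days <= 0:
        raise ValueError("count must be >= 0 and every_days must be > 0")
    # cycle() repeats the drill list indefinitely; islice() takes the first `count`.
    drills = islice(cycle(DRILLS), count)
    return [
        (start + timedelta(days=i * every_days), drill)
        for i, drill in enumerate(drills)
    ]


if __name__ == "__main__":
    # Print a six-event plan starting from an arbitrary example date.
    for when, drill in plan_drills(date(2024, 1, 8), 6):
        print(f"{when.isoformat()}  {drill}")
```

A simple rotation means every type of drill comes up again after a predictable interval, so no procedure goes unpractised for long.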