Over the years as a developer and ops guy, I’ve had to be on-call a lot. Much of my hatred of on-call stems from being part of small teams that had to monitor hundreds of instances without the resources of a NOC or adequate monitoring/alerting software. However, since I like to set the bar high, I still believe that a team of a few admins should be all it takes to react to problems (read: it shouldn’t take a NOC). With a very high signal-to-noise ratio, the tools require minimal human oversight and everyone gets to sleep better at night.
I’ve seen attempts to have incidents automatically trigger the creation of tickets. It fails because the philosophy is not in place: what should be a ticket? What should be an email? What should get auto-resolved? What counts as a failure? It’s all about cutting out the clutter. Enabling ticketing across the board is a bad idea; it should be enabled on a one-off basis per human-necessitated alert until the noise is silenced. The goal with all operations (especially as it relates to virtualized resources) should be to automate ourselves out of a job. The reality is we’ll never fully accomplish that goal (so we’ll keep our jobs). My feeling is that dashboards convey status far better than individual alerts. Alerts should only call our attention to the dashboards.
The only things that should go to email or become tickets are issues that are directly and immediately actionable by a human.
For example, alerts that often go to humans, but generally should fix themselves (see the sketch after this list):
- Load spikes (autoscale)
- Disk out of space due to /tmp filling up (log rotate)
- Process dies (monit, supervisord)
- Instance dies (terminate, launch new one)
- High swap (only a problem if SLA degraded, probably due to high i/o as a result)
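To make that concrete, here’s a minimal sketch of what “fix it before you page me” can look like for the /tmp case above. The path, threshold, and page() hook are assumptions I’m using for illustration, not a recommendation of a specific stack:

```python
#!/usr/bin/env python3
"""Sketch of a self-healing check: try the automated fix first,
and only page a human if remediation doesn't stick."""

import shutil
import subprocess

TMP_PATH = "/tmp"       # assumed mount to watch
USAGE_LIMIT = 0.90      # assumed threshold: page only if still >90% full


def tmp_usage() -> float:
    """Return the fraction of the filesystem backing /tmp that is in use."""
    total, used, _free = shutil.disk_usage(TMP_PATH)
    return used / total


def remediate() -> None:
    """Attempt the automatic fix: force a log rotation."""
    subprocess.run(["logrotate", "-f", "/etc/logrotate.conf"], check=True)


def page(msg: str) -> None:
    """Placeholder for whatever actually wakes a human up."""
    print("PAGE:", msg)


if __name__ == "__main__":
    if tmp_usage() > USAGE_LIMIT:
        try:
            remediate()
        except subprocess.CalledProcessError:
            pass
        # Only escalate if the automated fix didn't bring usage back down.
        if tmp_usage() > USAGE_LIMIT:
            page(f"{TMP_PATH} still over {USAGE_LIMIT:.0%} after log rotation")
```

The same pattern applies to the other items: restart the process, replace the instance, scale out, and only escalate when the automation fails.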
Things that should go to humans (and could probably be tickets):
- Replace disk in server (not disk failure, raid took care of that)
- Resize database server
- MySQL master failure
- Death of any physical hardware
Here’s why I hate singular email alerts. Take, for example: “CRITICAL: MySQL replication has fallen behind”
Replication doesn’t fall behind by itself. It happens because of any number of external factors:
- out of disk space
- high disk i/o (slow disks, too much swap usage, raid failure, mysql backup, other process)
- disks that are too slow
- blocked queries
- network connectivity/latency problems
- locked tables
Now, all of those things should be monitored. If we alerted on each one, however, we would get a barrage of emails for every issue (rather than a single incident). It is harder to sift through a stream of emails than to view a simple dashboard that shows everything that’s wrong. It’s also much harder to judge how bad a problem is when severity has to be inferred from the rate of emails. And it’s hard to on-board a new hire who has to learn how to read the signal-to-noise ratio of an inbox. Knowing that something is wrong is best determined by the degradation of an SLA, which is what triggers the alert. That way, every time we get an email, we know that a human is needed.
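For what it’s worth, here’s a rough sketch of that SLA-first pattern: every contributing factor gets checked and surfaced on the dashboard, but the only thing that pages anyone is the SLA metric itself. The check names, lag threshold, and dashboard/page hooks are placeholders made up for illustration:

```python
#!/usr/bin/env python3
"""Sketch of dashboard-first, SLA-triggered alerting for replication lag."""

from typing import Callable

# Each contributing factor gets a check that feeds the dashboard,
# but none of them pages anyone on its own.
CHECKS: dict[str, Callable[[], bool]] = {
    "disk_space_ok": lambda: True,   # stand-ins for real probes
    "disk_io_ok":    lambda: True,
    "queries_ok":    lambda: False,
    "network_ok":    lambda: True,
    "tables_ok":     lambda: True,
}

REPLICATION_LAG_SLA = 300  # seconds; assumed SLA on replica freshness


def replication_lag() -> int:
    """Stand-in for reading Seconds_Behind_Master from the replica."""
    return 420


def update_dashboard(statuses: dict[str, bool], lag: int) -> None:
    """All the detail lives here, not in anyone's inbox."""
    for name, ok in statuses.items():
        print(f"{name:15} {'OK' if ok else 'FAIL'}")
    print(f"replication lag: {lag}s (SLA {REPLICATION_LAG_SLA}s)")


def page(msg: str) -> None:
    """Placeholder for whatever actually wakes a human up."""
    print("PAGE:", msg)


if __name__ == "__main__":
    statuses = {name: check() for name, check in CHECKS.items()}
    lag = replication_lag()
    update_dashboard(statuses, lag)
    # One alert, and only when the SLA itself is degraded.
    if lag > REPLICATION_LAG_SLA:
        page(f"replication lag {lag}s exceeds SLA; see dashboard for causes")
```

Swap the stubs for real probes against whatever monitoring you run; the point is that the single alert carries the “a human is needed” signal, and the detail stays on the dashboard.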