Community-Based Alerting
There are so many ways to monitor whether a particular application is working properly or not. For the average website - Is the server pingable? Can you open a socket to port 80? Can you do a GET request and get a “200 OK” response?
Some checks are more complicated still - “Does the page ‘/index.php’ include the text ‘Blog’?” A few even go so far as to simulate multiple page-loads, “clicking” along a path to ensure functionality.
Me, I’m too lazy for that. (Well, sometimes ;) A really sneaky way of checking that your app is working is to measure how much it’s being used! How many concurrent users do you have, compared to normal? How many page/thread/post views per second do you currently have? If your site is broken in any way, these numbers are likely to drop off - even if your checks above find nothing wrong.
Let’s just say that you have an average of 1000 concurrent users. If the site is broken, people won’t hang around. They’ll disappear, coming back later to see if you’ve fixed your problem - so maybe you’ll only have 300 concurrent users. Set a minimum of, say, 500 users, and alert if the number drops below that. Bingo, you have a new alert that will catch failures with very little effort on your part.
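As a minimal sketch of that fixed-floor check (the function name and the 500 default are just illustrative, and where the user count comes from is left to your own stack):

```python
def users_alert(current_users, minimum=500):
    """Return True when the concurrent-user count has dropped below the floor."""
    return current_users < minimum


# With ~1000 users normally and ~300 during an outage, a floor of 500
# stays quiet on a normal day and fires when people leave.
if users_alert(300):
    print("ALERT: concurrent users below minimum")
```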
Naturally, most sites don’t have flat traffic all week long. Maybe your off-peak traffic is only 400 users, and your peak traffic is 1000. No single fixed threshold can alert you to a problem while also tolerating the normal fluctuations that you’re likely to see. The solution, for a lot of sites, is as simple as making your threshold into a sine wave. For this example, the function 300*SIN(t)+400, where t is the time of day scaled so that one full cycle takes 24 hours (passing raw time() in seconds would complete a cycle roughly every 6 seconds), sits at a pretty nice threshold for my sample data.
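Here’s one way the sine-wave threshold might look in code. This is only a sketch: the 24-hour period, the UTC clock, and the peak_hour parameter are my assumptions for illustration, and you’d tune the amplitude and baseline against your own traffic.

```python
import math
import time

SECONDS_PER_DAY = 86400


def dynamic_threshold(now=None, amplitude=300, baseline=400, peak_hour=20):
    """Sine-wave alert floor with a 24-hour period.

    The floor is highest (baseline + amplitude) at peak_hour UTC and
    lowest (baseline - amplitude) twelve hours later.
    """
    if now is None:
        now = time.time()
    seconds_into_day = now % SECONDS_PER_DAY
    # Scale the offset from the peak to a full 2*pi cycle per day, then
    # add pi/2 so that sin() equals 1 exactly at peak_hour.
    phase = (2 * math.pi * (seconds_into_day - peak_hour * 3600) / SECONDS_PER_DAY
             + math.pi / 2)
    return amplitude * math.sin(phase) + baseline
```

With these defaults the floor swings between 100 and 700, which stays about 300 users under traffic that swings between 400 and 1000.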
How you actually tie this into your alerting system is obviously dependent on your alerting system! Any system that allows custom scripting is likely to support this pretty readily. As a random example, I’ve successfully tied it into munin’s alerting by adjusting the “users.critical” config fragment to be echo’d out with a dynamic number driven from a formula similar to the above; munin re-read it on each evaluation and quite happily emailed or SMS’d me when things “weren’t quite right”.
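To give a rough idea of the munin side, here’s a sketch of a plugin that prints a moving “users.critical” line when munin asks for its config. It is not my original script: the graph labels, the peak hour, and the current_users() placeholder are all made up for illustration. It leans on munin’s “min:max” range syntax, where a value like “500:” means “critical if below 500”.

```python
import math
import sys
import time

SECONDS_PER_DAY = 86400


def dynamic_threshold(now, amplitude=300, baseline=400, peak_hour=20):
    """Sine-wave floor with a 24-hour period, highest at peak_hour UTC."""
    offset = (now % SECONDS_PER_DAY) - peak_hour * 3600
    phase = 2 * math.pi * offset / SECONDS_PER_DAY + math.pi / 2
    return amplitude * math.sin(phase) + baseline


def config_lines(now):
    """Build the plugin's config output with a time-dependent critical floor."""
    floor = round(dynamic_threshold(now))
    return [
        "graph_title Concurrent users",
        "graph_vlabel users",
        "users.label users",
        # "N:" is a range with only a minimum: critical when value < N.
        f"users.critical {floor}:",
    ]


def current_users():
    """Hypothetical placeholder: replace with a real count from your app."""
    return 0


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        print("\n".join(config_lines(time.time())))
    else:
        print(f"users.value {current_users()}")


if __name__ == "__main__":
    main(sys.argv)
```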
False alarms are moderately rare, but certainly do happen - usually I found that things like public holidays (including in other countries :) would trigger it. It’s far from foolproof, but nifty - and pretty darn easy to implement.