I evaluated different monitoring systems over the years and finally settled on Monit: http://mmonit.com/monit/. Monit is very easy to configure and can react to error conditions. For example, if an HTTP server crashes, Monit can restart the daemon locally. All monitoring runs on the monitored system itself, which is what allows it to restart daemons. The full documentation is available here: http://mmonit.com/monit/documentation/monit.html
In addition, M/Monit (http://mmonit.com/) provides a central reporting host that collects the monitoring information. It is a commercial product, which we could buy.
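To illustrate how a local Monit could report to such a central host, here is a minimal sketch of the relevant control-file lines. The collector hostname, port and credentials are placeholders, not values from our setup, and would need to be adapted:

```
# Report status and events to a central M/Monit collector.
# Hostname, port and credentials below are placeholders (assumptions).
set mmonit http://monit:monit@mmonit.example.org:8080/collector

# M/Monit talks back to the local Monit through Monit's HTTP interface,
# so that interface has to be enabled and reachable from the collector.
set httpd port 2812
    allow monit:monit
```

Each monitored machine still runs its own checks. The central host only collects the results.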
The monitoring of my nginx server is configured like this:
  check process nginx with pidfile /var/run/nginx.pid
    start program = "/etc/init.d/nginx start"
    stop program = "/etc/init.d/nginx stop"
    group server
    if failed host www.gonium.net port 80 protocol http
      and request "/monit/token" then restart
    if cpu is greater than 60% for 2 cycles then alert
    if cpu > 80% for 5 cycles then restart
    if totalmem > 256 MB for 5 cycles then restart
    if children > 16 then restart
    if loadavg(5min) greater than 10 for 8 cycles then stop
    if 3 restarts within 5 cycles then timeout
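This configuration snippet checks the nginx process identified by /var/run/nginx.pid. If an HTTP request for /monit/token on www.gonium.net port 80 fails, nginx is restarted. An alert is sent when CPU usage exceeds 60% for 2 cycles. nginx is restarted when CPU usage exceeds 80% for 5 cycles, when total memory exceeds 256 MB for 5 cycles, or when there are more than 16 child processes. If the 5-minute load average stays above 10 for 8 cycles, nginx is stopped. After 3 restarts within 5 cycles, Monit gives up and stops monitoring the service (timeout).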
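The "cycles" in these rules count Monit's polling intervals, which are set globally rather than per check. As a minimal sketch, assuming a 120-second interval and a placeholder mail server and alert address (none of these are taken from the actual setup), the global section might look like this:

```
# Poll all services every 120 seconds; one "cycle" in the checks is one poll.
# The interval is an assumption, chosen only for illustration.
set daemon 120

# Where alerts are sent (placeholder mail server and address).
set mailserver localhost
set alert admin@example.org
```

With this interval, "for 5 cycles" means the condition has to hold for about 10 minutes before nginx is restarted, and "for 2 cycles" means about 4 minutes before an alert is sent.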
TODO: List of daemons to look at, error conditions etc.