Hi Lindsay,

Thanks for that comprehensive answer.
So collectd runs on each system itself, but I assume Nagios is centralised at some point, so where would be the most sensible place to do that? Is there ultra-reliable hosting built for just that purpose?

2009/6/22 Lindsay Holmwood <[email protected]>

> Hi Ben,
>
> 2009/6/22 [email protected] <[email protected]>:
> >
> > Features:
> > + Email notifications on critical events (that I can specify)
> > + Overview of all systems being monitored showing current status
> >
> >
> > Monitoring:
> >
> > Critical:
> > * status of software RAID6 array (eg. if any drive fails, even if a hot
> >   spare is available)
> > * usage % of various partitions
> > * monitor the status of my VMs (I intend to use virtualbox)
> > * monitor the status of backups (haven't yet determined what system
> >   I'll be using)
> >
> > Desirable:
> > * monitor my UPS
> > + trigger shutdowns in VMs and then main system if power goes out.
> >
> > Future:
> > * monitor web logs on servers for hits, usage, etc.
> > * monitor security related logs on servers.
> >
> > Will it be simpler to use multiple tools, or is there some giant swiss
> > army knife that it's worth learning?
>
> What you're trying to achieve broadly falls into two categories:
>
> * data collection
> * notification
>
> I find that most of the monitoring tools out there try to do both, and
> don't quite manage to pull it off.
>
> For the data collection, I would recommend using something like
> collectd[0]. It can collect stats on disk space, io throughput, ups
> usage, web server usage (apache2 + nginx), vm utilisation, and a whole
> bunch of other things. It's also network aware, so you can collect
> stats on all your machines individually, and aggregate the results in
> one place.
>
> For the notification, the easiest option would be Nagios[1]. collectd
> provides a collectd-nagios[2] binary which can be used to query stats
> that collectd has collected, and return warnings depending on whether
> values are out of range (which Nagios will pick up and notify you
> about). For quick status checks (questions like "is mdadm reporting
> any failures?"), you can Google for one that suits your taste, or
> write a Nagios check yourself to do it.
>
> The main advantage of breaking the problem up like this is you can
> swap out parts of the system when something better comes along.
>
> Oh, and for triggering shutdowns from your UPS, try something like
> Apcupsd[3].
>
> Lindsay
>
> [0] http://collectd.org/
> [1] http://nagios.org/
> [2] http://collectd.org/documentation/manpages/collectd-nagios.1.shtml
> [3] http://www.apcupsd.com/
>
> --
> http://holmwood.id.au/~lindsay/ (me)
> --
> SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
> Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
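
On the "write a Nagios check yourself" option for the mdadm question quoted above, here is a rough sketch of what such a check might look like in Python. It is only an illustration, not anything from Lindsay's setup: the names check_mdstat and SAMPLE are made up, the sample text merely imitates /proc/mdstat, and treating any failed member as CRITICAL (even when a hot spare exists) is an assumption based on the requirements list. The exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN are the standard Nagios plugin convention.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_mdstat(text):
    """Return (exit_code, message) for text in the format of /proc/mdstat."""
    arrays = []    # every md array seen
    degraded = []  # arrays with a failed or missing member
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            arrays.append(current)
            # A failed member is flagged "(F)" next to its device name,
            # even after a hot spare has taken over.
            if "(F)" in line:
                degraded.append(current)
            continue
        # The status line ends with something like [UU_U]; "_" is a missing member.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            if current not in degraded:
                degraded.append(current)
    if not arrays:
        return UNKNOWN, "UNKNOWN: no md arrays found"
    if degraded:
        return CRITICAL, "CRITICAL: failed or missing members in " + ", ".join(degraded)
    return OK, "OK: " + ", ".join(arrays) + " healthy"


# Made-up sample imitating /proc/mdstat for a RAID6 array with one failed disk.
SAMPLE = """Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sde1[4](S) sdd1[3] sdc1[2](F) sdb1[1] sda1[0]
      1953260544 blocks level 6, 64k chunk, algorithm 2 [4/3] [UU_U]

unused devices: <none>
"""

code, message = check_mdstat(SAMPLE)
print(message)
```

A real check would read /proc/mdstat instead of SAMPLE, print the message, and exit with the returned code so that Nagios can pick up the state and send the notification.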
