Hi Ben,

That's right, you'd want to aggregate the collectd stats in one place, and run Nagios on the same system.
Whether that's a VM or a dedicated host doesn't really matter. There are lots of hosting providers out there, but I would recommend going for one geographically close to the machines you're monitoring (same continent is fine).

For Australian hosting, I would recommend Bulletproof, but I'm sure plenty of people on this list will have suggestions.

Lindsay

2009/6/22 Ben <[email protected]>:
> Hi Lindsay,
>
> Thanks for that comprehensive answer.
>
> So collectd runs on each system itself, but I assume Nagios is centralised
> at some point, so where would be the most sensible place to do that? Is
> there ultra reliable hosting built for just that purpose?
>
>
>
> 2009/6/22 Lindsay Holmwood <[email protected]>
>>
>> Hi Ben,
>>
>> 2009/6/22 [email protected] <[email protected]>:
>> >
>> > Features:
>> > + Email notifications on critical events (that I can specify)
>> > + Overview of all systems being monitored showing current status
>> >
>> >
>> > Monitoring:
>> >
>> > Critical:
>> > * status of software RAID6 array (eg. if any drive fails, even if a hot
>> > spare is available)
>> > * usage % of various partitions
>> > * monitor the status of my VMs (I intend to use virtualbox)
>> > * monitor the status of backups (haven't yet determined what system
>> > I'll be using)
>> >
>> > Desirable:
>> > * monitor my UPS
>> > + trigger shutdowns in VMs and then main system if power goes out.
>> >
>> > Future:
>> > * monitor web logs on servers for hits, usage, etc.
>> > * monitor security related logs on servers.
>> >
>> > Will it be simpler to use multiple tools, or is there some giant swiss
>> > army knife that it's worth learning?
>>
>> What you're trying to achieve broadly falls into two categories:
>>
>> * data collection
>> * notification
>>
>> I find that most of the monitoring tools out there try to do both, and
>> don't quite manage to pull it off.
>>
>> For the data collection, I would recommend using something like
>> collectd[0]. It can collect stats on disk space, io throughput, ups
>> usage, web server usage (apache2 + nginx), vm utilisation, and a whole
>> bunch of other things. It's also network aware, so you can collect
>> stats on all your machines individually, and aggregate the results in
>> one place.
>>
>> For the notification, the easiest option would be Nagios[1]. collectd
>> provides a collectd-nagios[2] binary which can be used to query stats
>> that collectd has collected, and return warnings depending on whether
>> values are out of range (which Nagios will pick up and notify you
>> about). For quick status checks (questions like "is mdadm reporting
>> any failures?"), you can Google for one that suits your taste, or
>> write a Nagios check yourself to do it.
>>
>> The main advantage of breaking the problem up like this is you can
>> swap out parts of the system when something better comes along.
>>
>> Oh, and for triggering shutdowns from your UPS, try something like
>> Apcupsd[3].
>>
>> Lindsay
>>
>> [0] http://collectd.org/
>> [1] http://nagios.org/
>> [2] http://collectd.org/documentation/manpages/collectd-nagios.1.shtml
>> [3] http://www.apcupsd.com/
>>
>> --
>> http://holmwood.id.au/~lindsay/ (me)
>> --
>> SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
>> Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
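
P.S. On the quoted suggestion of writing a Nagios check yourself for questions like "is mdadm reporting any failures?": such a check can be quite small. What follows is only a minimal sketch in Python, not a tested production plugin. It assumes a Linux host that exposes /proc/mdstat in the usual format, and it follows the common Nagios plugin convention of printing one status line and signalling the result through an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names check_mdstat and main are made up for illustration.

```python
"""Sketch of a Nagios-style check for Linux software RAID via /proc/mdstat."""
import re

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_mdstat(text):
    """Return (exit_code, message) for the given /proc/mdstat contents."""
    problems = []
    arrays = 0
    current = None
    for line in text.splitlines():
        # Array header, e.g. "md0 : active raid6 sdc1[1] sdb1[0](F)".
        header = re.match(r"^(md\S*)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            arrays += 1
            # Members flagged "(F)" have been marked faulty by the kernel.
            failed = re.findall(r"(\S+?)\[\d+\]\(F\)", header.group(2))
            if failed:
                problems.append("%s: failed %s" % (current, ",".join(failed)))
            continue
        # Status line, e.g. "... [4/3] [UU_U]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            problems.append("%s: degraded [%s]" % (current, status.group(3)))
    if problems:
        return CRITICAL, "CRITICAL - " + "; ".join(problems)
    if arrays == 0:
        return UNKNOWN, "UNKNOWN - no md arrays found"
    return OK, "OK - %d array(s) healthy" % arrays


def main(path="/proc/mdstat"):
    """Print the status line and return the exit code Nagios expects."""
    try:
        with open(path) as handle:
            code, message = check_mdstat(handle.read())
    except OSError as exc:
        code, message = UNKNOWN, "UNKNOWN - cannot read %s: %s" % (path, exc)
    print(message)
    return code
```

To run it as a plugin, import sys and end the script with sys.exit(main()) so that Nagios sees the exit code. Because it flags any member marked (F) as well as any array with a missing slot, a failed drive should still be reported when a hot spare has taken over, at least until the failed device is removed from the array.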
