Hi Lindsay,

Thanks for that comprehensive answer.

So collectd runs on each monitored system, but I assume Nagios runs
centrally somewhere. Where would be the most sensible place to host it? Is
there ultra-reliable hosting built for just that purpose?



2009/6/22 Lindsay Holmwood <[email protected]>

> Hi Ben,
>
> 2009/6/22 [email protected] <[email protected]>:
> >
> > Features:
> >  + Email notifications on critical events (that I can specify)
> >  + Overview of all systems being monitored showing current status
> >
> >
> > Monitoring:
> >
> > Critical:
> > * status of software RAID6 array (eg. if any drive fails, even if a hot
> > spare is available)
> > * usage % of various partitions
> > * monitor the status of my VMs (I intend to use virtualbox)
> > * monitor the status of backups (haven't yet determined what system
> > I'll be using)
> >
> > Desirable:
> > * monitor my UPS
> >  + trigger shutdowns in VMs and then main system if power goes out.
> >
> > Future:
> > * monitor web logs on servers for hits, usage, etc.
> > * monitor security related logs on servers.
> >
> > Will it be simpler to use multiple tools, or is there some giant
> > swiss army knife that it's worth learning?
>
> What you're trying to achieve broadly falls into two categories:
>
>  * data collection
>  * notification
>
> I find that most of the monitoring tools out there try to do both, and
> don't quite manage to pull it off.
>
> For the data collection, I would recommend using something like
> collectd[0]. It can collect stats on disk space, I/O throughput, UPS
> usage, web server usage (apache2 + nginx), VM utilisation, and a whole
> bunch of other things. It's also network aware, so you can collect
> stats on all your machines individually, and aggregate the results in
> one place.
>
> For the notification, the easiest option would be Nagios[1]. collectd
> provides a collectd-nagios[2] binary which can be used to query stats
> that collectd has collected, and return warnings depending on whether
> values are out of range (which Nagios will pick up and notify you
> about). For quick status checks (questions like "is mdadm reporting
> any failures?"), you can Google for an existing check that suits your
> taste, or write a Nagios check yourself to do it.
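
To make the "write a Nagios check yourself" option concrete, here is a
minimal sketch in Python of a check for degraded software RAID arrays. It
assumes the usual /proc/mdstat layout, where each array stanza ends in a
member map such as [UUU_] and an underscore marks a missing member. The
function names (degraded_arrays, check, main) are made up for illustration
and are not part of Nagios or mdadm. The exit codes follow the standard
Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
"""Sketch of a Nagios-style check for Linux software RAID (md) health."""
import re
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Matches the member status map in /proc/mdstat, e.g. "[UUU_]".
# Counters such as "[4/3]" and progress bars do not match this pattern.
STATUS_MAP = re.compile(r"\[([U_]+)\]")
# Matches the first line of an array stanza, e.g. "md0 : active raid6 ...".
ARRAY_LINE = re.compile(r"^(md\S+)\s*:")


def degraded_arrays(mdstat_text):
    """Return names of arrays whose status map shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line)
        if match:
            current = match.group(1)
            continue
        status = STATUS_MAP.search(line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
            current = None  # report each array at most once
    return degraded


def check(mdstat_text):
    """Map the parsed state to a (exit code, status line) pair."""
    bad = degraded_arrays(mdstat_text)
    if bad:
        return CRITICAL, "RAID CRITICAL - degraded: " + ", ".join(bad)
    return OK, "RAID OK - no degraded arrays"


def main(path="/proc/mdstat"):
    """Read mdstat, print one status line, and return the exit code."""
    try:
        with open(path) as handle:
            text = handle.read()
    except OSError as err:
        print("RAID UNKNOWN - cannot read %s: %s" % (path, err))
        return UNKNOWN
    code, message = check(text)
    print(message)
    return code
```

To use it as an actual check, the script would end with sys.exit(main())
so that Nagios sees the exit code. Because an array still shows an
underscore while a hot spare is rebuilding, this should flag a failed
drive even when a spare has taken over.
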
>
> The main advantage of breaking the problem up like this is you can
> swap out parts of the system when something better comes along.
>
> Oh, and for triggering shutdowns from your UPS, try something like
> Apcupsd[3].
>
> Lindsay
>
> [0] http://collectd.org/
> [1] http://nagios.org/
> [2] http://collectd.org/documentation/manpages/collectd-nagios.1.shtml
> [3] http://www.apcupsd.com/
>
> --
> http://holmwood.id.au/~lindsay/ (me)
> --
> SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
> Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
>
