Hi Ben,
That's right, you'd want to aggregate the collectd stats in one place,
and run Nagios on the same system.
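
With everything on one box, a Nagios check only has to compare an
aggregated value against thresholds and report the result through its
exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN, per the Nagios
plugin convention). A rough sketch in Python, assuming higher values
are worse; classify() and report() are made-up names for illustration,
not part of collectd or Nagios:

```python
# Exit statuses from the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a reading to a status; higher values are treated as worse."""
    if value is None:  # no data available for this metric
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_status, one-line message) for Nagios to display."""
    status = classify(value, warn, crit)
    shown = "no data" if value is None else f"{value:g}"
    return status, f"{LABELS[status]} - {name} is {shown}"
```

A real plugin would print the message and hand the status to
sys.exit(). With warn=80 and crit=90, a partition at 93% comes back as
CRITICAL.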

Whether that runs in a VM or on a dedicated host doesn't really matter.
There are lots of hosting providers out there, but I would recommend
choosing one geographically close to the machines you're monitoring
(same continent is fine).

For Australian hosting, I would recommend Bulletproof, but I'm sure
plenty of people on this list will have suggestions.

Lindsay

2009/6/22 Ben <[email protected]>:
> Hi Lindsay,
>
> Thanks for that comprehensive answer.
>
> So collectd runs on each system itself, but I assume Nagios is centralised
> at some point, so where would be the most sensible place to do that? Is
> there ultra reliable hosting built for just that purpose?
>
>
>
> 2009/6/22 Lindsay Holmwood <[email protected]>
>>
>> Hi Ben,
>>
>> 2009/6/22 [email protected] <[email protected]>:
>> >
>> > Features:
>> >  + Email notifications on critical events (that I can specify)
>> >  + Overview of all systems being monitored showing current status
>> >
>> >
>> > Monitoring:
>> >
>> > Critical:
>> > * status of software RAID6 array (eg. if any drive fails, even if a hot
>> > spare is available)
>> > * usage % of various partitions
>> > * monitor the status of my VMs (I intend to use virtualbox)
>> > * monitor the status of backups (haven't yet determined what
>> > system I'll be using)
>> >
>> > Desirable:
>> > * monitor my UPS
>> >  + trigger shutdowns in VMs and then main system if power goes out.
>> >
>> > Future:
>> > * monitor web logs on servers for hits, usage, etc.
>> > * monitor security related logs on servers.
>> >
>> > Will it be simpler to use multiple tools, or is there some giant
>> > swiss army knife that it's worth learning?
>>
>> What you're trying to achieve broadly falls into two categories:
>>
>>  * data collection
>>  * notification
>>
>> I find that most of the monitoring tools out there try to do both, and
>> don't quite manage to pull it off.
>>
>> For the data collection, I would recommend using something like
>> collectd[0]. It can collect stats on disk space, I/O throughput, UPS
>> usage, web server usage (apache2 + nginx), VM utilisation, and a whole
>> bunch of other things. It's also network aware, so you can collect
>> stats on all your machines individually, and aggregate the results in
>> one place.
>>
>> For the notification, the easiest option would be Nagios[1]. collectd
>> provides a collectd-nagios[2] binary which can be used to query stats
>> that collectd has collected, and return warnings depending on whether
>> values are out of range (which Nagios will pick up and notify you
>> about). For quick status checks (questions like "is mdadm reporting
>> any failures?"), you can Google for one that suits your taste, or
>> write a Nagios check yourself to do it.
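
To make the "write a Nagios check yourself" option concrete, here is a
minimal sketch of an mdadm-style check in Python. It assumes the usual
Linux /proc/mdstat layout (an "mdX : ..." header line followed by a
"[n/m] [UU_]" member map) and treats any "(F)" flag or "_" slot as a
failure, which also fires when a hot spare has taken over. The function
names are hypothetical, not an existing plugin:

```python
import re

# Exit statuses from the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

ARRAY_LINE = re.compile(r"^(md\S*)\s*:")                  # e.g. "md0 : active raid6 ..."
MEMBER_MAP = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")  # e.g. "[5/4] [UUUU_]"


def degraded_arrays(mdstat_text):
    """Return names of arrays with a failed (F) or missing (_) member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            if "(F)" in line:  # a member is flagged as faulty
                degraded.append(current)
            continue
        members = MEMBER_MAP.search(line)
        if members and current and "_" in members.group(3):
            if current not in degraded:
                degraded.append(current)
    return degraded


def check_mdstat(mdstat_text):
    """Return (exit_status, message) in the form Nagios expects."""
    if not mdstat_text.strip():
        return UNKNOWN, "UNKNOWN - mdstat is empty"
    bad = degraded_arrays(mdstat_text)
    if bad:
        return CRITICAL, "CRITICAL - degraded arrays: " + ", ".join(bad)
    return OK, "OK - no failed or missing RAID members"
```

To use it, read /proc/mdstat, pass the text to check_mdstat(), print
the message and exit with the returned status.
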
>>
>> The main advantage of breaking the problem up like this is you can
>> swap out parts of the system when something better comes along.
>>
>> Oh, and for triggering shutdowns from your UPS, try something like
>> Apcupsd[3].
>>
>> Lindsay
>>
>> [0] http://collectd.org/
>> [1] http://nagios.org/
>> [2] http://collectd.org/documentation/manpages/collectd-nagios.1.shtml
>> [3] http://www.apcupsd.com/
>>
>> --
>> http://holmwood.id.au/~lindsay/ (me)
>> --
>> SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
>> Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
>
>



-- 
http://holmwood.id.au/~lindsay/ (me)
