[Nagios-users] Monitoring large (ish) numbers of servers with exceptions to the rules...

2008-06-17 Thread Matthew Macdonald-Wallace
Hi All,

I currently help maintain and monitor around 50 servers across various
parts of the UK using Nagios 2.  At the moment, we have a configuration
file for each host (%hostname%.cfg) and in that file we specify all the
services for the named host.

We are trying to reduce the number of configuration files as we take on
more and more servers, because there are a large number of checks that
need to be rolled out to all servers and we feel that we are
duplicating our workload.

I'm open to ideas on how to achieve this; however, my thoughts were a
setup along the lines of the following:

 - A master host template is created in which all services are defined
   for a host.

 - If a check does not need to be run for a given host (for example, it
   is not a web server), a stanza is added to that particular host's
   config file that effectively tells Nagios "don't check for this
   service on this host".

I've tried defining all the services in a master templates file and
this works perfectly; however, when I come to exclude certain services,
I am at a loss as to how to do it.

Initially I tried adding a stanza with the same service name and
"register 0" as one of the options; however, this didn't work.

We have used HostGroups in the past to achieve a similar goal; however,
we ran into the issue that whilst we need to check the CPU usage on all
of the servers, a few of the servers that we monitor can take a lot
more of a beating than the majority.  This led to us defining the CPU
checks on a per-host basis, because if we defined the check separately
from the hostgroup for the more powerful servers, we were presented
with a load of errors regarding duplicate service names.

I hope I've made myself clear on what we're after and I look forward to
receiving your input on this.

Kind regards,

Matt
-- 
Matt Wallace
[EMAIL PROTECTED]
http://www.truthisfreedom.org.uk/

-
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


Re: [Nagios-users] Monitoring large (ish) numbers of servers with exceptions to the rules...

2008-06-17 Thread Wheeler, JF (Jonathan)
 -Original Message-
 From: nagios-users On Behalf Of Matthew Macdonald-Wallace
 Sent: 17 June 2008 13:14
 
 I currently help maintain and monitor around 50 servers across various
 parts of the UK using Nagios 2.  At the moment, we have a configuration
 file for each host (%hostname%.cfg) and in that file we specify all the
 services for the named host.

 We are trying to reduce the number of configuration files as we take on
 more and more servers, because there are a large number of checks that
 need to be rolled out to all servers and we feel that we are
 duplicating our workload.

 I'm open to ideas on how to achieve this; however, my thoughts were a
 setup along the lines of the following:

  - A master host template is created in which all services are defined
    for a host.

  - If a check does not need to be run for a given host (for example, it
    is not a web server), a stanza is added to that particular host's
    config file that effectively tells Nagios "don't check for this
    service on this host".

 I've tried defining all the services in a master templates file and
 this works perfectly; however, when I come to exclude certain services,
 I am at a loss as to how to do it.

 Initially I tried adding a stanza with the same service name and
 "register 0" as one of the options; however, this didn't work.

 We have used HostGroups in the past to achieve a similar goal; however,
 we ran into the issue that whilst we need to check the CPU usage on all
 of the servers, a few of the servers that we monitor can take a lot
 more of a beating than the majority.  This led to us defining the CPU
 checks on a per-host basis, because if we defined the check separately
 from the hostgroup for the more powerful servers, we were presented
 with a load of errors regarding duplicate service names.

 I hope I've made myself clear on what we're after and I look forward to
 receiving your input on this.

One thing that I use in the configuration that I maintain is to have
something like this:

define service{
        use             generic-hung-mounts
        hostgroup_name  experiments
        hosts           !lcg0448
        contact_groups  experiments
        }

where lcg0448 is a host in host group experiments and I want to
apply the generic-hung-mounts check to all hosts in that group except
for lcg0448.
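
For readers unfamiliar with the "use" directive: generic-hung-mounts
above is a service template. A minimal sketch of what such a template
might look like follows. The service description, the check command
check_hung_mounts and the base template generic-service are hypothetical
names, since the real definitions are not shown in this thread.

    # Hypothetical template; check_hung_mounts is an assumed command name
    define service{
            name                    generic-hung-mounts  ; referenced by "use"
            use                     generic-service      ; assumed base template
            service_description     Hung Mounts
            check_command           check_hung_mounts
            register                0                    ; template only, not a real service
            }

Note that "register 0" only marks a definition as a template. It does
not cancel a service that another definition has already attached to a
host, which is probably why the "register 0" stanza described in the
original question had no effect.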

This can lead to configuration like this:

define service{
        use             check-pbs-offline
        hostgroup_name  workers
        hosts           !lcg0614,!lcg0617,!lcg0618,!lcg0626
        contact_groups  tier1a
        }
define service{
        use             check-pbs-offline
        hosts           lcg0614,lcg0617,lcg0618,lcg0626
        contact_groups  tier1a,grid-team
        }

where the only difference is that the hosts in the second definition
have a second contact group.
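
The same pattern might be applied to the CPU check from the original
question. The sketch below is an illustration under stated assumptions,
not a tested configuration: the hostgroup all-servers, the hosts bigbox1
and bigbox2, the generic-service template, and the check_cpu command
with its warning/critical arguments are all hypothetical names.

    # Default thresholds for every host in the group except the powerful ones
    define service{
            use                     generic-service
            hostgroup_name          all-servers
            hosts                   !bigbox1,!bigbox2
            service_description     CPU Usage
            check_command           check_cpu!80!90
            }

    # Higher thresholds for the powerful hosts only
    define service{
            use                     generic-service
            hosts                   bigbox1,bigbox2
            service_description     CPU Usage
            check_command           check_cpu!90!98
            }

Because each host now receives the CPU Usage service from exactly one
definition, the duplicate service errors described in the original
question should not occur.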

HTH

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory



Re: [Nagios-users] Monitoring large (ish) numbers of servers with exceptions to the rules...

2008-06-17 Thread Anthony Montibello
Hi,

Using regular expressions and object templates is key to optimizing maintenance.

I read some good details on what needs to be configured explicitly and what
can be inherited and automatically associated in the current Nagios 3
documentation.  I think much of the framework was already in Nagios 2, but the
documentation is a bit easier to read in Nagios 3, so look at that for some
tips, then check the Nagios 2 docs to see whether the option is also in there.

A few years ago I converted a Nagios 1.2 installation, where all hosts and
services were defined in a single file, to a scalable configuration similar to
what was initially described here.

I found that if you need to support different clients with daily
changes, it is convenient to have one config directory for each client, and in
that directory to have a single host file plus a separate config file for each
host.

When a host is removed, it is just a matter of removing it from the host
file and renaming its config file.
Adding a new host only means adding it to the host file, then copying an
existing service file and cutting and pasting to get all the services
defined.
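
One way such a layout might look, assuming Nagios is told to read each
client directory with the cfg_dir directive in nagios.cfg (the paths,
client names and file names below are hypothetical):

    /usr/local/nagios/etc/clients/
        clientA/
            hosts.cfg       (all host definitions for clientA)
            web01.cfg       (services for host web01)
            db01.cfg        (services for host db01)
        clientB/
            hosts.cfg
            mail01.cfg

    # nagios.cfg
    cfg_dir=/usr/local/nagios/etc/clients/clientA
    cfg_dir=/usr/local/nagios/etc/clients/clientB

With cfg_dir, Nagios only processes files whose names end in .cfg, so
renaming web01.cfg to something like web01.cfg.removed takes its
services out of the configuration while keeping the file for reference.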

Then maintain the entire directory substructure through CVS or some other
version control.
This, as noted, does get tedious to maintain, but it allows for customization
of services per host without much thinking.
The disadvantage of this is the time involved in maintenance when only a
few changes are being made.

Other options using templates work well:
setting up inheritance, using regular expressions, and other techniques using
HostGroups all assist with organizing the files but, depending on skill
levels, sometimes lead to less readability (while for other admins they would
lead to easier maintenance).
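
As a hedged illustration of the regular-expression approach: with
use_regexp_matching=1 set in nagios.cfg, a host_name value containing
*, ?, + or \. is treated as a regular expression. The host naming
scheme, the generic-service template and the check_http command
definition below are assumptions; check your own version's
documentation before relying on this.

    # nagios.cfg
    use_regexp_matching=1

    # Object configuration: attach an HTTP check to every host whose
    # name starts with "web"
    define service{
            use                     generic-service
            host_name               ^web.*
            service_description     HTTP
            check_command           check_http
            }

This removes the need to list web hosts one by one, at the cost of the
readability concern mentioned above: the set of hosts a service applies
to is no longer visible at a glance.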

Hope this helps,


On Tue, Jun 17, 2008 at 8:22 AM, Wheeler, JF (Jonathan) 
[EMAIL PROTECTED] wrote:

 One thing that I use in the configuration that I maintain is to have
 something like this:

 define service{
         use             generic-hung-mounts
         hostgroup_name  experiments
         hosts           !lcg0448
         contact_groups  experiments
         }

 where lcg0448 is a host in host group experiments and I want to
 apply the generic-hung-mounts check to all hosts in that group except
 for lcg0448.

 This can lead to configuration like this:

 define service{
         use             check-pbs-offline
         hostgroup_name  workers
         hosts           !lcg0614,!lcg0617,!lcg0618,!lcg0626
         contact_groups  tier1a
         }
 define service{
         use             check-pbs-offline
         hosts           lcg0614,lcg0617,lcg0618,lcg0626
         contact_groups  tier1a,grid-team
         }

 where the only difference is that the hosts in the second definition
 have a second contact group.

 HTH

 Jonathan Wheeler
 e-Science Centre
 Rutherford Appleton Laboratory
