Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks

2013-08-28 Thread C. Bensend

> On 8/28/13 14:43, C. Bensend wrote:
>> Are you saying I just need gearmand running on the collector?
>
> Well, I assumed so. You are the only one who can really tell.
> You will need a worker on each host which should run checks. If your
> collector should not run any checks, then no worker is necessary.
>
> See http://labs.consol.de/nagios/mod-gearman/#_common_scenarios for a list
> of common setups.

OK, yes, I grok that.  I guess I would want the collector to be *able*
to run checks, if it doesn't get timely information from the pollers.
I'm assuming that's why it's even trying in the first place - it
doesn't see a result in a timely manner, so it thinks it should run
one.

Which circles back to my original question - why can't it run the
check?  Why isn't it finding what it needs to find?  The workers
are running as the nagios user, and I don't see anything that appears
pertinent in the mod_gearman_worker.conf file...  What am I missing?
Neither the gearmand.log nor the mod_gearman_worker.log files seem
to have any complaints (but I haven't bumped up the debug on them yet).

Thanks so much for your help!

Benny


-- 
"No matter how tempted I am with the prospect of unlimited power, I
will not consume any energy field bigger than my head."
  -- #22 on Peter Anspach's Evil Overlord list


--
Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
Discover the easy way to master current and previous Microsoft technologies
and advance your career. Get an incredible 1,500+ hours of step-by-step
tutorial videos with LearnDevNow. Subscribe today and save!
http://pubads.g.doubleclick.net/gampad/clk?id=58040911&iu=/4140/ostg.clktrk
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks

2013-08-28 Thread Sven Nierlein
On 8/28/13 14:43, C. Bensend wrote:
> Are you saying I just need gearmand running on the collector?

Well, I assumed so. You are the only one who can really tell.
You will need a worker on each host which should run checks. If your
collector should not run any checks, then no worker is necessary.

See http://labs.consol.de/nagios/mod-gearman/#_common_scenarios for a list
of common setups.

  Sven



Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks

2013-08-28 Thread C. Bensend

> On 8/22/13 13:51, C. Bensend wrote:
>> CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
>> you're trying to run actually exists. (worker: collector.domain.org)
>
> Hi,
>
> If this is the collector host, why does it have a mod-gearman worker
> installed? If Nagios had run the check itself, there would be no mention
> of the worker in the error. So it seems there is a worker running on
> your collector host which grabs some checks but isn't able to execute
> them.

Oh ho!  I have multiple *gearman* processes running:

ps axuww | grep gearman
gearmand  5662  0.7  0.1 404672  2496 ?        Ssl  Aug17 118:29 /usr/sbin/gearmand -d -l /var/log/gearmand/gearmand.log
nagios    5712  0.0  0.0  38024   640 ?        Ss   Aug17   1:03 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   25919  0.0  0.1 137492  3016 ?        S    07:38   0:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid

.. etc ..

Are you saying I just need gearmand running on the collector?  I'm
quite new to gearman, so I might have misunderstood which parts are
necessary where.  I can easily shut down the mod_gearman_worker
service, I just need to understand the consequences.

I assumed that this was a Nagios error - perhaps I just have my
gearman setup configured wrong.
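
For anyone wanting to check the same thing on their own boxes, here is a
rough sketch that counts worker processes in ps output. count_workers is a
hypothetical helper name, and it assumes the process is called
mod_gearman_worker as in the listing above:

```shell
#!/bin/sh
# Hypothetical helper: count mod_gearman_worker processes.
# Reads one command line per row on stdin (e.g. from "ps -eo args").
count_workers() {
    # Match only rows whose first word ends in mod_gearman_worker, so a
    # "grep mod_gearman_worker" row is not counted. grep -c exits non-zero
    # on a zero count, hence the "|| true".
    grep -Ec '^[^ ]*mod_gearman_worker( |$)' || true
}

# Typical use on the collector or a poller:
#   ps -eo args | count_workers
```

A non-zero count on the collector means a worker there can pick up check
jobs from the queue.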

Benny






Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks

2013-08-28 Thread Sven Nierlein
On 8/22/13 13:51, C. Bensend wrote:
> CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
> you're trying to run actually exists. (worker: collector.domain.org)

Hi,

If this is the collector host, why does it have a mod-gearman worker installed?
If Nagios had run the check itself, there would be no mention of the worker in
the error. So it seems there is a worker running on your collector host which
grabs some checks but isn't able to execute them.
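
As a side note, 127 is the exit status a POSIX shell reports when it cannot
find the command it was asked to run, which fits the "actually exists"
wording of the error. A minimal sketch to see that status for yourself,
assuming the worker hands the command line to a shell; plugin_rc is a
made-up helper name and the path is deliberately bogus:

```shell
#!/bin/sh
# Hypothetical helper: run a command line through sh, roughly as a worker
# might, and print only its exit status.
plugin_rc() {
    rc=0
    sh -c "$1" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

# A command that does not exist yields 127 ("command not found").
plugin_rc "/nonexistent/libexec/check_dummy"
```

Running the failing check's real command line this way, as the user the
worker runs as, should show whether the 127 can be reproduced by hand.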

Regards,
  Sven


-- 
Sven Nierlein sven.nierl...@consol.de
ConSol* GmbH  http://www.consol.de
Franziskanerstrasse 38    Tel.: 089/45841-439
81669 Muenchen            Fax.: 089/45841-111




Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks

2013-08-28 Thread C. Bensend

> Do you get many of those error messages in the logs at once, or just
> one at a time?
>
> Only one thought: what are the permissions on your $USER$ variables?
> On my systems Nagios calls setuid() to drop to a non-root user after
> startup; if it then gets a SIGHUP to reload its config but can't read
> the file defining $USER*$, it acts strangely.

Just one at a time, seemingly randomly.  A host here, a service there,
several times a day.  They always almost immediately recover, but I
don't understand why my centralized collector seems to have this issue.

Nagios runs as the nagios user, which can read the resource.cfg file
fine:

ls -ld . ; ls -l nagios-hostname.cfg resource.cfg
drwxrwx--- 6 root nagios 4096 Aug 27 16:02 .
-rw-r--r-- 1 root root   47606 Jul  1 11:18 nagios-hostname.cfg
-rw-r----- 1 root nagios  2400 Mar 19 11:25 resource.cfg
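
The same question can be asked directly with test -r when run as the
nagios user. A small sketch; check_readable is a hypothetical helper name,
and the file names are the ones from the listing above:

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can read each file.
check_readable() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "readable: $f"
        else
            echo "NOT readable: $f"
        fi
    done
}

# Run as the nagios user from the Nagios etc/ directory, for example:
#   check_readable nagios-hostname.cfg resource.cfg
```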

Thanks!






Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks

2013-08-28 Thread Justin Pryzby
Do you get many of those error messages in the logs at once, or just
one at a time?

Only one thought: what are the permissions on your $USER$ variables?
On my systems Nagios calls setuid() to drop to a non-root user after
startup; if it then gets a SIGHUP to reload its config but can't read
the file defining $USER*$, it acts strangely.

Justin

On Wed, Aug 28, 2013 at 06:48:09AM -0500, C. Bensend wrote:
> 
> > I'm continuing to iron out the wrinkles with Nagios 3.5.1 and distributed
> > monitoring.  I'm using mod_gearman to submit and receive events from
> > two distributed pollers.
> >
> > Every now and again, I'll get something like this in the log on the
> > centralized collecting machine:
> >
> > CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
> > you're trying to run actually exists. (worker: collector.domain.org)
> >
> > To me, that suggests that the collector system didn't get a result
> > for a host or service in a timely manner from one of the polling
> > systems, and so it attempted to run an active check itself.  However,
> > it doesn't seem to be able to, and I don't know why.
> >
> > The collector has the same value for $USER1$, and it has the same
> > set of plugins installed on it:
> >
> > On the collector:
> >
> > grep USER1 etc/resource.cfg
> > $USER1$=/usr/local/nagios/libexec
> >
> > On the two pollers:
> >
> > $USER1$=/usr/local/nagios/libexec
> > $USER1$=/usr/local/nagios/libexec
> >
> > The plugins are installed in identical locations on all three systems;
> > that's enforced via Puppet.  The 'nagios' user can find and run them on
> > the collector:
> >
> > /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
> > NRPE v2.13
> >
> > Now, because this is a distributed setup, the collector system is
> > not configured to run active checks:
> >
> > grep ^execute etc/nagios.cfg
> > execute_service_checks=0
> > execute_host_checks=0
> >
> > ... but *obviously* it's trying to.  Is it failing because it's
> > configured to not run them?  If that's the case, the error message is
> > not accurate and should be corrected.  If that's *not* the case, why
> > can't my collector server run an active check when it believes it needs
> > to?
> >
> > I use NConf to generate my configurations, if that matters.  There are
> > a *lot* of hosts/services and quite a few configuration files, so I'm not
> > going to paste a slew of information here.  If I'm missing pertinent
> > information, please let me know exactly what you want to see and I'll
> > get it.
> 
> No one has an idea about this?  And no, Andreas, I can't move to
> 4.0 yet.  ;)
> 
> Thanks!
> 
> Benny
> 
> 



Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks

2013-08-28 Thread C. Bensend

> I'm continuing to iron out the wrinkles with Nagios 3.5.1 and distributed
> monitoring.  I'm using mod_gearman to submit and receive events from
> two distributed pollers.
>
> Every now and again, I'll get something like this in the log on the
> centralized collecting machine:
>
> CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
> you're trying to run actually exists. (worker: collector.domain.org)
>
> To me, that suggests that the collector system didn't get a result
> for a host or service in a timely manner from one of the polling
> systems, and so it attempted to run an active check itself.  However,
> it doesn't seem to be able to, and I don't know why.
>
> The collector has the same value for $USER1$, and it has the same
> set of plugins installed on it:
>
> On the collector:
>
> grep USER1 etc/resource.cfg
> $USER1$=/usr/local/nagios/libexec
>
> On the two pollers:
>
> $USER1$=/usr/local/nagios/libexec
> $USER1$=/usr/local/nagios/libexec
>
> The plugins are installed in identical locations on all three systems;
> that's enforced via Puppet.  The 'nagios' user can find and run them on
> the collector:
>
> /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
> NRPE v2.13
>
> Now, because this is a distributed setup, the collector system is
> not configured to run active checks:
>
> grep ^execute etc/nagios.cfg
> execute_service_checks=0
> execute_host_checks=0
>
> ... but *obviously* it's trying to.  Is it failing because it's
> configured to not run them?  If that's the case, the error message is
> not accurate and should be corrected.  If that's *not* the case, why
> can't my collector server run an active check when it believes it needs
> to?
>
> I use NConf to generate my configurations, if that matters.  There are
> a *lot* of hosts/services and quite a few configuration files, so I'm not
> going to paste a slew of information here.  If I'm missing pertinent
> information, please let me know exactly what you want to see and I'll
> get it.
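
The $USER1$ comparison quoted above can be scripted so that it checks the
plugin itself as well. A sketch only: user1_dir and plugin_ok are
hypothetical helper names, and it assumes resource.cfg holds a single
$USER1$=... line as shown:

```shell
#!/bin/sh
# Hypothetical helper: print the directory assigned to $USER1$ in a
# Nagios resource file.
user1_dir() {
    sed -n 's/^\$USER1\$=//p' "$1" | head -n 1
}

# Hypothetical helper: succeed if the named plugin exists under $USER1$
# and is executable by the current user.
plugin_ok() {
    dir=$(user1_dir "$1")
    [ -n "$dir" ] && [ -x "$dir/$2" ]
}

# Example, run as the nagios user on each box:
#   plugin_ok etc/resource.cfg check_nrpe && echo found || echo missing
```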

No one has an idea about this?  And no, Andreas, I can't move to
4.0 yet.  ;)

Thanks!

Benny



