Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
> On 8/28/13 14:43, C. Bensend wrote:
>> Are you saying I just need gearmand running on the collector?
>
> Well, I assumed so. You are the only one who can really tell.
> You will need a worker on each host which should run checks. If your
> collector should not run any checks, then no worker is necessary.
>
> See http://labs.consol.de/nagios/mod-gearman/#_common_scenarios for a list
> of common setups.

OK, yes, I grok that. I guess I would want the collector to be *able* to
run checks if it doesn't get timely information from the pollers. I'm
assuming that's why it's even trying in the first place: it doesn't see a
result in a timely manner, so it thinks it should run one.

Which circles back to my original question: why can't it run the check?
Why isn't it finding what it needs to find? The workers are running as
the nagios user, and I don't see anything that appears pertinent in the
mod_gearman_worker.conf file... What am I missing?

Neither the gearmand.log nor the mod_gearman_worker.log file seems to
have any complaints (but I haven't bumped up the debug level on them yet).

Thanks so much for your help!

Benny

--
"No matter how tempted I am with the prospect of unlimited power, I
will not consume any energy field bigger than my head."
   -- #22 on Peter Anspach's Evil Overlord list

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
On 8/28/13 14:43, C. Bensend wrote:
> Are you saying I just need gearmand running on the collector?

Well, I assumed so. You are the only one who can really tell.
You will need a worker on each host which should run checks. If your
collector should not run any checks, then no worker is necessary.

See http://labs.consol.de/nagios/mod-gearman/#_common_scenarios for a list
of common setups.

Sven
Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
> On 8/22/13 13:51, C. Bensend wrote:
>> CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
>> you're trying to run actually exists. (worker: collector.domain.org)
>
> Hi,
>
> if this is the collector host, why does it have a mod-gearman worker
> installed? If nagios had run the check by itself, there would be no
> hint about the worker in the error. So it seems like there is a worker
> started on your collector host which then grabs some checks but isn't
> able to execute them.

Oh ho! I have multiple *gearman* processes running:

ps axuww | grep gearman
gearmand  5662  0.7  0.1 404672  2496 ?  Ssl  Aug17 118:29 /usr/sbin/gearmand -d -l /var/log/gearmand/gearmand.log
nagios    5712  0.0  0.0  38024   640 ?  Ss   Aug17   1:03 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios   25919  0.0  0.1 137492  3016 ?  S    07:38   0:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
.. etc ..

Are you saying I just need gearmand running on the collector? I'm quite
new to gearman, so I might have misunderstood which parts are necessary
where. I can easily shut down the mod_gearman_worker service, I just
need to understand the consequences.

I assumed that this was a Nagios error - perhaps I just have my gearman
setup configured wrong.

Benny

--
"No matter how tempted I am with the prospect of unlimited power, I
will not consume any energy field bigger than my head."
   -- #22 on Peter Anspach's Evil Overlord list
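
Editor's sketch: the `ps` listing above mixes two different daemons, the gearmand job server and the mod_gearman_worker processes that actually execute checks. A small, hypothetical helper like the one below could tally which of the two are present on a host. The function name and the assumption of the standard 11-column `ps aux` layout are mine, not something from the thread; the sample rows are the ones quoted in the message.

```python
def gearman_roles(ps_output):
    """Count gearmand and mod_gearman_worker processes in `ps aux`-style text."""
    counts = {"gearmand": 0, "mod_gearman_worker": 0}
    for row in ps_output.splitlines():
        # Assumes 11 columns (USER ... TIME COMMAND); the last holds the command line.
        fields = row.split(None, 10)
        if len(fields) < 11:
            continue
        # Reduce "/usr/sbin/gearmand -d ..." to the bare executable name.
        executable = fields[10].split()[0].rsplit("/", 1)[-1]
        if executable in counts:
            counts[executable] += 1
    return counts


sample = """\
gearmand  5662  0.7  0.1 404672  2496 ?  Ssl  Aug17 118:29 /usr/sbin/gearmand -d -l /var/log/gearmand/gearmand.log
nagios    5712  0.0  0.0  38024   640 ?  Ss   Aug17   1:03 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf
nagios   25919  0.0  0.1 137492  3016 ?  S    07:38   0:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf
"""
print(gearman_roles(sample))
```

A non-zero mod_gearman_worker count on the collector would be consistent with the diagnosis in this thread that a local worker is picking up check jobs.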
Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
On 8/22/13 13:51, C. Bensend wrote:
> CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
> you're trying to run actually exists. (worker: collector.domain.org)

Hi,

if this is the collector host, why does it have a mod-gearman worker
installed? If nagios had run the check by itself, there would be no
hint about the worker in the error. So it seems like there is a worker
started on your collector host which then grabs some checks but isn't
able to execute them.

Regards,
Sven

--
Sven Nierlein                  sven.nierl...@consol.de
ConSol* GmbH                   http://www.consol.de
Franziskanerstrasse 38         Tel.: 089/45841-439
81669 Muenchen                 Fax.: 089/45841-111
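
Editor's sketch: the diagnostic used above is that the error text carries a trailing "(worker: ...)" tag, which would not be there had Nagios run the check itself. Assuming the tag always has the shape seen in the one log line quoted in this thread (that format is inferred from the quote, not from mod_gearman documentation), a hypothetical helper to pull out the worker's hostname might look like this:

```python
import re

# Trailing "(worker: <hostname>)" tag, as seen in the quoted error line.
# The exact format is an assumption based on that single example.
WORKER_TAG = re.compile(r"\(worker:\s*(?P<host>[^)\s]+)\)\s*$")


def executing_worker(plugin_output):
    """Return the worker hostname tagged on the output, or None if untagged."""
    match = WORKER_TAG.search(plugin_output)
    return match.group("host") if match else None


line = ("CRITICAL: Return code of 127 is out of bounds. Make sure the plugin "
        "you're trying to run actually exists. (worker: collector.domain.org)")
print(executing_worker(line))
```

Grouping failed checks by the returned hostname would show whether the failures all come from the worker on the collector, as suspected here, or from the pollers as well.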
Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
> Do you get many of those error messages in the logs at once, or just
> one at a time?
>
> Only one thought: what are the permissions on the file that defines
> your $USER$ variables? Nagios on my systems calls setuid() to drop to
> a non-root user after startup, and if it gets a SIGHUP to reload its
> config but can't read the file defining $USER*$, it will act strangely.

Just one at a time, seemingly randomly. A host here, a service there,
several times a day. They almost always recover immediately, but I don't
understand why my centralized collector seems to have this issue.

Nagios runs as the nagios user, which can read the resource.cfg file
fine:

ls -ld . ; ls -l nagios-hostname.cfg resource.cfg
drwxrwx--- 6 root nagios  4096 Aug 27 16:02 .
-rw-r--r-- 1 root root   47606 Jul  1 11:18 nagios-hostname.cfg
-rw-r----- 1 root nagios  2400 Mar 19 11:25 resource.cfg

Thanks!

--
"No matter how tempted I am with the prospect of unlimited power, I
will not consume any energy field bigger than my head."
   -- #22 on Peter Anspach's Evil Overlord list
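
Editor's sketch: the permission question above comes down to the classic owner/group/other test. The helper below applies that test to a file mode; it is a simplified illustration that ignores ACLs, root's override, and the search permission needed on the parent directory. The numeric IDs are hypothetical, and treating resource.cfg as mode 0640 owned by root:nagios is an assumption about the listing.

```python
import stat


def can_read(mode, owner_uid, group_gid, uid, gids):
    """Owner/group/other read test: the first matching class decides."""
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if group_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


# Hypothetical IDs: root is uid 0, the nagios user and group are both 500.
RESOURCE_CFG_MODE = 0o640  # assumed rw-r-----, owner root, group nagios
print(can_read(RESOURCE_CFG_MODE, 0, 500, uid=500, gids={500}))    # nagios user
print(can_read(RESOURCE_CFG_MODE, 0, 500, uid=1000, gids={1000}))  # unrelated user
```

Under those assumptions the nagios user passes through the group class, which matches the statement that it can read the file.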
Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
Do you get many of those error messages in the logs at once, or just
one at a time?

Only one thought: what are the permissions on the file that defines
your $USER$ variables? Nagios on my systems calls setuid() to drop to
a non-root user after startup, and if it gets a SIGHUP to reload its
config but can't read the file defining $USER*$, it will act strangely.

Justin

On Wed, Aug 28, 2013 at 06:48:09AM -0500, C. Bensend wrote:
>
> > I'm continuing to iron out the wrinkles with 3.5.1 and distributed
> > monitoring. I'm using mod_gearman to submit and receive events from
> > two distributed pollers.
> >
> > Every now and again, I'll get something similar in the log on the
> > centralized collecting machine:
> >
> > CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
> > you're trying to run actually exists. (worker: collector.domain.org)
> >
> > To me, that suggests that the collector system didn't get a result
> > for a host or service in a timely manner from one of the polling
> > systems, and so it attempted to run an active check itself. However,
> > it doesn't seem to be able to, and I don't know why.
> >
> > The collector has the same value for $USER1$, and it has the same
> > set of plugins installed on it:
> >
> > On the collector:
> >
> > grep USER1 etc/resource.cfg
> > $USER1$=/usr/local/nagios/libexec
> >
> > On the two pollers:
> >
> > $USER1$=/usr/local/nagios/libexec
> > $USER1$=/usr/local/nagios/libexec
> >
> > The plugins are installed in identical locations on all three systems,
> > that's enforced via Puppet. The 'nagios' user can find and run them on
> > the collector:
> >
> > /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
> > NRPE v2.13
> >
> > Now, because this is a distributed setup, the collector system is
> > not configured to run active checks:
> >
> > grep ^execute etc/nagios.cfg
> > execute_service_checks=0
> > execute_host_checks=0
> >
> > ... but *obviously* it's trying to. Is it failing because it's
> > configured to not run them? If that's the case, the error message is
> > not accurate and should be corrected. If that's *not* the case, why
> > can't my collector server run an active check when it believes it needs
> > to?
> >
> > I use NConf to generate my configurations, if that matters. There are
> > a *lot* of hosts/services and quite a few configuration files, so I'm not
> > going to paste a slew of information here. If I'm missing pertinent
> > information, please let me know exactly what you want to see and I'll
> > get it.
>
> No one has an idea about this? And no, Andreas, I can't move to
> 4.0 yet. ;)
>
> Thanks!
>
> Benny
>
> --
> "No matter how tempted I am with the prospect of unlimited power, I
> will not consume any energy field bigger than my head."
>    -- #22 on Peter Anspach's Evil Overlord list
Re: [Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
> I'm continuing to iron out the wrinkles with 3.5.1 and distributed
> monitoring. I'm using mod_gearman to submit and receive events from
> two distributed pollers.
>
> Every now and again, I'll get something similar in the log on the
> centralized collecting machine:
>
> CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
> you're trying to run actually exists. (worker: collector.domain.org)
>
> To me, that suggests that the collector system didn't get a result
> for a host or service in a timely manner from one of the polling
> systems, and so it attempted to run an active check itself. However,
> it doesn't seem to be able to, and I don't know why.
>
> The collector has the same value for $USER1$, and it has the same
> set of plugins installed on it:
>
> On the collector:
>
> grep USER1 etc/resource.cfg
> $USER1$=/usr/local/nagios/libexec
>
> On the two pollers:
>
> $USER1$=/usr/local/nagios/libexec
> $USER1$=/usr/local/nagios/libexec
>
> The plugins are installed in identical locations on all three systems,
> that's enforced via Puppet. The 'nagios' user can find and run them on
> the collector:
>
> /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
> NRPE v2.13
>
> Now, because this is a distributed setup, the collector system is
> not configured to run active checks:
>
> grep ^execute etc/nagios.cfg
> execute_service_checks=0
> execute_host_checks=0
>
> ... but *obviously* it's trying to. Is it failing because it's
> configured to not run them? If that's the case, the error message is
> not accurate and should be corrected. If that's *not* the case, why
> can't my collector server run an active check when it believes it needs
> to?
>
> I use NConf to generate my configurations, if that matters. There are
> a *lot* of hosts/services and quite a few configuration files, so I'm not
> going to paste a slew of information here. If I'm missing pertinent
> information, please let me know exactly what you want to see and I'll
> get it.

No one has an idea about this? And no, Andreas, I can't move to
4.0 yet. ;)

Thanks!

Benny

--
"No matter how tempted I am with the prospect of unlimited power, I
will not consume any energy field bigger than my head."
   -- #22 on Peter Anspach's Evil Overlord list
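
Editor's sketch: some background on the number in that error. Nagios plugins are expected to exit with 0 to 3 (OK, WARNING, CRITICAL, UNKNOWN), which is why 127 is reported as "out of bounds". 127 is also the exit status a POSIX shell returns when it cannot find the command it was asked to run. The snippet below reproduces that status with a deliberately non-existent, hypothetical plugin path; whether the process in this thread launches plugins through a shell is an assumption, not something the messages establish.

```python
import subprocess


def shell_exit_code(command_line):
    """Run a command line through /bin/sh and return its exit status."""
    completed = subprocess.run(
        ["/bin/sh", "-c", command_line],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


# Hypothetical path that does not exist on the machine running this sketch.
missing = "/hypothetical/libexec/check_nrpe -H 127.0.0.1"
print(shell_exit_code(missing))
```

Running the exact command line from a failing check by hand, as the same user and in the same environment as the process that executes it, is one way to see whether the 127 can be reproduced.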