Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
> diff -uNp nagios-updated.cfg nagios.cfg
> --- nagios-updated.cfg  Sat May 25 09:05:09 2013
> +++ nagios.cfg  Sat May 25 09:02:37 2013
> @@ -981,9 +981,9 @@ translate_passive_host_checks=0
>
>  # PASSIVE HOST CHECKS ARE SOFT OPTION
>  # This determines whether or not Nagios will treat passive host
> -# checks as being HARD or SOFT.  By default, a single passive host
> -# check result will put a host into an immediate HARD state type.
> -# This can be changed by enabling this option.
> +# checks as being HARD or SOFT.  By default, a passive host check
> +# result will put a host into a HARD state type.  This can be changed
> +# by enabling this option.
>  # Values: 0 = passive checks are HARD, 1 = passive checks are SOFT
>
>  passive_host_checks_are_soft=0
>
> Does that make sense?  If I had read something like that, it would
> have been immediately clear to me what was happening.
>
> Thank you so much, Andreas!  On to the next problem with the
> upgrade (something that can wait until next week)...

Sorry, too little caffeine too early, got the files reversed.  Here's
the right diff:

diff -uNp nagios.cfg nagios-updated.cfg
--- nagios.cfg  Sat May 25 10:25:34 2013
+++ nagios-updated.cfg  Sat May 25 10:27:12 2013
@@ -981,9 +981,9 @@ translate_passive_host_checks=0

 # PASSIVE HOST CHECKS ARE SOFT OPTION
 # This determines whether or not Nagios will treat passive host
-# checks as being HARD or SOFT.  By default, a passive host check
-# result will put a host into a HARD state type.  This can be changed
-# by enabling this option.
+# checks as being HARD or SOFT.  By default, a single passive host
+# check result will put a host into an immediate HARD state type.
+# This can be changed by enabling this option.
 # Values: 0 = passive checks are HARD, 1 = passive checks are SOFT

 passive_host_checks_are_soft=0
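For anyone else landing on this thread, the actual fix boils down to a
few commands on the central server.  A sketch only - the paths assume a
default /usr/local/nagios source install, so adjust to your layout:

    # Flip the option on the master/console server that receives the
    # passive results from the pollers
    sed -i 's/^passive_host_checks_are_soft=0/passive_host_checks_are_soft=1/' \
        /usr/local/nagios/etc/nagios.cfg

    # Sanity-check the configuration before reloading
    /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

    # Reload so the new setting takes effect
    /etc/init.d/nagios reload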
Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
> On 2013-05-23 17:43, C. Bensend wrote:
>>
>> Hey folks,
>>
>>    I recently made two major changes to my Nagios environment:
>>
>> 1) I upgraded to v3.5.0.
>> 2) I moved from a single server to two pollers sending passive
>>    results to one central console server.
>>
>> *snip*
>>
>>    Um.  Wat?  Why would the console immediately trigger a hard
>> state?  The config files don't support this decision.  And this
>> IS a problem with the console server - the distributed monitors
>> continue checking the host 6 times like they should.  But
>> for some reason, the centralized console just immediately
>> calls it a hard state.
>>
>> *snip*
>
> Set passive_host_checks_are_soft=1 in nagios.cfg on your master
> server and things should start working as intended.
>
> --
> Andreas Ericsson                   andreas.erics...@op5.se

Oh lord, THANK YOU.  That appears to have fixed that problem, which
was a pain in the ass.

In my defense, I *did* see that option, but the way I interpreted
the comments didn't quite match up with the behavior I was seeing.
I should have experimented with it, I guess.

A slight adjustment to the comments would have thrown a red flag
for me - perhaps this is just a matter of personal interpretation,
but maybe the comments could be a bit more specific:

diff -uNp nagios-updated.cfg nagios.cfg
--- nagios-updated.cfg  Sat May 25 09:05:09 2013
+++ nagios.cfg  Sat May 25 09:02:37 2013
@@ -981,9 +981,9 @@ translate_passive_host_checks=0

 # PASSIVE HOST CHECKS ARE SOFT OPTION
 # This determines whether or not Nagios will treat passive host
-# checks as being HARD or SOFT.  By default, a single passive host
-# check result will put a host into an immediate HARD state type.
-# This can be changed by enabling this option.
+# checks as being HARD or SOFT.  By default, a passive host check
+# result will put a host into a HARD state type.  This can be changed
+# by enabling this option.
 # Values: 0 = passive checks are HARD, 1 = passive checks are SOFT

 passive_host_checks_are_soft=0

Does that make sense?
If I had read something like that, it would have been immediately
clear to me what was happening.

Thank you so much, Andreas!  On to the next problem with the
upgrade (something that can wait until next week)...

Benny

--
"The very existence of flamethrowers proves that sometime, somewhere,
someone said to themselves, 'You know, I want to set those people
over there on fire, but I'm just not close enough to get the job
done.'"                                           -- George Carlin
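If you want to reproduce the before/after behavior by hand, here is a
quick sketch of feeding a passive host result into the console's
external command file.  The path below is only the common default -
use whatever command_file points at in your nagios.cfg:

    # External command format:
    #   [timestamp] PROCESS_HOST_CHECK_RESULT;<host_name>;<status>;<plugin_output>
    # where status is 0 = UP, 1 = DOWN, 2 = UNREACHABLE
    now=$(date +%s)
    printf '[%s] PROCESS_HOST_CHECK_RESULT;hostname;1;CRITICAL - test result\n' "$now" \
        >> /usr/local/nagios/var/rw/nagios.cmd

With passive_host_checks_are_soft=0 (the default), that single result
lands as DOWN;HARD;1 and notifies immediately; with it set to 1, the
host goes DOWN;SOFT;1 and max_check_attempts applies as usual.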
Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
On 2013-05-23 17:43, C. Bensend wrote:
>
> Hey folks,
>
>    I recently made two major changes to my Nagios environment:
>
> 1) I upgraded to v3.5.0.
> 2) I moved from a single server to two pollers sending passive
>    results to one central console server.
>
> *snip*
>
> And then the centralized server IMMEDIATELY goes into a hard state,
> which triggers a notification:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT:
> hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
> May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
> cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
> Host Unreachable (a.b.c.d)
>
>    Um.  Wat?  Why would the console immediately trigger a hard
> state?  The config files don't support this decision.  And this
> IS a problem with the console server - the distributed monitors
> continue checking the host 6 times like they should.  But
> for some reason, the centralized console just immediately
> calls it a hard state.
>
> *snip*

Set passive_host_checks_are_soft=1 in nagios.cfg on your master
server and things should start working as intended.

--
Andreas Ericsson                   andreas.erics...@op5.se
Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
> I ran into a similar problem, because my template set the service
> to "is_volatile=1".
>
> http://nagios.sourceforge.net/docs/3_0/volatileservices.html

Hrmmm.  Good point...  However, is_volatile does not appear in any
of my configuration files, for any of the Nagios servers.  It isn't
set by default, is it?  The Nagios "config.cgi" page doesn't even
list it, and livestatus (what I use to query my running daemon)
doesn't give it as a column it can query.  I can't imagine it's on
by default in v3.5.0, but I can't really tell if it is or not.

I can try explicitly *disabling* it in all hosts, but I can't really
test that at the moment - out of here for a long weekend in a few
minutes.  If it gets annoying enough over the weekend, I might
*have* to test that theory.

Thank you very much.  I will still appreciate any input others can
give on this question - it just doesn't seem to be behaving as it's
configured!

Benny
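One way to settle the "is it set by default?" question without
livestatus or config.cgi: at startup Nagios dumps its fully-resolved
object definitions - template inheritance already applied - to the
object cache file, so the effective value shows up there.  A sketch,
assuming the stock object_cache_file location:

    # Count the effective is_volatile values across all services;
    # anything nonzero would confirm the volatile-service theory
    grep 'is_volatile' /usr/local/nagios/var/objects.cache | sort | uniq -c

(Stock Nagios 3.x defaults is_volatile to 0 when the directive is
absent from the service definition and its templates.)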
Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
I ran into a similar problem, because my template set the service to
"is_volatile=1".

http://nagios.sourceforge.net/docs/3_0/volatileservices.html

Check to see if you have this flag enabled.

Doug

Sincerely,
Doug Eubanks
ad...@dougware.net
K1DUG
(919) 201-8750


On Thu, May 23, 2013 at 11:43 AM, C. Bensend wrote:
>
> Hey folks,
>
>    I recently made two major changes to my Nagios environment:
>
> 1) I upgraded to v3.5.0.
> 2) I moved from a single server to two pollers sending passive
>    results to one central console server.
>
> *snip*
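For reference, the flag Doug is describing lives on the *service*
definition, not the host.  A minimal sketch of pinning it off
explicitly so no template can sneak it in (the template name here is
made up):

    define service {
            name                    not-volatile-template  ; hypothetical
            register                0
            is_volatile             0   ; 0 = normal (the default)
                                        ; 1 = volatile: every non-OK result
                                        ; is handled as if the service just
                                        ; entered a hard problem state
                                        ; (logged + notified each time)
    }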
[Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
Hey folks,

   I recently made two major changes to my Nagios environment:

1) I upgraded to v3.5.0.
2) I moved from a single server to two pollers sending passive
   results to one central console server.

   Now, this new distributed system was in place for several months
while I tested, and it worked fine.  HOWEVER, since this was running
in parallel with my production system, notifications were disabled.
Hence, I didn't see this problem until I cut over for real and
enabled notifications.

(please excuse any cut-n-paste ugliness, had to send this info from
my work account via Outlook and then try to cleanse and reformat
via Squirrelmail)

   As a test and to capture information, I rebooted 'hostname'.  This
log is from the nagios-console host, which is the host that accepts
the passive check results and sends notifications.  Here is the
console host receiving a service check failure when the host is
restarting:

May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk
queue;CRITICAL;SOFT;1;Connection refused by host

So, the distributed poller system checks the host and sends its
results to the console server:

May 22 15:57:30 nagios-console nagios: HOST ALERT:
hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)

And then the centralized server IMMEDIATELY goes into a hard state,
which triggers a notification:

May 22 15:57:30 nagios-console nagios: HOST ALERT:
hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
Host Unreachable (a.b.c.d)

   Um.  Wat?  Why would the console immediately trigger a hard
state?  The config files don't support this decision.  And this
IS a problem with the console server - the distributed monitors
continue checking the host 6 times like they should.  But for
some reason, the centralized console just immediately calls it
a hard state.
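For reference, the progression I expected for a failed host, given
max_check_attempts 6 and retry_interval 1 in the templates below, is
the standard retry ladder:

    check 1 fails  ->  DOWN, SOFT, attempt 1   (no notification)
    ...
    check 5 fails  ->  DOWN, SOFT, attempt 5   (no notification)
    check 6 fails  ->  DOWN, HARD              (notification goes out)

Instead, the console jumps straight to DOWN;HARD;1 on the very first
passive result.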
Definitions on the distributed monitoring host (the one running
the actual host and service checks for this host 'hostname'):

define host {
        host_name               hostname
        alias                   Old production Nagios server
        address                 a.b.c.d
        action_url              /pnp4nagios/graph?host=$HOSTNAME$
        icon_image_alt          Red Hat Linux
        icon_image              redhat.png
        statusmap_image         redhat.gd2
        check_command           check-host-alive
        check_period            24x7
        notification_period     24x7
        contact_groups          linux-infrastructure-admins
        use                     linux-host-template
}

The linux-host-template on that same system:

define host {
        name                    linux-host-template
        register                0
        max_check_attempts      6
        check_interval          5
        retry_interval          1
        notification_interval   360
        notification_options    d,r
        active_checks_enabled   1
        passive_checks_enabled  1
        notifications_enabled   1
        check_freshness         0
        check_period            24x7
        notification_period     24x7
        check_command           check-host-alive
        contact_groups          linux-infrastructure-admins
}

And said command to determine up or down:

define command {
        command_name            check-host-alive
        command_line            $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 1.0,100% -p 5
}

Definitions on the centralized console host (the one that notifies):

define host {
        host_name               hostname
        alias                   Old production Nagios server
        address                 a.b.c.d
        action_url              /pnp4nagios/graph?host=$HOSTNAME$
        icon_image_alt          Red Hat Linux
        icon_image              redhat.png
        statusmap_image         redhat.gd2
        check_command           check-host-alive
        check_period            24x7
        notification_period     24x7
        contact_groups          linux-infrastructure-admins
        use                     linux-host-template,Default_monitor_server
}

The "Default_monitor_server" template on the centralized server:

define host {
        name                    Default_monitor_server
        register                0
        active_checks_enabled   0
        passive_checks_enabled  1
        notifications_enabled   1
        check_freshness         0
        freshness_threshold     86400
}

And the linux-host-template template on that same centralized host:

define host {
        name                    linux-host-template
        register                0
        max_check_attempts      6
        check_interval          5
        retry_interval          1
        notification_interval   360
        notification_options    d,r
        active_checks_enabled   1
        passive_checks_enabled  1
        notifications_enabled   1
        check_freshness         0
        check_period            24x7
        notification_period     24x7
        ...
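One thing worth noting about the Default_monitor_server template
above: with check_freshness set to 0, the freshness_threshold of
86400 is inert.  If stale-result detection is wanted on a
passive-only console, a hypothetical variant would look like the
sketch below - when a result goes stale, Nagios forces an active run
of the host's check_command even though active_checks_enabled is 0:

    define host {
            name                    Default_monitor_server  ; freshness-enabled variant
            register                0
            active_checks_enabled   0
            passive_checks_enabled  1
            notifications_enabled   1
            check_freshness         1      ; actually enforce the threshold below
            freshness_threshold     86400  ; 24h with no passive result = stale
    }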