Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

2013-05-25 Thread C. Bensend

 On 2013-05-23 17:43, C. Bensend wrote:

 Hey folks,

 I recently made two major changes to my Nagios environment:

 1) I upgraded to v3.5.0.
 2) I moved from a single server to two pollers sending passive
 results to one central console server.

 Now, this new distributed system was in place for several months
 while I tested, and it worked fine.  HOWEVER, since this was running
 in parallel with my production system, notifications were disabled.
 Hence, I didn't see this problem until I cut over for real and
 enabled notifications.

 (please excuse any cut-n-paste ugliness, had to send this info from
 my work account via Outlook and then try to cleanse and reformat
 via Squirrelmail)

 As a test and to capture information, I rebooted 'hostname'.  This
 log is from the nagios-console host, which is the host that accepts
 the passive check results and sends notifications.  Here is the
 console host receiving a service check failure when the host is
 restarting:

 May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk
 queue;CRITICAL;SOFT;1;Connection refused by host


 So, the distributed poller system checks the host and sends its
 results to the console server:

 May 22 15:57:30 nagios-console nagios: HOST
 ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)


 And then the centralized server IMMEDIATELY goes into a hard state,
 which triggers a  notification:

 May 22 15:57:30 nagios-console nagios: HOST ALERT:
 hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
 May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
 cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
 Host Unreachable (a.b.c.d)


 Um.  Wat?  Why would the console immediately trigger a hard
 state? The config files don't support this decision.  And this
 IS a problem with the console server - the distributed monitors
 continue checking the host 6 times like they should.  But
 for some reason, the centralized console just immediately
 calls it a hard state.

*snip*



 Set passive_host_checks_are_soft=1 in nagios.cfg on your master
 server and things should start working as intended.

 --
 Andreas Ericsson   andreas.erics...@op5.se

Oh lord, THANK YOU.  That appears to have fixed that problem, which
was a pain in the ass.  In my defense, I *did* see that option, but
the way I interpreted the comments didn't quite match up with the
behavior I was seeing.  I should have experimented with it, I guess.
A slight adjustment to the comments would have thrown a red flag for
me - perhaps this is just a matter of personal interpretation, but
maybe the comments could be a bit more specific:


diff -uNp nagios-updated.cfg nagios.cfg
--- nagios-updated.cfg  Sat May 25 09:05:09 2013
+++ nagios.cfg  Sat May 25 09:02:37 2013
@@ -981,9 +981,9 @@ translate_passive_host_checks=0

 # PASSIVE HOST CHECKS ARE SOFT OPTION
 # This determines whether or not Nagios will treat passive host
-# checks as being HARD or SOFT.  By default, a single passive host
-# check result will put a host into an immediate HARD state type.
-# This can be changed by enabling this option.
+# checks as being HARD or SOFT.  By default, a passive host check
+# result will put a host into a HARD state type.  This can be changed
+# by enabling this option.
 # Values: 0 = passive checks are HARD, 1 = passive checks are SOFT

 passive_host_checks_are_soft=0


Does that make sense?  If I had read something like that, it would
have been immediately clear to me what was happening.
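
In case it helps anyone else who trips over this, the change on the
central (notification-sending) server really is just that one directive,
plus a reload.  Roughly what I did - paths are from my install and may
well differ on yours:

# in nagios.cfg on the central console server
passive_host_checks_are_soft=1

# sanity-check the config, then reload so it takes effect
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
/etc/init.d/nagios reload

The pollers shouldn't need the change since they run the active checks
themselves; it's only the server receiving the passive results that
cares about this option.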

Thank you so much, Andreas!  On to the next problem with the
upgrade (something that can wait until next week)...

Benny


-- 
The very existence of flamethrowers proves that sometime, somewhere,
someone said to themselves, 'You know, I want to set those people
over there on fire, but I'm just not close enough to get the job
done.'  -- George Carlin




Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

2013-05-25 Thread C. Bensend

 diff -uNp nagios-updated.cfg nagios.cfg
 --- nagios-updated.cfg  Sat May 25 09:05:09 2013
 +++ nagios.cfg  Sat May 25 09:02:37 2013
 @@ -981,9 +981,9 @@ translate_passive_host_checks=0

  # PASSIVE HOST CHECKS ARE SOFT OPTION
  # This determines whether or not Nagios will treat passive host
 -# checks as being HARD or SOFT.  By default, a single passive host
 -# check result will put a host into an immediate HARD state type.
 -# This can be changed by enabling this option.
 +# checks as being HARD or SOFT.  By default, a passive host check
 +# result will put a host into a HARD state type.  This can be changed
 +# by enabling this option.
  # Values: 0 = passive checks are HARD, 1 = passive checks are SOFT

  passive_host_checks_are_soft=0


 Does that make sense?  If I had read something like that, it would
 have been immediately clear to me what was happening.

 Thank you so much, Andreas!  On to the next problem with the
 upgrade (something that can wait until next week)...

Sorry, too little caffeine too early, got the files reversed.  Here's
the right diff:

diff -uNp nagios.cfg nagios-updated.cfg
--- nagios.cfg  Sat May 25 10:25:34 2013
+++ nagios-updated.cfg  Sat May 25 10:27:12 2013
@@ -981,9 +981,9 @@ translate_passive_host_checks=0

 # PASSIVE HOST CHECKS ARE SOFT OPTION
 # This determines whether or not Nagios will treat passive host
-# checks as being HARD or SOFT.  By default, a passive host check
-# result will put a host into a HARD state type.  This can be changed
-# by enabling this option.
+# checks as being HARD or SOFT.  By default, a single passive host
+# check result will put a host into an immediate HARD state type.
+# This can be changed by enabling this option.
 # Values: 0 = passive checks are HARD, 1 = passive checks are SOFT

 passive_host_checks_are_soft=0



-- 
The very existence of flamethrowers proves that sometime, somewhere,
someone said to themselves, 'You know, I want to set those people
over there on fire, but I'm just not close enough to get the job
done.'  -- George Carlin




Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

2013-05-24 Thread Andreas Ericsson
On 2013-05-23 17:43, C. Bensend wrote:

 Hey folks,

 I recently made two major changes to my Nagios environment:

 1) I upgraded to v3.5.0.
 2) I moved from a single server to two pollers sending passive
 results to one central console server.

 Now, this new distributed system was in place for several months
 while I tested, and it worked fine.  HOWEVER, since this was running
 in parallel with my production system, notifications were disabled.
 Hence, I didn't see this problem until I cut over for real and
 enabled notifications.

 (please excuse any cut-n-paste ugliness, had to send this info from
 my work account via Outlook and then try to cleanse and reformat
 via Squirrelmail)

  As a test and to capture information, I rebooted 'hostname'.  This
 log is from the nagios-console host, which is the host that accepts
 the passive check results and sends notifications.  Here is the
 console host receiving a service check failure when the host is
 restarting:

 May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk
 queue;CRITICAL;SOFT;1;Connection refused by host


 So, the distributed poller system checks the host and sends its
 results to the console server:

 May 22 15:57:30 nagios-console nagios: HOST
  ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)


 And then the centralized server IMMEDIATELY goes into a hard state,
 which triggers a  notification:

 May 22 15:57:30 nagios-console nagios: HOST ALERT:
 hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
 May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
 cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
 Host Unreachable (a.b.c.d)


 Um.  Wat?  Why would the console immediately trigger a hard
 state? The config files don't support this decision.  And this
 IS a problem with the console server - the distributed monitors
  continue checking the host 6 times like they should.  But
 for some reason, the centralized console just immediately
 calls it a hard state.

  Definitions on the distributed monitoring host (the one running
  the actual host and service checks for this host 'hostname'):

 define host {
   host_name             hostname
   alias                 Old production Nagios server
   address               a.b.c.d
   action_url            /pnp4nagios/graph?host=$HOSTNAME$
   icon_image_alt        Red Hat Linux
   icon_image            redhat.png
   statusmap_image       redhat.gd2
   check_command         check-host-alive
   check_period          24x7
   notification_period   24x7
   contact_groups        linux-infrastructure-admins
   use                   linux-host-template
 }

 The linux-host-template on that same system:

 define host {
   name                     linux-host-template
   register                 0
   max_check_attempts       6
   check_interval           5
   retry_interval           1
   notification_interval    360
   notification_options     d,r
   active_checks_enabled    1
   passive_checks_enabled   1
   notifications_enabled    1
   check_freshness          0
   check_period             24x7
   notification_period      24x7
   check_command            check-host-alive
   contact_groups           linux-infrastructure-admins
 }

 And said command to determine up or down:

 define command {
   command_name   check-host-alive
   command_line   $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 1.0,100% -p 5
 }


 Definitions on the centralized console host (the one that notifies):

 define host {
   host_name             hostname
   alias                 Old production Nagios server
   address               a.b.c.d
   action_url            /pnp4nagios/graph?host=$HOSTNAME$
   icon_image_alt        Red Hat Linux
   icon_image            redhat.png
   statusmap_image       redhat.gd2
   check_command         check-host-alive
   check_period          24x7
   notification_period   24x7
   contact_groups        linux-infrastructure-admins
   use                   linux-host-template,Default_monitor_server
 }

 The Default monitor server template on the centralized server:

 define host {
   name                     Default_monitor_server
   register                 0
   active_checks_enabled    0
   passive_checks_enabled   1
   notifications_enabled    1
   check_freshness          0
   freshness_threshold      86400
 }

 And the linux-host-template template on that same centralized host:

 define host {
   name                     linux-host-template
   register                 0
   max_check_attempts       6
   check_interval           5
   retry_interval           1
   notification_interval    360

[Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

2013-05-23 Thread C. Bensend

Hey folks,

   I recently made two major changes to my Nagios environment:

1) I upgraded to v3.5.0.
2) I moved from a single server to two pollers sending passive
   results to one central console server.

   Now, this new distributed system was in place for several months
while I tested, and it worked fine.  HOWEVER, since this was running
in parallel with my production system, notifications were disabled.
Hence, I didn't see this problem until I cut over for real and
enabled notifications.

(please excuse any cut-n-paste ugliness, had to send this info from
my work account via Outlook and then try to cleanse and reformat
via Squirrelmail)

   As a test and to capture information, I reboot 'hostname'.  This
log is from the nagios-console host, which is the host that accepts
the passive check results and sends notifications.  Here is the
console host receiving a service check failure when the host is
restarting:

May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk
queue;CRITICAL;SOFT;1;Connection refused by host


So, the distributed poller system checks the host and sends its
results to the console server:

May 22 15:57:30 nagios-console nagios: HOST
ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)


And then the centralized server IMMEDIATELY goes into a hard state,
which triggers a  notification:

May 22 15:57:30 nagios-console nagios: HOST ALERT:
hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
Host Unreachable (a.b.c.d)


   Um.  Wat?  Why would the console immediately trigger a hard
state? The config files don't support this decision.  And this
IS a problem with the console server - the distributed monitors
continue checking the host for 6 times like they should.  But
for some reason, the centralized console just immediately
calls it a hard state.

   Definitions on the distributed monitoring host (the one running
the actual host and service checks for this host 'hostname':

define host {
  host_name             hostname
  alias                 Old production Nagios server
  address               a.b.c.d
  action_url            /pnp4nagios/graph?host=$HOSTNAME$
  icon_image_alt        Red Hat Linux
  icon_image            redhat.png
  statusmap_image       redhat.gd2
  check_command         check-host-alive
  check_period          24x7
  notification_period   24x7
  contact_groups        linux-infrastructure-admins
  use                   linux-host-template
}

The linux-host-template on that same system:

define host {
  name                     linux-host-template
  register                 0
  max_check_attempts       6
  check_interval           5
  retry_interval           1
  notification_interval    360
  notification_options     d,r
  active_checks_enabled    1
  passive_checks_enabled   1
  notifications_enabled    1
  check_freshness          0
  check_period             24x7
  notification_period      24x7
  check_command            check-host-alive
  contact_groups           linux-infrastructure-admins
}

And said command to determine up or down:

define command {
  command_name   check-host-alive
  command_line   $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 1.0,100% -p 5
}
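
(With the macros filled in, that ends up running something like the line
below - on my boxes $USER1$ points at /usr/local/nagios/libexec, which
may well differ elsewhere.  A CRITICAL from check_ping is what turns
into the 'CRITICAL - Host Unreachable' output in the host alerts above.)

/usr/local/nagios/libexec/check_ping -H a.b.c.d -w 5000.0,80% -c 1.0,100% -p 5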


Definitions on the centralized console host (the one that notifies):

define host {
   host_name             hostname
   alias                 Old production Nagios server
   address               a.b.c.d
   action_url            /pnp4nagios/graph?host=$HOSTNAME$
   icon_image_alt        Red Hat Linux
   icon_image            redhat.png
   statusmap_image       redhat.gd2
   check_command         check-host-alive
   check_period          24x7
   notification_period   24x7
   contact_groups        linux-infrastructure-admins
   use                   linux-host-template,Default_monitor_server
}

The Default monitor server template on the centralized server:

define host {
   name                     Default_monitor_server
   register                 0
   active_checks_enabled    0
   passive_checks_enabled   1
   notifications_enabled    1
   check_freshness          0
   freshness_threshold      86400
}

And the linux-host-template template on that same centralized host:

define host {
   name                     linux-host-template
   register                 0
   max_check_attempts       6
   check_interval           5
   retry_interval           1
   notification_interval    360
   notification_options     d,r
   active_checks_enabled    1
   passive_checks_enabled   1
   notifications_enabled    1
   check_freshness          0
   check_period             24x7

Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

2013-05-23 Thread Doug Eubanks
I ran into a similar problem, because my template set the service to
is_volatile=1.

http://nagios.sourceforge.net/docs/3_0/volatileservices.html

Check to see if you have this flag enabled.
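
Something like this should show whether any of your object definitions
or templates set it (adjust the path to wherever your configs actually
live):

grep -rn is_volatile /usr/local/nagios/etc/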

Doug

Sincerely,
Doug Eubanks
ad...@dougware.net
K1DUG
(919) 201-8750


On Thu, May 23, 2013 at 11:43 AM, C. Bensend be...@bennyvision.com wrote:


 Hey folks,

I recently made two major changes to my Nagios environment:

 1) I upgraded to v3.5.0.
 2) I moved from a single server to two pollers sending passive
results to one central console server.

Now, this new distributed system was in place for several months
 while I tested, and it worked fine.  HOWEVER, since this was running
 in parallel with my production system, notifications were disabled.
 Hence, I didn't see this problem until I cut over for real and
 enabled notifications.

 (please excuse any cut-n-paste ugliness, had to send this info from
 my work account via Outlook and then try to cleanse and reformat
 via Squirrelmail)

As a test and to capture information, I reboot 'hostname'.  This
 log is from the nagios-console host, which is the host that accepts
 the passive check results and sends notifications.  Here is the
 console host receiving a service check failure when the host is
 restarting:

 May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk
 queue;CRITICAL;SOFT;1;Connection refused by host


 So, the distributed poller system checks the host and sends its
 results to the console server:

 May 22 15:57:30 nagios-console nagios: HOST
 ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)


 And then the centralized server IMMEDIATELY goes into a hard state,
 which triggers a  notification:

 May 22 15:57:30 nagios-console nagios: HOST ALERT:
 hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
 May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
 cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
 Host Unreachable (a.b.c.d)


Um.  Wat?  Why would the console immediately trigger a hard
 state? The config files don't support this decision.  And this
 IS a problem with the console server - the distributed monitors
 continue checking the host for 6 times like they should.  But
 for some reason, the centralized console just immediately
 calls it a hard state.

Definitions on the distributed monitoring host (the one running
 the actual host and service checks for this host 'hostname':

 define host {
   host_name             hostname
   alias                 Old production Nagios server
   address               a.b.c.d
   action_url            /pnp4nagios/graph?host=$HOSTNAME$
   icon_image_alt        Red Hat Linux
   icon_image            redhat.png
   statusmap_image       redhat.gd2
   check_command         check-host-alive
   check_period          24x7
   notification_period   24x7
   contact_groups        linux-infrastructure-admins
   use                   linux-host-template
 }

 The linux-host-template on that same system:

 define host {
   name                     linux-host-template
   register                 0
   max_check_attempts       6
   check_interval           5
   retry_interval           1
   notification_interval    360
   notification_options     d,r
   active_checks_enabled    1
   passive_checks_enabled   1
   notifications_enabled    1
   check_freshness          0
   check_period             24x7
   notification_period      24x7
   check_command            check-host-alive
   contact_groups           linux-infrastructure-admins
 }

 And said command to determine up or down:

 define command {
   command_name   check-host-alive
   command_line   $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 1.0,100% -p 5
 }


 Definitions on the centralized console host (the one that notifies):

 define host {
   host_name             hostname
   alias                 Old production Nagios server
   address               a.b.c.d
   action_url            /pnp4nagios/graph?host=$HOSTNAME$
   icon_image_alt        Red Hat Linux
   icon_image            redhat.png
   statusmap_image       redhat.gd2
   check_command         check-host-alive
   check_period          24x7
   notification_period   24x7
   contact_groups        linux-infrastructure-admins
   use                   linux-host-template,Default_monitor_server
 }

 The Default monitor server template on the centralized server:

 define host {
   name                     Default_monitor_server
   register                 0
   active_checks_enabled    0
   passive_checks_enabled   1
   notifications_enabled    1
   check_freshness          0
   freshness_threshold      86400
 }

 And the linux-host-template template on that same centralized host:

 

Re: [Nagios-users] Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

2013-05-23 Thread C. Bensend

 I ran into a similar problem, because my template set the service to
 is_volatile=1.

 http://nagios.sourceforge.net/docs/3_0/volatileservices.html

Hrmmm.  Good point...

However, is_volatile does not appear in any of my configuration
files, for any of the Nagios servers.  It isn't set by default,
is it?  The Nagios config.cgi page doesn't even list it, and
livestatus (what I use to query my running daemon) doesn't give
it as a column it can query.  I can't imagine it's on by default
in v3.5.0, but I can't really tell if it is or not.

I can try explicitly *disabling* it in all the service definitions,
but I can't
really test that at the moment - out of here for a long weekend
in a few minutes.  If it gets annoying enough over the weekend,
I might *have* to test that theory.
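
If I do end up testing it, the plan would just be to pin it off
explicitly in the service templates on the console box - something
along these lines (untested, and 'generic-service-template' here is
just a stand-in for whatever template the services actually inherit
from):

define service {
   name           generic-service-template   ; stand-in template name
   register       0
   is_volatile    0
}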

Thank you very much.  I will still appreciate any input others can
give on this question - it just doesn't seem to be behaving
as it's configured!

Benny


-- 
The very existence of flamethrowers proves that sometime, somewhere,
someone said to themselves, 'You know, I want to set those people
over there on fire, but I'm just not close enough to get the job
done.'  -- George Carlin

