On 2013-05-23 17:43, C. Bensend wrote:
Hey folks,
I recently made two major changes to my Nagios environment:
1) I upgraded to v3.5.0.
2) I moved from a single server to two pollers sending passive
results to one central console server.
Now, this new distributed system was in place for several months
while I tested, and it worked fine. HOWEVER, since this was running
in parallel with my production system, notifications were disabled.
Hence, I didn't see this problem until I cut over for real and
enabled notifications.
(please excuse any cut-n-paste ugliness, had to send this info from
my work account via Outlook and then try to cleanse and reformat
via Squirrelmail)
As a test and to capture information, I rebooted 'hostname'. This
log is from the nagios-console host, which is the host that accepts
the passive check results and sends notifications. Here is the
console host receiving a service check failure when the host is
restarting:
May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host
So, the distributed poller system checks the host and sends its
results to the console server:
May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
And then the centralized server IMMEDIATELY goes into a hard state,
which triggers a notification:
May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)
Um. Wat? Why would the console immediately trigger a hard
state? The config files don't support this decision. And this
IS a problem with the console server - the distributed monitors
continue checking the host 6 times like they should. But
for some reason, the centralized console just immediately
calls it a hard state.
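(Editor's note, not confirmed against this setup: Nagios 3.x treats passive host check results as HARD state changes by default, bypassing max_check_attempts entirely, which would explain exactly this behavior on the console. The knob that controls it lives in the main nagios.cfg, sketched here under that assumption:)

# nagios.cfg on the central console server. The default is 0, which
# makes a passive host check result go straight to a HARD state.
# Setting it to 1 makes passive host checks use SOFT states and
# honor max_check_attempts like active checks do.
passive_host_checks_are_soft=1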
Definitions on the distributed monitoring host (the one running
the actual host and service checks for this host 'hostname'):
define host {
    host_name           hostname
    alias               Old production Nagios server
    address             a.b.c.d
    action_url          /pnp4nagios/graph?host=$HOSTNAME$
    icon_image_alt      Red Hat Linux
    icon_image          redhat.png
    statusmap_image     redhat.gd2
    check_command       check-host-alive
    check_period        24x7
    notification_period 24x7
    contact_groups      linux-infrastructure-admins
    use                 linux-host-template
}
The linux-host-template on that same system:
define host {
    name                   linux-host-template
    register               0
    max_check_attempts     6
    check_interval         5
    retry_interval         1
    notification_interval  360
    notification_options   d,r
    active_checks_enabled  1
    passive_checks_enabled 1
    notifications_enabled  1
    check_freshness        0
    check_period           24x7
    notification_period    24x7
    check_command          check-host-alive
    contact_groups         linux-infrastructure-admins
}
And the command used to determine up or down:
define command {
    command_name check-host-alive
    command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 1.0,100% -p 5
}
Definitions on the centralized console host (the one that notifies):
define host {
    host_name           hostname
    alias               Old production Nagios server
    address             a.b.c.d
    action_url          /pnp4nagios/graph?host=$HOSTNAME$
    icon_image_alt      Red Hat Linux
    icon_image          redhat.png
    statusmap_image     redhat.gd2
    check_command       check-host-alive
    check_period        24x7
    notification_period 24x7
    contact_groups      linux-infrastructure-admins
    use                 linux-host-template,Default_monitor_server
}
The Default monitor server template on the centralized server:
define host {
    name                   Default_monitor_server
    register               0
    active_checks_enabled  0
    passive_checks_enabled 1
    notifications_enabled  1
    check_freshness        0
    freshness_threshold    86400
}
And the linux-host-template template on that same centralized host:
define host {
    name                  linux-host-template
    register              0
    max_check_attempts    6
    check_interval        5
    retry_interval        1
    notification_interval 360