[Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)

2011-06-24 Thread up
We have Nagios monitoring a variety of services on roughly 50 separate
servers.  Several of them are mail servers, but only the main one (which
holds most of the Nagios notification recipients) has this problem.

The mail server will start to become unresponsive to just about any input
(though it still pings fine).  Simultaneously, Nagios, which is on a separate
server, will send out notifications that every service on every server is
down because Nagios cannot reach them.  Since almost all of the notifications
go through this problem mail server, including those that forward to text
messaging services, they stop, and resume only when the mail server is either
rebooted or otherwise brought back to life...sometimes by restarting the LDAP
server process on it.

There are perhaps a few dozen total email destinations for notifications.
Even multiplying this by the total number of services that Nagios monitors,
it doesn't seem likely that the volume of email generated by Nagios alone
would cause all this.  It is a fairly modern, multiprocessor server
(CentOS/Sendmail).

Can anyone offer any insight or similar experiences?

Thanks in Advance!

--
All the data continuously generated in your IT infrastructure contains a 
definitive record of customers, application performance, security 
threats, fraudulent activity and more. Splunk takes this data and makes 
sense of it. Business sense. IT sense. Common sense.. 
http://p.sf.net/sfu/splunk-d2d-c1
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)

2011-06-24 Thread up
 Quoting u...@3.am:

 We have Nagios monitoring a variety of services on roughly 50 separate
 servers.  Several of them are mail servers, but only the main (that
 contains most of the Nagios notification recipients) one has this problem.

 The mail server will start to become unresponsive to just about any input
 (but pings fine).

 This is a mail server issue. You would need to determine exactly what
 process(es) have become unresponsive and why.

We're still trying to figure that out...but the question for this list
is why Nagios would go nuts.


 Simultaneously, Nagios, which is on a separate server, will send out
 notifications that every service on every server is down because Nagios
 cannot reach them.


 Why can't it reach them? Is your mail server also your router?

Good Gosh, no!  That's why this is so puzzling.

Thanks for your response.

 Terry

 Since almost all of them go through this problem mail server, including
 those that forward to text messaging services, they will stop and resume
 again when the mail server is either rebooted, or otherwise is brought
 back to life...sometimes by restarting the LDAP server process on it.

 There are perhaps a few dozen total email destinations for notifications.
 Even multiplying this times the total number of services that Nagios
 monitors, it doesn't seem likely that it's just volume of emails generated
 by Nagios would cause all this.  It is a fairly modern, multiprocessor
 server (CentOS/Sendmail).

 Can anyone offer any insight or similar experiences?

 Thanks in Advance!


 --
 Terry Carmen
 CNY Support, LLC
 Web. Database. Business.
 http://www.cnysupport.com



Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)

2011-06-24 Thread Allan Clark
On Fri, Jun 24, 2011 at 11:53,  u...@3.am wrote:
 Quoting u...@3.am:

 We have Nagios monitoring a variety of services on roughly 50 separate
 servers.  Several of them are mail servers, but only the main (that
 contains most of the Nagios notification recipients) one has this problem.

 The mail server will start to become unresponsive to just about any input
 (but pings fine).

 This is a mail server issue. You would need to determine exactly what
 process(es) have become unresponsive and why.

 We're still trying to figure that out...but the question for this list
 is why Nagios would go nuts.

Do you have any staleness (freshness) checks on the tests that go nuts?

Is it possible to make many of the sendmail tests (e.g. if you're
checking mqueue) dependent on another test (such as "is it responding
on port 25?") so that when sendmail gets strange, at least many of the
tests are then skipped?
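
For reference, a minimal sketch in Python of the kind of "is it responding
on port 25?" test described above. It assumes the server sends the usual
SMTP 220 greeting on connect; the function names and the 5-second timeout
are illustrative and not taken from any Nagios plugin. A host that answers
ping but never sends a greeting would be reported as failed here.

```python
import socket


def is_smtp_greeting(banner):
    """True if the first line the server sent is an SMTP 220 greeting."""
    return banner.strip().startswith("220")


def smtp_responds(host, port=25, timeout=5.0):
    """Connect and wait for the greeting; return (ok, detail).

    Hypothetical helper for illustration. A TCP connect alone is not
    enough: a hung server may accept the connection and then say nothing.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(512).decode("ascii", errors="replace")
    except OSError as exc:  # refused, unreachable, or timed out
        return False, "no greeting: %s" % exc
    if is_smtp_greeting(banner):
        return True, banner.strip()
    return False, "unexpected greeting: %r" % banner.strip()
```

Calling smtp_responds() with the mail server's host name returns an
(ok, detail) pair, so the result can be logged alongside the ping status.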


 Simultaneously, Nagios, which is on a separate server, will send out
 notifications that every service on every server is down because Nagios
 cannot reach them.


 Why can't it reach them? Is your mail server also your router?

 Good Gosh, no!  That's why this is so puzzling.

Re: staleness above: can you watch your Nagios log, perhaps filtering
it through awk to add a readable timestamp to each entry, and just leave
that scrolling on a terminal? Then, when things get strange and Nagios
goes nuts, you can see whether Nagios is at least running the tests and
getting responses.
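
A small sketch of that log filter, written in Python instead of awk. It
relies on Nagios log entries starting with a Unix timestamp in square
brackets; the function names are made up for illustration.

```python
import re
import time

# Nagios log entries begin with a Unix timestamp in square brackets.
EPOCH_PREFIX = re.compile(r"^\[(\d+)\] ?")


def add_readable_time(line, utc=False):
    """Replace a leading "[epoch]" with "[YYYY-MM-DD HH:MM:SS]"."""
    match = EPOCH_PREFIX.match(line)
    if match is None:
        return line  # leave lines without a timestamp untouched
    to_struct = time.gmtime if utc else time.localtime
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", to_struct(int(match.group(1))))
    return "[%s] %s" % (stamp, line[match.end():])


def annotate(stream, utc=False):
    """Yield each line of an open log stream with a readable timestamp."""
    for line in stream:
        yield add_readable_time(line, utc=utc)
```

Piping `tail -f` on the Nagios log into a script that prints each line
from annotate(sys.stdin) gives the scrolling view; the log's location
depends on the install.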

You mention LDAP; is your sendmail server also your LDAP server, and
is the Nagios host also using LDAP to resolve basic OS lookups such as
UIDs?
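
One way to test the LDAP angle from the Nagios host, sketched in Python:
time a plain account lookup. This assumes a Unix host; whether the lookup
actually goes to LDAP depends on the host's name service configuration,
which hasn't been stated in this thread. The function name is made up for
illustration.

```python
import os
import pwd
import time


def time_uid_lookup(uid=None):
    """Return (seconds, username) for one passwd lookup of the given UID.

    The lookup goes through the system's configured name service, so it
    is slow only if one of the configured sources is slow to answer.
    """
    if uid is None:
        uid = os.getuid()
    start = time.monotonic()
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:  # no account with that UID
        name = None
    return time.monotonic() - start, name
```

If lookups that normally return in milliseconds take seconds while the
mail server is hung, plugin runs that need to resolve a user could be
stalling as well, which would be worth ruling in or out.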

Allan
-- 
all...@chickenandporn.com  金鱼 http://linkedin.com/in/goldfish


Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)

2011-06-24 Thread up
 On Fri, Jun 24, 2011 at 11:53,  u...@3.am wrote:
 Quoting u...@3.am:

 We have Nagios monitoring a variety of services on roughly 50 separate
 servers.  Several of them are mail servers, but only the main (that
 contains most of the Nagios notification recipients) one has this problem.

 The mail server will start to become unresponsive to just about any input
 (but pings fine).

 This is a mail server issue. You would need to determine exactly what
 process(es) have become unresponsive and why.

 We're still trying to figure that out...but the question for this list
 is why Nagios would go nuts.

 Do you have any staleness stuff on the tests that go nuts?

 Is it possible to place many of the sendmail tests (ie if you're
 checking mqueue) as dependencies of another test (such as is it
 responding to port 25?) so that when the sendmail gets strange, at
 least many of the tests are then skipped?

The only sendmail-specific test we use in Nagios is the simple SMTP test.

 Simultaneously, Nagios, which is on a separate server, will send out
 notifications that every service on every server is down because Nagios
 cannot reach them.


 Why can't it reach them? Is your mail server also your router?

 Good Gosh, no!  That's why this is so puzzling.

 re: staleness above: can you watch your Nagios log, perhaps filtering
 it through awk to add a timestamp to each entry, just spool that on a
 terminal, and when things get strange and Nagios goes nuts, is Nagios
 at least running the tests and getting responses?

I'll try to grok something out of the Nagios logs, but this is one of those
problems that happens every few days, so it's hard to monitor it constantly
(monitor the monitoring software?!).

 You mention LDAP; is your sendmail server also your LDAP server, and
 is the Nagios host also using LDAP to resolve basic OS features like
 UID?

Yes, it is the LDAP server, but it is not used for DNS...it is only used for
user authentication.





Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)

2011-06-24 Thread Terry Carmen

 Simultaneously, Nagios, which is on a separate server, will send out
 notifications that every service on every server is down because Nagios
 cannot reach them.


 Why can't it reach them? Is your mail server also your router?

 Good Gosh, no!  That's why this is so puzzling.

The next time it happens, unplug your mail server's network connection (it
has failed anyway). I'll bet it's flooding the network with (good?/bad?)
packets and Nagios can't get through.

If taking the mail server offline fixes it, at least you know where to look.

I'd also check the mail server logs. Some mail servers aren't too bright
about handling bounces, and if yours is misconfigured, you can end up with an
endless stream of bounce messages for the bounce messages.
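
A rough way to look for such a loop in the logs, sketched in Python. It
assumes syslog-style sendmail log lines (on CentOS usually
/var/log/maillog) in which bounce messages show the null sender as
from=<>; the function name is illustrative.

```python
from collections import Counter


def bounces_per_minute(lines):
    """Count log lines with a null envelope sender, grouped by minute.

    Assumes each line starts with a syslog timestamp such as
    "Jun 24 11:53:01", so the first 12 characters identify the minute.
    """
    counts = Counter()
    for line in lines:
        if "from=<>" in line:
            counts[line[:12]] += 1
    return counts
```

Passing an open log file to bounces_per_minute() and printing
most_common(10) on the result shows the busiest minutes. A sudden run of
minutes with very high counts would be consistent with a bounce loop,
though it doesn't prove one.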

Terry

Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)

2011-06-24 Thread up

 Simultaneously, Nagios, which is on a separate server, will send out
 notifications that every service on every server is down because Nagios
 cannot reach them.


 Why can't it reach them? Is your mail server also your router?

 Good Gosh, no!  That's why this is so puzzling.

 The next time it happens, unplug your mail server's network connection (it
 failed anyway). I'll bet it's flooding the network with (good?/bad?) packets
 and nagios can't get through.

We've got good switches (newer Catalysts) and we're not seeing other servers on
the same VLAN or switch affected.

 If taking the mailserver offline fixes it, at least you know where to look.

 I'd also check the mailserver logs. Some aren't too bright about handling
 bounces and if it's misconfigured, you can end up with an infinite number of
 bounce messages for the bounce messages.

Looking for mail loops sounds like a reasonable start.  I'm not as used to
sendmail as I am qmail, which seems to handle preventing loops a little better,
AFAICT.  I posted to the list to rule out a known issue with nagios, which it
looks like isn't the problem.

Thanks again!

