[Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)
We have Nagios monitoring a variety of services on roughly 50 separate servers. Several of them are mail servers, but only the main one (which hosts most of the Nagios notification recipients) has this problem. The mail server starts to become unresponsive to just about any input (but it pings fine). Simultaneously, Nagios, which is on a separate server, sends out notifications that every service on every server is down because Nagios cannot reach them.

Since almost all of the notifications go through this problem mail server, including those that forward to text messaging services, they stop, then resume when the mail server is either rebooted or otherwise brought back to life, sometimes by restarting the LDAP server process on it.

There are perhaps a few dozen total email destinations for notifications. Even multiplying this by the total number of services that Nagios monitors, it doesn't seem likely that the volume of email generated by Nagios alone would cause all this. It is a fairly modern, multiprocessor server (CentOS/Sendmail).

Can anyone offer any insight or similar experiences?

Thanks in Advance!

--
All the data continuously generated in your IT infrastructure contains a definitive record of customers, application performance, security threats, fraudulent activity and more. Splunk takes this data and makes sense of it. Business sense. IT sense. Common sense.
http://p.sf.net/sfu/splunk-d2d-c1
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)
> Quoting u...@3.am:
>> We have Nagios monitoring a variety of services on roughly 50 separate servers. Several of them are mail servers, but only the main (that contains most of the Nagios notification recipients) one has this problem. The mail server will start to become unresponsive to just about any input (but pings fine).
>
> This is a mail server issue. You would need to determine exactly what process(es) have become unresponsive and why.

We're still trying to figure that out...but the question for this list is why Nagios would go nuts.

>> Simultaneously, Nagios, which is on a separate server, will send out notifications that every service on every server is down because Nagios cannot reach them.
>
> Why can't it reach them? Is your mail server also your router?

Good Gosh, no! That's why this is so puzzling. Thanks for your response.

> Terry
> --
> Terry Carmen
> CNY Support, LLC
> Web. Database. Business.
> http://www.cnysupport.com
Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)
On Fri, Jun 24, 2011 at 11:53, u...@3.am wrote:
>> Quoting u...@3.am:
>>> We have Nagios monitoring a variety of services on roughly 50 separate servers. Several of them are mail servers, but only the main (that contains most of the Nagios notification recipients) one has this problem. The mail server will start to become unresponsive to just about any input (but pings fine).
>>
>> This is a mail server issue. You would need to determine exactly what process(es) have become unresponsive and why.
>
> We're still trying to figure that out...but the question for this list is why Nagios would go nuts.

Do you have any staleness checking on the tests that go nuts? Is it possible to make many of the sendmail tests (i.e., if you're checking mqueue) depend on another test (such as "is it responding on port 25?") so that when sendmail gets strange, at least many of the tests are skipped?

>>> Simultaneously, Nagios, which is on a separate server, will send out notifications that every service on every server is down because Nagios cannot reach them.
>>
>> Why can't it reach them? Is your mail server also your router?
>
> Good Gosh, no! That's why this is so puzzling.

Re: staleness above: can you watch your Nagios log, perhaps filtering it through awk to add a timestamp to each entry, and just spool that on a terminal? When things get strange and Nagios goes nuts, is Nagios at least running the tests and getting responses?

You mention LDAP; is your sendmail server also your LDAP server, and is the Nagios host also using LDAP to resolve basic OS features like UID?

Allan
--
all...@chickenandporn.com
金鱼
http://linkedin.com/in/goldfish
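The log-timestamping idea above can also be sketched in Python instead of awk. Nagios prefixes each log entry with a Unix epoch in square brackets; assuming that format, the sketch below rewrites the prefix as local time so entries can be lined up against the moment the mail server hangs. The script name `nagios_ts.py`, the function name `humanize`, and the log path in the comment are illustrative assumptions, not anything taken from the thread.

```python
import re
import sys
from datetime import datetime

# Nagios log entries begin with a Unix epoch in square brackets, e.g.
# "[1308930780] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused"
EPOCH_PREFIX = re.compile(r"^\[(\d+)\]")


def humanize(line: str) -> str:
    """Replace a leading [epoch] with a local-time [YYYY-MM-DD HH:MM:SS]."""
    match = EPOCH_PREFIX.match(line)
    if match is None:
        return line  # leave lines without an epoch prefix untouched
    stamp = datetime.fromtimestamp(int(match.group(1)))
    return f"[{stamp:%Y-%m-%d %H:%M:%S}]" + line[match.end():]


if __name__ == "__main__" and len(sys.argv) == 2:
    # Pass a log file path, or "-" to read from a pipe, for example
    # (the log path is an assumption; adjust it to the local install):
    #   tail -f /var/log/nagios/nagios.log | python3 nagios_ts.py -
    source = sys.stdin if sys.argv[1] == "-" else open(sys.argv[1], errors="replace")
    for raw in source:
        sys.stdout.write(humanize(raw))
        sys.stdout.flush()
```

Left running in a terminal, this shows whether Nagios keeps executing checks and logging results during an incident, or whether the log simply goes quiet.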
Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)
> On Fri, Jun 24, 2011 at 11:53, u...@3.am wrote:
>>> Quoting u...@3.am:
>>>> We have Nagios monitoring a variety of services on roughly 50 separate servers. Several of them are mail servers, but only the main (that contains most of the Nagios notification recipients) one has this problem. The mail server will start to become unresponsive to just about any input (but pings fine).
>>>
>>> This is a mail server issue. You would need to determine exactly what process(es) have become unresponsive and why.
>>
>> We're still trying to figure that out...but the question for this list is why Nagios would go nuts.
>
> Do you have any staleness stuff on the tests that go nuts? Is it possible to place many of the sendmail tests (ie if you're checking mqueue) as dependencies of another test (such as is it responding to port 25?) so that when the sendmail gets strange, at least many of the tests are then skipped?

The only sendmail-specific test we use in Nagios is the simple SMTP test.

>>>> Simultaneously, Nagios, which is on a separate server, will send out notifications that every service on every server is down because Nagios cannot reach them.
>>>
>>> Why can't it reach them? Is your mail server also your router?
>>
>> Good Gosh, no! That's why this is so puzzling.
>
> re: staleness above: can you watch your Nagios log, perhaps filtering it through awk to add a timestamp to each entry, just spool that on a terminal, and when things get strange and Nagios goes nuts, is Nagios at least running the tests and getting responses?

I'll try to grok something out of the Nagios logs, but this is one of those problems that happens every few days, so it's hard to monitor it constantly (monitor the monitoring software?!).

> You mention LDAP; is your sendmail server also your LDAP server, and is the Nagios host also using LDAP to resolve basic OS features like UID?

Yes, it is the LDAP server, but it is not used for DNS...it is only used for user authentication.
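Whether the Nagios host resolves users or groups through LDAP can be checked on a Linux system by reading /etc/nsswitch.conf, which lists the lookup sources for databases such as passwd, group and hosts. The following is a minimal sketch, assuming the usual `database: source ...` layout of that file; the function name `ldap_backed_databases` is hypothetical.

```python
from typing import Dict, List


def ldap_backed_databases(nsswitch_text: str) -> Dict[str, List[str]]:
    """Return the NSS databases whose source list includes 'ldap'."""
    found: Dict[str, List[str]] = {}
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        database, sources = line.split(":", 1)
        tokens = sources.split()
        if "ldap" in tokens:
            found[database.strip()] = tokens
    return found


if __name__ == "__main__":
    try:
        with open("/etc/nsswitch.conf") as conf:
            hits = ldap_backed_databases(conf.read())
    except OSError as exc:
        hits = {}
        print(f"could not read /etc/nsswitch.conf: {exc}")
    for database, tokens in hits.items():
        print(f"{database}: {' '.join(tokens)}")
```

If passwd or group shows up in the output on the Nagios host, user lookups there depend on the LDAP server, which would be one possible link between the two machines worth ruling out.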
Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)
>>> Simultaneously, Nagios, which is on a separate server, will send out notifications that every service on every server is down because Nagios cannot reach them.
>>
>> Why can't it reach them? Is your mail server also your router?
>
> Good Gosh, no! That's why this is so puzzling.

The next time it happens, unplug your mail server's network connection (it has failed anyway). I'll bet it's flooding the network with (good?/bad?) packets and Nagios can't get through.

If taking the mailserver offline fixes it, at least you know where to look. I'd also check the mailserver logs. Some aren't too bright about handling bounces, and if it's misconfigured, you can end up with an infinite number of bounce messages for the bounce messages.

Terry
Re: [Nagios-users] Separate mail server problems cause Nagios to plotz (or vice versa?)
>>>> Simultaneously, Nagios, which is on a separate server, will send out notifications that every service on every server is down because Nagios cannot reach them.
>>>
>>> Why can't it reach them? Is your mail server also your router?
>>
>> Good Gosh, no! That's why this is so puzzling.
>
> The next time it happens, unplug your mail server's network connection (it has failed anyway). I'll bet it's flooding the network with (good?/bad?) packets and Nagios can't get through.

We've got good switches (newer Catalysts) and we're not seeing other servers on the same VLAN or switch affected.

> If taking the mailserver offline fixes it, at least you know where to look. I'd also check the mailserver logs. Some aren't too bright about handling bounces, and if it's misconfigured, you can end up with an infinite number of bounce messages for the bounce messages.

Looking for mail loops sounds like a reasonable start. I'm not as used to sendmail as I am to qmail, which seems to handle preventing loops a little better, AFAICT. I posted to the list to rule out a known issue with Nagios, which it looks like isn't the problem.

Thanks again!
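One rough way to look for the bounce loop discussed above is to count, per hour, the sendmail log entries whose envelope sender is empty (`from=<>`), since delivery status notifications are sent with a null sender. This is only a sketch under assumptions: it expects syslog-style lines beginning with a `Mon DD HH:MM:SS` timestamp, and the function name `null_sender_counts` and the idea of passing the log path as an argument are illustrative, not from the thread.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Bounces (DSNs) are sent with an empty envelope sender, logged as from=<>
NULL_SENDER = re.compile(r"\bfrom=<>")
# Syslog-style prefix such as "Jun 24 11:53:01"; capture up to the hour
HOUR_PREFIX = re.compile(r"^(\w{3}\s+\d+\s+\d{2}):")


def null_sender_counts(lines: Iterable[str]) -> Counter:
    """Count null-sender log lines, keyed by 'Mon DD HH'."""
    counts: Counter = Counter()
    for line in lines:
        if NULL_SENDER.search(line):
            match = HOUR_PREFIX.match(line)
            counts[match.group(1) if match else "unknown"] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    # Example: python3 bounce_count.py /var/log/maillog  (path is an assumption)
    with open(sys.argv[1], errors="replace") as log:
        for hour, total in sorted(null_sender_counts(log).items()):
            print(f"{hour}:00  {total}")
```

A sharp rise in null-sender traffic in the hour before the server stops responding would be consistent with a loop; a flat count would point elsewhere.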