Re: [Nagios-users] Problems with FreeBSD and Nagios
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:nagios-users-[EMAIL PROTECTED]] On Behalf Of Douglas K. Rand
Sent: Tuesday, June 19, 2007 3:16 PM
To: Kyle Sexton
Cc: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] Problems with FreeBSD and Nagios

Doug> The following entry in /etc/libmap.conf has, for us, solved the issue
Doug> of runaway Nagios processes.
Doug>
Doug>   [nagios]
Doug>   libpthread.so.2 libthr.so.2
Doug>   libpthread.so   libthr.so
Doug>
Doug> This is on FreeBSD 6.2.

Kyle> Was there a recompile or anything necessary?

No. You do have to stop and restart the nagios process after the edit. A restart via the web interface is not sufficient. libmap.conf is a runtime configuration.

It's been about 24 hours since I implemented this dependency mapping on one of my more heavily used FreeBSD 6.2/Nagios 2.9 servers. I have not had any problems with child processes, and my load average actually dropped from around 7.5 to 4. I'll give it a week or two before I declare it a complete success, but it has been great so far!

Jonathan

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Problems with FreeBSD and Nagios
On Mon, Jun 18, 2007 at 06:42:18PM -0500, Kyle Sexton wrote:
> On 12/14/06, Andreas Ericsson [EMAIL PROTECTED] wrote:
> > Jonathan Call wrote:
> > > Given your ideas and some google work I seem to have found my problem:
> > > http://lists.freebsd.org/pipermail/freebsd-hackers/2005-August/013247.html
> > > Not a pretty discussion. :(
> >
> > Nope. Definitely not. The problem for Nagios is that threading was added after the fact, so Nagios actually breaks some of the *strong* recommendations on what to do and what not to do in a threaded application after a fork(). The problem for *BSD and its implementation of the thread library is that Nagios actually works everywhere but on *BSD, and it *often* works there too, but not always. This often-but-not-always is usually a sign of a broken implementation, although often-but-not-always is also exactly the kind of error you'll run into when you do what Nagios does post-fork().
> >
> > I don't know of any other program that has the same problem on *BSD, but it would be interesting to see if there's a common pattern so one can pinpoint exactly what causes the lock contention and races. From a practical point of view it would be best to patch it in the library, as that fix would work for all possible future problems as well, although it's technically more correct to fix it in Nagios.
> >
> > Ugly discussion indeed.
> >
> > > I'll try using a non-SMP kernel to see if it might help. If it doesn't, this pretty much renders Nagios useless on FreeBSD. (Which makes me wonder why they even bother maintaining it in ports?)
> >
> > Out of curiosity, do you use passive checks, active checks or a mix of both in your setup?
>
> Was there ever a solution found to this problem?

Skimming the (long) discussion thread, my first thought is to try libthr instead of libkse. The discussion seems to be on 5.x; I'd definitely try libthr on 6.x. Check libmap.conf for details.

==ml

--
Michael W. Lucas [EMAIL PROTECTED], [EMAIL PROTECTED]
http://www.BlackHelicopters.org/~mwlucas/
Coming Soon: Absolute FreeBSD -- http://www.AbsoluteFreeBSD.com
On 5/4/2007, the TSA kept 3 pairs of my soiled undies for security reasons.
Re: [Nagios-users] Problems with FreeBSD and Nagios
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:nagios-users-[EMAIL PROTECTED]] On Behalf Of Michael W. Lucas
Sent: Tuesday, June 19, 2007 5:16 AM
To: Kyle Sexton
Cc: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] Problems with FreeBSD and Nagios

> On Mon, Jun 18, 2007 at 06:42:18PM -0500, Kyle Sexton wrote:
> > On 12/14/06, Andreas Ericsson [EMAIL PROTECTED] wrote:
> > > Jonathan Call wrote:
> > > > Given your ideas and some google work I seem to have found my problem:
> > > > http://lists.freebsd.org/pipermail/freebsd-hackers/2005-August/013247.html
> > > > Not a pretty discussion. :(
> > >
> > > Nope. Definitely not. The problem for Nagios is that threading was added after the fact, so Nagios actually breaks some of the *strong* recommendations on what to do and what not to do in a threaded application after a fork(). The problem for *BSD and its implementation of the thread library is that Nagios actually works everywhere but on *BSD, and it *often* works there too, but not always. This often-but-not-always is usually a sign of a broken implementation, although often-but-not-always is also exactly the kind of error you'll run into when you do what Nagios does post-fork().
> > >
> > > I don't know of any other program that has the same problem on *BSD, but it would be interesting to see if there's a common pattern so one can pinpoint exactly what causes the lock contention and races. From a practical point of view it would be best to patch it in the library, as that fix would work for all possible future problems as well, although it's technically more correct to fix it in Nagios.
> > >
> > > Ugly discussion indeed.
> > >
> > > > I'll try using a non-SMP kernel to see if it might help. If it doesn't, this pretty much renders Nagios useless on FreeBSD. (Which makes me wonder why they even bother maintaining it in ports?)
> > >
> > > Out of curiosity, do you use passive checks, active checks or a mix of both in your setup?
> >
> > Was there ever a solution found to this problem?

No. I was forced to implement a distributed model and limit the service checks to fewer than 1000 per server. Even then I still have to run a cron job that checks for nagios children that are spinning on the CPU as a result of this fork issue. I've found that somewhere beyond 1500 service checks there will be a random weekly event that causes almost a hundred nagios checks to hit this fork issue all at the same time and promptly tank the FreeBSD server.

> Skimming the (long) discussion thread, my first thought is to try libthr instead of libkse. The discussion seems to be on 5.x; I'd definitely try libthr on 6.x. Check libmap.conf for details.

Are you referring to this type of mapping within /etc/libmap.conf?

  [/usr/local/bin/nagios]
  libpthread.so.2 libthr.so.2
  libpthread.so   libthr.so

If so, I'd be willing to try it on my FreeBSD 6.2 server.

Jonathan
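The cron job Jonathan mentions is not shown anywhere in the thread. As a rough idea of what such a watchdog might look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names are made up, the 25% CPU threshold is arbitrary, and it expects input in the shape produced by `ps -axo pid,pcpu,comm`. It only reports suspect PIDs; whether to kill them is left to the operator.

```python
import subprocess
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    cpu_percent: float
    command: str


def parse_ps(output: str) -> List[Proc]:
    """Parse text shaped like `ps -axo pid,pcpu,comm` output (header line first)."""
    procs = []
    for line in output.strip().splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            procs.append(Proc(int(fields[0]), float(fields[1]), fields[2].strip()))
        except ValueError:
            continue  # skip lines whose pid or %cpu is not numeric
    return procs


def spinning_children(procs: List[Proc], main_pid: int,
                      cpu_threshold: float = 25.0) -> List[Proc]:
    """Return nagios processes other than the main daemon that are burning CPU.

    The threshold is a hypothetical cut-off, not a value taken from the thread.
    """
    return [p for p in procs
            if p.command == "nagios"
            and p.pid != main_pid
            and p.cpu_percent >= cpu_threshold]


def list_processes() -> List[Proc]:
    """Run ps and parse its output (not called automatically)."""
    result = subprocess.run(["ps", "-axo", "pid,pcpu,comm"],
                            capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)
```

A cron wrapper could call `spinning_children(list_processes(), main_pid)` with the PID read from the Nagios lock file and log or signal whatever comes back. A stricter version would probably also look at accumulated CPU time, since a healthy check child should not live much longer than the check timeout.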
Re: [Nagios-users] Problems with FreeBSD and Nagios
Michael> Skimming the (long) discussion thread, my first thought is to
Michael> try libthr instead of libkse. The discussion seems to be on
Michael> 5.x, I'd definitely try libthr on 6.x. Check libmap.conf for
Michael> details.

The following entry in /etc/libmap.conf has, for us, solved the issue of runaway Nagios processes.

  [nagios]
  libpthread.so.2 libthr.so.2
  libpthread.so   libthr.so

This is on FreeBSD 6.2.
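For anyone who wants to check that the mapping really is present before restarting Nagios, a small Python sketch along these lines might help. It is a simplified reading of the libmap.conf layout shown in this thread (bracketed constraint lines followed by two-column mappings, with `#` comments), not a full implementation of the file format. The function names are made up for illustration, and the default constraints are simply the two spellings that appear in this thread: `[nagios]` and `[/usr/local/bin/nagios]`.

```python
from typing import Dict, Tuple


def parse_libmap(text: str) -> Dict[str, Dict[str, str]]:
    """Return {constraint: {original_lib: replacement_lib}}.

    Mappings that appear before any [constraint] line are stored under "".
    """
    sections: Dict[str, Dict[str, str]] = {"": {}}
    current = ""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            sections.setdefault(current, {})
            continue
        parts = line.split()
        if len(parts) == 2:  # "original replacement" pair
            sections[current][parts[0]] = parts[1]
    return sections


def nagios_uses_libthr(
    text: str,
    constraints: Tuple[str, ...] = ("nagios", "/usr/local/bin/nagios"),
) -> bool:
    """True if any of the given constraints maps libpthread.so.2 to libthr.so.2."""
    sections = parse_libmap(text)
    return any(
        sections.get(name, {}).get("libpthread.so.2") == "libthr.so.2"
        for name in constraints
    )
```

Calling `nagios_uses_libthr(open("/etc/libmap.conf").read())` on the monitoring host would then confirm the entry is in place. It says nothing about whether the running daemon has picked it up; as noted elsewhere in the thread, that needs a full stop and start of the nagios process.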
Re: [Nagios-users] Problems with FreeBSD and Nagios
On 19 Jun 2007 13:37:15 -0500, Douglas K. Rand [EMAIL PROTECTED] wrote:
> Michael> Skimming the (long) discussion thread, my first thought is to
> Michael> try libthr instead of libkse. The discussion seems to be on
> Michael> 5.x, I'd definitely try libthr on 6.x. Check libmap.conf for
> Michael> details.
>
> The following entry in /etc/libmap.conf has, for us, solved the issue of runaway Nagios processes.
>
>   [nagios]
>   libpthread.so.2 libthr.so.2
>   libpthread.so   libthr.so
>
> This is on FreeBSD 6.2.

Was there a recompile or anything necessary?

--
Kyle Sexton
Re: [Nagios-users] Problems with FreeBSD and Nagios
On 12/14/06, Andreas Ericsson [EMAIL PROTECTED] wrote:
> Jonathan Call wrote:
> > Given your ideas and some google work I seem to have found my problem:
> > http://lists.freebsd.org/pipermail/freebsd-hackers/2005-August/013247.html
> > Not a pretty discussion. :(
>
> Nope. Definitely not. The problem for Nagios is that threading was added after the fact, so Nagios actually breaks some of the *strong* recommendations on what to do and what not to do in a threaded application after a fork(). The problem for *BSD and its implementation of the thread library is that Nagios actually works everywhere but on *BSD, and it *often* works there too, but not always. This often-but-not-always is usually a sign of a broken implementation, although often-but-not-always is also exactly the kind of error you'll run into when you do what Nagios does post-fork().
>
> I don't know of any other program that has the same problem on *BSD, but it would be interesting to see if there's a common pattern so one can pinpoint exactly what causes the lock contention and races. From a practical point of view it would be best to patch it in the library, as that fix would work for all possible future problems as well, although it's technically more correct to fix it in Nagios.
>
> Ugly discussion indeed.
>
> > I'll try using a non-SMP kernel to see if it might help. If it doesn't, this pretty much renders Nagios useless on FreeBSD. (Which makes me wonder why they even bother maintaining it in ports?)
>
> Out of curiosity, do you use passive checks, active checks or a mix of both in your setup?
>
> --
> Andreas Ericsson [EMAIL PROTECTED]
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231
>
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV

Was there ever a solution found to this problem?

--
Kyle Sexton
Re: [Nagios-users] Problems with FreeBSD and Nagios
Jonathan Call wrote:
> I scanned the mailing list trying to find a solution for this. I found a brief discussion where someone had the same problem, but nothing was really said about what might be wrong.
>
> My system:
>   Dual 2.8GHz P4 processors
>   4GB of RAM
>   FreeBSD 6.1-RELEASE-p10
>
> Running processes:
>   Nagios 2.6 (installed from ports without embedded perl or nanosleep)
>   One mysqld process for the nagiosweb utility
>   A few NSCA daemon processes for passive checking
>   A backup tool daemon
>   Apache+modssl (latest from ports)
>   Basic FreeBSD services (sshd, sendmail, etc.)
>
> Problem:
> Random service and host check control processes will lock up and 'spin' on the CPU. This is really bad when a host check does it because it brings all checks to a halt. Nagios doesn't even seem to notice that all checks have gone stale. It will look like this in top:
>
>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> 94068 nagios     1 116    0  7500K  6748K CPU2   0 727:37 30.15% nagios
> 94082 nagios     1 116    0  7500K  6748K CPU2   0 734:28 32.55% nagios
> 94104 nagios     1 116    0  7500K  6748K CPU2   0 845:21 37.42% nagios
> 75338 nagios     5  20    0  7500K  6776K kserel 0  91:33  0.00% nagios
>
> In this example the main nagios pid is 75338. The hung service and/or host processes are the other ones. The service checks are almost entirely custom scripts, but the host check is a standard check_ping that comes with the nagios program.
>
> Any ideas on how to figure out which service or host check is hung? Or how to deal with this problem at all?

Host and service checks going into infinite loops wouldn't show up as Nagios processes in a CPU spinlock, as the Nagios check execution children just sit around and wait for the child to finish (or for 60 seconds to pass in the default config, after which they kill it off).

You've found a bug in Nagios which most likely was either introduced in the port of it, or is a result of library differences between FreeBSD and Linux. I wouldn't be all too surprised if it turns out that the FreeBSD pthread implementation disallows something that the Linux version allows. Note that this doesn't necessarily have to be a bug; Nagios doesn't use the pthread ABI in a way that is explicitly stated as safe, but the pthread implementations on Linux and most other unices are forgiving enough to make it work anyway.

It's also possible that this bug only triggers on dual-CPU systems with a particular library installed, as some kinds of timing problems and race conditions just don't happen on single-CPU systems.

What happens if you do

  $ gdb --pid=$(pidof spinning-nagios-process)
  (gdb) bt

?

--
Andreas Ericsson [EMAIL PROTECTED]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
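The backtrace Andreas asks for can also be collected non-interactively, which is handy when several children are spinning at once. The following Python sketch is only one possible way to do it and rests on assumptions: it uses the long-standing `gdb PROGRAM PID` form of attaching together with gdb's `-batch` and `-x` options, it takes the binary path `/usr/local/bin/nagios` from elsewhere in this thread, and the helper names are invented for illustration. Whether the gdb shipped with a given FreeBSD release behaves the same way has not been verified here.

```python
import os
import subprocess
import tempfile
from typing import List


def gdb_backtrace_argv(program: str, pid: int, command_file: str) -> List[str]:
    """Build the argument list for a one-shot gdb attach.

    `gdb PROGRAM PID` attaches to the running process; -batch makes gdb exit
    after running the commands read from the file given with -x.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["gdb", "-batch", "-x", command_file, program, str(pid)]


def backtrace(program: str, pid: int) -> str:
    """Attach to pid, run `bt`, and return whatever gdb printed."""
    with tempfile.NamedTemporaryFile("w", suffix=".gdb", delete=False) as fh:
        fh.write("bt\n")  # the only gdb command we want to run
        path = fh.name
    try:
        result = subprocess.run(gdb_backtrace_argv(program, pid, path),
                                capture_output=True, text=True)
        return result.stdout + result.stderr
    finally:
        os.unlink(path)
```

For one of the spinning PIDs from the top listing this would be called as `backtrace("/usr/local/bin/nagios", 94068)`, run as root or as the nagios user. Passing the PID as a plain integer also avoids the shell-expansion pitfalls of building the command line by hand.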
Re: [Nagios-users] Problems with FreeBSD and Nagios
nagios# gdb --pid=$74056
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
/var/spool/nagios/rw/ is not a core dump: File format not recognized
(gdb) bt
No stack.
(gdb)

Given your ideas and some google work I seem to have found my problem:

http://lists.freebsd.org/pipermail/freebsd-hackers/2005-August/013247.html

Not a pretty discussion. :(

I'll try using a non-SMP kernel to see if it might help. If it doesn't, this pretty much renders Nagios useless on FreeBSD. (Which makes me wonder why they even bother maintaining it in ports?)

-----Original Message-----
From: Andreas Ericsson [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 14, 2006 2:26 AM
To: Jonathan Call
Cc: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] Problems with FreeBSD and Nagios

Jonathan Call wrote:
> I scanned the mailing list trying to find a solution for this. I found a brief discussion where someone had the same problem, but nothing was really said about what might be wrong.
>
> My system:
>   Dual 2.8GHz P4 processors
>   4GB of RAM
>   FreeBSD 6.1-RELEASE-p10
>
> Running processes:
>   Nagios 2.6 (installed from ports without embedded perl or nanosleep)
>   One mysqld process for the nagiosweb utility
>   A few NSCA daemon processes for passive checking
>   A backup tool daemon
>   Apache+modssl (latest from ports)
>   Basic FreeBSD services (sshd, sendmail, etc.)
>
> Problem:
> Random service and host check control processes will lock up and 'spin' on the CPU. This is really bad when a host check does it because it brings all checks to a halt. Nagios doesn't even seem to notice that all checks have gone stale. It will look like this in top:
>
>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> 94068 nagios     1 116    0  7500K  6748K CPU2   0 727:37 30.15% nagios
> 94082 nagios     1 116    0  7500K  6748K CPU2   0 734:28 32.55% nagios
> 94104 nagios     1 116    0  7500K  6748K CPU2   0 845:21 37.42% nagios
> 75338 nagios     5  20    0  7500K  6776K kserel 0  91:33  0.00% nagios
>
> In this example the main nagios pid is 75338. The hung service and/or host processes are the other ones. The service checks are almost entirely custom scripts, but the host check is a standard check_ping that comes with the nagios program.
>
> Any ideas on how to figure out which service or host check is hung? Or how to deal with this problem at all?

Host and service checks going into infinite loops wouldn't show up as Nagios processes in a CPU spinlock, as the Nagios check execution children just sit around and wait for the child to finish (or for 60 seconds to pass in the default config, after which they kill it off).

You've found a bug in Nagios which most likely was either introduced in the port of it, or is a result of library differences between FreeBSD and Linux. I wouldn't be all too surprised if it turns out that the FreeBSD pthread implementation disallows something that the Linux version allows. Note that this doesn't necessarily have to be a bug; Nagios doesn't use the pthread ABI in a way that is explicitly stated as safe, but the pthread implementations on Linux and most other unices are forgiving enough to make it work anyway.

It's also possible that this bug only triggers on dual-CPU systems with a particular library installed, as some kinds of timing problems and race conditions just don't happen on single-CPU systems.

What happens if you do

  $ gdb --pid=$(pidof spinning-nagios-process)
  (gdb) bt

?

--
Andreas Ericsson [EMAIL PROTECTED]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Re: [Nagios-users] Problems with FreeBSD and Nagios
Jonathan Call wrote:
> Given your ideas and some google work I seem to have found my problem:
> http://lists.freebsd.org/pipermail/freebsd-hackers/2005-August/013247.html
> Not a pretty discussion. :(

Nope. Definitely not. The problem for Nagios is that threading was added after the fact, so Nagios actually breaks some of the *strong* recommendations on what to do and what not to do in a threaded application after a fork(). The problem for *BSD and its implementation of the thread library is that Nagios actually works everywhere but on *BSD, and it *often* works there too, but not always. This often-but-not-always is usually a sign of a broken implementation, although often-but-not-always is also exactly the kind of error you'll run into when you do what Nagios does post-fork().

I don't know of any other program that has the same problem on *BSD, but it would be interesting to see if there's a common pattern so one can pinpoint exactly what causes the lock contention and races. From a practical point of view it would be best to patch it in the library, as that fix would work for all possible future problems as well, although it's technically more correct to fix it in Nagios.

Ugly discussion indeed.

> I'll try using a non-SMP kernel to see if it might help. If it doesn't, this pretty much renders Nagios useless on FreeBSD. (Which makes me wonder why they even bother maintaining it in ports?)

Out of curiosity, do you use passive checks, active checks or a mix of both in your setup?

--
Andreas Ericsson [EMAIL PROTECTED]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
[Nagios-users] Problems with FreeBSD and Nagios
I scanned the mailing list trying to find a solution for this. I found a brief discussion where someone had the same problem, but nothing was really said about what might be wrong.

My system:
  Dual 2.8GHz P4 processors
  4GB of RAM
  FreeBSD 6.1-RELEASE-p10

Running processes:
  Nagios 2.6 (installed from ports without embedded perl or nanosleep)
  One mysqld process for the nagiosweb utility
  A few NSCA daemon processes for passive checking
  A backup tool daemon
  Apache+modssl (latest from ports)
  Basic FreeBSD services (sshd, sendmail, etc.)

Problem:
Random service and host check control processes will lock up and 'spin' on the CPU. This is really bad when a host check does it because it brings all checks to a halt. Nagios doesn't even seem to notice that all checks have gone stale. It will look like this in top:

  PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
94068 nagios     1 116    0  7500K  6748K CPU2   0 727:37 30.15% nagios
94082 nagios     1 116    0  7500K  6748K CPU2   0 734:28 32.55% nagios
94104 nagios     1 116    0  7500K  6748K CPU2   0 845:21 37.42% nagios
75338 nagios     5  20    0  7500K  6776K kserel 0  91:33  0.00% nagios

In this example the main nagios pid is 75338. The hung service and/or host processes are the other ones. The service checks are almost entirely custom scripts, but the host check is a standard check_ping that comes with the nagios program.

Any ideas on how to figure out which service or host check is hung? Or how to deal with this problem at all?

Jonathan Call
Network Engineer - NTT/Verio
801.437.7476