Re: [Nagios-users] check for the absence of a service
Use the negate plugin to reverse a check.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Kyle Dippery [mailto:k...@engr.uky.edu]
Sent: 09 July 2009 16:08

Is there an easy way to use nagios to check for the absence of a service? I want to have nagios monitor SMTP and a few other services on hosts that aren't supposed to be running them, and tell me if they suddenly get turned on.

Is there a plugin for this, or a way to trick an existing plugin to make it work? I suppose if nothing else I can write a wrapper for check_smtp or check_tcp to swap the OK and CRITICAL return values, but it'd be much easier if someone else has already done it...

Cheers,
Kyle
--
Kyle Dippery
Engineering Computing Services

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
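
For readers who want to see what the wrapper Kyle describes would look like, here is a minimal sketch in shell. The function name invert_check is hypothetical and is not part of any Nagios package; the stock negate plugin mentioned in the reply already covers this case and is the better choice in practice.

```shell
#!/bin/sh
# invert_check: run a plugin command and swap its OK (0) and CRITICAL (2)
# exit codes. Hypothetical helper for illustration only.
invert_check() {
    output=$("$@")
    status=$?
    printf '%s\n' "$output"          # pass the plugin's text through unchanged
    case "$status" in
        0) return 2 ;;               # service answered: it should not be running
        2) return 0 ;;               # service refused or timed out: desired state
        *) return "$status" ;;       # WARNING (1) and UNKNOWN (3) pass through
    esac
}
```

It would be called as, for example, invert_check /usr/lib/nagios/plugins/check_smtp -H HOST, where the plugin path is an assumption about the installation. The plugin's output text is not rewritten, so it will still describe the connection failure even though the returned state is OK.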
Re: [Nagios-users] FW: NDO Utils Question
-----Original Message-----
From: Christopher McAtackney [mailto:crist...@gmail.com]
Sent: 25 June 2009 16:33

>> Results passing into the command pipe are stored. The relevant parameters are in nagios.cfg; they are external_command_buffer_slots and check_result_buffer_slots - by default these are set to 4096 (see documentation within the configuration file).
>
> Great, that's what I was hoping for. Do you have any experience of setting this buffer to much higher values? Not that I necessarily intend to, but it's always useful to know the effects of pushing the system to its limits.

Routinely we have external_command_buffer_slots set to 40960 and check_result_buffer_slots set to 61440; this is because we have had problems with our SQL server (it gets very busy for other databases), which delays its response to NDOUtils updates, which in turn fills up these buffers. You can see the current usage, the high-water mark and the configured value for these parameters by running the nagiostats command (the highest we have reached recently was about 25k buffer slots used).

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
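
As a quick way to check what the two buffer settings are on a given server, a small shell helper along these lines may be useful. It is only a sketch: the function name cfg_value is hypothetical, and the location of nagios.cfg varies between installations.

```shell
#!/bin/sh
# cfg_value FILE KEY: print the value assigned to KEY in a key=value
# configuration file such as nagios.cfg. Hypothetical helper.
cfg_value() {
    # Split each line on "=" and print the value when the key matches exactly;
    # comment lines never match because their first field starts with "#".
    awk -F= -v key="$2" '$1 == key { print $2 }' "$1"
}
```

For example, cfg_value /etc/nagios/nagios.cfg external_command_buffer_slots (the path is an assumption). Comparing the result with the usage figures reported by nagiostats shows how much headroom is left.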
[Nagios-users] FW: NDO Utils Question
-----Original Message-----
From: Christopher McAtackney [mailto:crist...@gmail.com]
Sent: 25 June 2009 14:51

> Hi everyone, I have a quick question about Nagios and NDOUtils I was hoping someone could answer. What happens if the database that NDO Utils is using becomes unavailable? (e.g. the server has crashed).

The answer depends a lot on whether you are using Nagios v2 or Nagios v3; we are using Nagios v2.11. My understanding is that in Nagios v2, the code that communicates with the event broker module is single-threaded. Therefore a problem with the SQL server can jam up Nagios to the extent that it effectively stops running commands. In Nagios v3, threading has been rewritten and this problem no longer exists.

> I'm assuming Nagios will continue to monitor and send notifications as normal, is this correct?

In Nagios v2 it seems that almost all activity is suspended when you are using NDOUtils and the MySQL server is unavailable; this continues until the SQL server is restored. One solution is to restart Nagios without the broker_module.

> What about the service check results that would normally be passed to NDO Utils and then stored in the database? Are they queued somewhere? And if so, how is the capacity of this queue defined? If they are not queued, what happens? Will NDO Utils just throw an error for each result it tries to store in the database and fails? Will this affect the core Nagios process?

Results passing into the command pipe are stored. The relevant parameters are in nagios.cfg; they are external_command_buffer_slots and check_result_buffer_slots - by default these are set to 4096 (see documentation within the configuration file).

> Hopefully someone can provide some insight.

Hopefully this has helped.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
Re: [Nagios-users] funny disk space message
From: Jeremiah Jester [mailto:jeremiahjes...@gmail.com]
Sent: 09 June 2009 18:33

> Anyone know why I'm getting this weird disk space message?
>
> * Nagios *
> Notification Type: PROBLEM
> Service: DISK SPACE
> Host: prod
> Address: (ip)
> State: WARNING
> Date/Time: Mon Jun 8 23:52:12 UTC 2009
> Additional Info: DISK WARNING - free space: / 23146 MB (32 0node=99

What happens if you run the command (as userid nagios) on the system, as in:

    ssh -l root prod
    su - nagios
    /usr/lib/nagios/plugins/check_disk -w WLIM -c CLIM -p /

where WLIM and CLIM are the warning and critical limits respectively?

What is the result of the command

    ssh -l root prod df -h /

Is the text given above exactly the output of the command?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Mixing different versions of Nagios
With a master/slave(s) Nagios configuration, is it possible to run with Nagios version 3 on the master and Nagios version 2 on the slaves, given that the communication is by NSCA (slave returning results to master) and NRPE (master checking that slaves are running)?

Is the other arrangement possible (i.e. Nagios 3 on slaves and Nagios 2 on the master)?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Nagios failed to notify and run event handler
Our configuration on the master server (running nagios 2.11) includes the NDOUtils module, which writes Nagios data into a set of MySQL tables. The MySQL server is in a separate rack from the Nagios master server. Late yesterday evening (Sunday) there was a network switch problem which meant (among other things that you do not need to know about) that the Nagios process lost contact with the MySQL server. From that point on no notifications were sent and no event handlers were run.

My assumption is that the loss of contact with the MySQL server caused the single-threaded part of the Nagios process to stall until contact was restored; as a result notifications and event handlers did not run, as they are also in the single-threaded part of the code. Is my assumption correct? If not, can anyone suggest an alternative explanation?

As far as I can tell the Nagios process continued to run, as the log continued to record events - however log switching (at midnight) did not happen (also in the single-threaded part of the code).

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: Difference betwen down and unreachable
Copy for list

-----Original Message-----
From: Wheeler, JF (Jonathan)
Sent: 18 July 2008 15:23
To: 'Stan Brown'
Subject: RE: [Nagios-users] Difference betwen down and unreachable

> -----Original Message-----
> From: Stan Brown [mailto:[EMAIL PROTECTED]
> Sent: 18 July 2008 15:09
> To: Wheeler, JF (Jonathan)
> Cc: Stewart Flood
> Subject: Re: [Nagios-users] Difference between down and unreachable
>
> On Fri, Jul 18, 2008 at 01:04:53PM +0100, Wheeler, JF wrote:
>>> -----Original Message-----
>>> From: [EMAIL PROTECTED] On Behalf Of stan
>>> Sent: 18 July 2008 12:26
>>>
>>> I had a machine that was restored from an old backup tape, and did not have its external facing NIC configured for a few days last week. Nagios reported it as down, rather than unreachable. How is this determined?
>>
>> Down means the individual system is down, that is, the host check has failed. Unreachable means it is not possible to test the host because its parent (maybe a switch) has failed.
>
> Thanks. I did not realize that Nagios was sophisticated enough to understand that a device could be dependent upon another device. Neat, I will look into how to configure this functionality.

Look at the parents directive under the host definition. Note that you can also define service dependencies - see the documentation for more details.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: nagios actions
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Melanie Pfefer
Sent: 15 July 2008 09:04

> Can nagios trigger an action when an alert is received? For example, if /var is at warning, can nagios execute a script that cleans the logs?

Look at event handlers in the documentation; these do exactly what you require. Your service needs to specify the event handler script and also have event_handlers_enabled=1.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
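
To make the event handler idea concrete, here is a hedged sketch of the decision logic such a script might contain. It assumes the command definition passes the $SERVICESTATE$ and $SERVICESTATETYPE$ macros as the first two arguments; the function name handle_var_alert and the action names it prints are hypothetical, and the actual log cleanup is left out.

```shell
#!/bin/sh
# handle_var_alert STATE STATETYPE: decide what an event handler should do
# for the /var disk check. Prints the chosen action instead of performing it.
handle_var_alert() {
    state=$1
    state_type=$2
    case "$state" in
        WARNING|CRITICAL)
            if [ "$state_type" = "HARD" ]; then
                echo "cleanup"   # problem confirmed after retries: clean the logs
            else
                echo "wait"      # SOFT state: Nagios is still retrying the check
            fi
            ;;
        *)
            echo "none"          # OK or UNKNOWN: nothing to do
            ;;
    esac
}
```

Acting only on HARD states is a design choice, not a requirement: it avoids running the cleanup on a transient failure, at the cost of reacting a few check intervals later.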
Re: [Nagios-users] Monitoring large (ish) numbers of servers with exceptions to the rules...
-----Original Message-----
From: nagios-users On Behalf Of Matthew Macdonald-Wallace
Sent: 17 June 2008 13:14

> I currently help maintain and monitor around 50 servers across various parts of the UK using Nagios 2. At the moment, we have a configuration file for each host (%hostname%.cfg) and in that file we specify all the services for the named host.
>
> We are trying to reduce the number of configuration files as we take on more and more servers, because there are a large number of checks that we need to roll out to all servers and we feel that we are duplicating our workload.
>
> I'm open to ideas on how to achieve this, however my thoughts were a setup along the lines of the following:
>
> - A master host template is created in which all services are defined for a host.
> - If a check does not need to be run for a given host (for example it is not a web server), a stanza is added to that particular host's config file that effectively tells nagios "don't check for this service on this host".
>
> I've tried defining all the services in a master templates file and this works perfectly, however when I come to exclude certain services, I am at a loss on how to do it. Initially I tried adding a stanza with the same service name and register 0 as one of the options, however this didn't work.
>
> We have used HostGroups in the past to achieve a similar goal, however we ran into the issue that whilst we need to check the CPU usage on all of the servers, a few of the servers that we monitor can take a lot more of a beating than the majority. This led to us defining the CPU checks on a per-host basis, because when we defined the check separately from the hostgroup for the more powerful servers we were presented with a load of errors regarding duplicate service names.
>
> I hope I've made myself clear on what we're after and I look forward to receiving your input on this.

One thing that I use in the configuration that I maintain is to have something like this:

    define service{
        use             generic-hung-mounts
        hostgroup_name  experiments
        hosts           !lcg0448
        contact_groups  experiments
        }

where lcg0448 is a host in host group experiments and I want to apply the generic-hung-mounts check to all hosts in that group except for lcg0448. This can lead to configuration like this:

    define service{
        use             check-pbs-offline
        hostgroup_name  workers
        hosts           !lcg0614,!lcg0617,!lcg0618,!lcg0626
        contact_groups  tier1a
        }

    define service{
        use             check-pbs-offline
        hosts           lcg0614,lcg0617,lcg0618,lcg0626
        contact_groups  tier1a,grid-team
        }

where the only difference is that the hosts in the second definition have a second contact group.

HTH

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] No output from plugin
I have a new command which I have implemented as a Nagios plugin. Running the command as user nagios on the client gives the correct output (currently the string : test ()) and return code 2, which is what I require. Running the command

    /usr/bin/nagios/plugins/check_nrpe -H CLIENT -t 30 -c COMMAND

on a Nagios server (in this case a slave server) also gives the correct output and return code. Running the command as a plugin gives the reply (No output from plugin).

I have checked that the script puts its output to standard output and have added the line

    use lib /usr/lib/nagios/plugins; use plugins

to the command script (as suggested by the FAQ) without any change. Does anyone have any suggestions to correct this problem?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Nagios alarms going from WARNING to CRITICAL state
We are running Nagios 2.11. When a check fails, Nagios configuration allows a number of retries of the check before the error becomes HARD; we find that this works well for checks which start OK and go CRITICAL. However, does the retry mechanism apply when a check goes from WARNING to CRITICAL? In other words, if a check is in WARNING state and then goes CRITICAL, does it first become CRITICAL/SOFT, or does it become CRITICAL/HARD straightaway?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: Problems with nagios
-----Original Message-----
Sent: 14 March 2008 10:52
To: nagios-users@lists.sourceforge.net

In the past I have reported problems when our master server has failed with Out of memory problems caused by all server memory and swap space being used up. I have largely (but not completely) solved these by increasing the number of Command and Check result buffers.

Regular readers of the list will remember that I reported this problem, which was affecting our nagios installation. I finally solved the problem about a month ago.

The key is that I am using the NDOUTILS package to write the Nagios logs and configuration to a MySQL database. On the MySQL server there is a cron job which uses a program called mysqlhotcopy to create a snapshot of all of the MySQL databases. It does this by locking the tables whilst they are being copied. This causes the Nagios daemon on the master server to wait until the latest write request to MySQL is completed.

Whilst the Nagios daemon is waiting, the NSCA daemon is busy writing results to the command file, which cannot be processed until the MySQL table locks are released. By then there are too many commands to be processed before the command reaper starts again. This uses up command buffer slots and eventually the system runs out of memory and swap space, processes are killed by the OOM handler (Linux OS) and possibly the system crashes because all memory is used up.

The solution to the problem was to exclude the nagios database from consideration by the mysqlhotcopy backup (there is a configuration option to do this).

The lesson to learn is that when there is a problem you need to consider what is happening on all the computer systems involved in Nagios.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Nagios service and host event handlers during service and host down time
We have started using Nagios service and host event handlers to trigger 24 hour callouts for our critical hosts and services. However, today we had a situation where a host was put into downtime, but callouts were triggered for a number of services on this host.

Does a host downtime period have any effect on service checks on that host? Do we have to put the services into downtime as well if the host is still up and known to Nagios?

I did check the on-line documentation but I could not see any explanation of situations like this. Any help would be much appreciated.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
Re: [Nagios-users] nrpe timeout after 10 seconds
-----Original Message-----
From: nagios-users On Behalf Of Meylikhov
Sent: 13 March 2008 12:04

> I have 4 linux servers that are monitored by nagios. Sometimes I get a notification on my contact e-mail: CHECK_NRPE: Socket timeout after 10 seconds. Notifications stating that nrpe timed out come for ALL services and for ALL hosts randomly every 1-2-3 hours. Then I get another notification stating that everything is fine. These flapping events take place every 1-2-3-4-5 hours randomly.
>
> Nagios and the monitored servers are situated in the same network, therefore I have no intermediary between the monitored servers and nagios. Can you help me to diagnose what's wrong? Can I increase the socket timeout variable on my nagios server? I think it could help.

There is a timeout on the check_nrpe command which you can set using the -t option (default is 10 secs). Try adding -t 30 to your check_nrpe command, probably in checkcommands.cfg.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
Re: [Nagios-users] Checking for a stale NFS connection
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: 29 January 2008 21:28

> Anyone have an idea on how to have nagios check for a stale NFS network connection?

We use the attached plugin.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory

Attachment: check_stale_nfs.sh
Description: check_stale_nfs.sh
[Nagios-users] Timeouts for send_nsca program
In /var/log/nagios/nagios.log on (at least) one of my slave servers, I am seeing messages like:

    [1191905959] Warning: OCSP command '/usr/lib/nagios/plugins/tier1/submit_check_result.sh HOST SERVICE_CHECK OK MESSAGE' for service SERVICE NAME on host HOST timed out after 5 seconds

There have been 712 occurrences today (so far). Can anyone offer an explanation? As far as I can tell there is no configuration to increase the timeout limit (can it be increased by installing from source?), but perhaps the message indicates another problem (network?).

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: Problem with NDOUtils 1.4b6 and MySQL
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: 02 October 2007 14:52

> Thank you Mr Wheeler and Hugo for your advice. I have snipped the output of the suggested command from Mr. Wheeler.
>
>     checking for mysql_store_result in -lmysqlclient... no
>     *** MySQL library could not be located... **

Do you have the mysql-devel RPM installed? This RPM contains the /usr/include/mysql files and would be required by the build.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: Problem with NDOUtils 1.4b6 and MySQL
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: 02 October 2007 14:17

> I have been trying this for a day now and it is time to ask for some help. I have included the full output of the configure, as well as RPM output and directory listings. Any help would be greatly appreciated. It seems that NDO cannot find what it is looking for in regards to mysql, yet AFAIK everything is there. Please advise if something is missing, or if I should compile mysql from source, or any other fix.
>
> Here is all the relevant information I can think of this morning:
>
>     ~/ndoutils-1.4b6 # ./configure --with-mysql-lib=/usr/lib/mysql

Use the following configure command:

    ./configure --with-mysql-lib=/usr/lib/mysql --with-mysql-inc=/usr/include/mysql --disable-pgsql

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: Network tuning for Nagios with slave servers
-----Original Message-----
From: Andreas Ericsson [mailto:[EMAIL PROTECTED]
Sent: 07 September 2007 13:03

> Wheeler, JF (Jonathan) wrote:
>> Our configuration is quite large (830 hosts, 160700+ services),
>
> You run more than 193 checks against each host? Good gods, you must be *really* curious about the state of those hosts :)

Oops, I meant 16700+ services!

> Nope, but you could try doing
>
>     sysctl -w net.ipv4.tcp_fin_timeout=30
>
> to halve the default tcp timeout in the kernel, which should reduce the number of half-open connections you have.

Thanks for the suggestion. We are beginning to suspect a switch issue, as there are other applications that are suffering packet loss in various ways.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Network tuning for Nagios with slave servers
Our configuration is quite large (830 hosts, 160700+ services), so we have implemented a master/slave configuration for Nagios (the Nagios servers are running Linux). The master server only runs a check if it becomes stale, i.e. it should have been checked by a slave but no result has been received. However, I find that (for example) in the last day's log there are 80,000+ warning messages saying the master has run a check because it has become stale.

On further investigation I find that on all of our 5 slaves the command netstat shows that there are a large number of TCP sockets in CLOSE_WAIT state (more .

My question is, has anyone done any network tuning to improve Nagios network performance?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
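
For anyone wanting to repeat the netstat observation above, a short sketch of how the sockets can be counted per TCP state follows. The function name count_tcp_states is hypothetical, and it assumes the usual Linux netstat layout in which the state is the sixth column.

```shell
#!/bin/sh
# count_tcp_states: read "netstat -tan" style output on standard input and
# print the number of sockets in each TCP state, most frequent first.
count_tcp_states() {
    # Only rows whose first field starts with "tcp" are sockets; the banner
    # and column header lines are skipped. Field 6 holds the state name.
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print count[s], s }' | sort -rn
}
```

It would be used as netstat -tan | count_tcp_states on each slave; a CLOSE_WAIT count that keeps growing between runs points at sockets that are not being closed locally.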
[Nagios-users] FW: Network tuning for Nagios with slave servers
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: 07 September 2007 14:22

> Could always try:
>
>     net.core.rmem_max = 16777216
>     net.core.wmem_max = 16777216
>     net.ipv4.tcp_rmem = 4096 87380 16777216
>     net.ipv4.tcp_wmem = 4096 65536 16777216
>     net.ipv4.tcp_timestamps = 0
>     net.ipv4.tcp_sack = 0

Thanks for the suggestions. We are beginning to suspect a switch issue, as there are other applications that are suffering packet loss in various ways.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Using contributed check
We have recently started trying to use the contributed plugin /usr/lib/nagios/plugins/contrib/check_snmp_process_monitor.pl to run checks from our Linux Nagios servers on a Solaris system. Running it with perl as user nagios gives successful output:

[EMAIL PROTECTED] ~]# su - nagios
-sh-3.00$ cd /usr/lib/nagios/plugins/contrib/
-sh-3.00$ perl check_snmp_process_monitor.pl -H 130.246.183.131 -C public -e arrayd -w 0,3 -c 1,2 -s --memory --cpu
OK - 1 process(es) found resembling 'arrayd'|count=1:memory=1216:cpu=0.08

However, Nagios returns this text:

**ePN /usr/lib/nagios/plugins/contrib/check_snmp_process_monitor.pl: Reference found where even-sized list expected at (eval 1) line 194,.

I understand that the problem is that the code is not compatible with the Embedded Perl Interpreter in Nagios, but can someone help me understand this further and solve the problem?

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
[Nagios-users] FW: Monitoring service for every machine in hostgroup EXCEPT ONE
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Kelly Jones
Sent: 23 July 2007 02:55

I did not see any reply to this message, so here is my effort:

> I've created a hostgroup of 20 machines, and want to monitor 10
> services on each machine (easy). I now want to monitor an 11th service
> on 19 of the 20 machines. What's the best way to do this? Two ugly ways
> I don't like:
>
> % Create a separate hostgroup for the 19 machines I do want to monitor.
> % Monitor the service on all 20 machines, but schedule infinite
>   downtime for the service on the 20th machine.
>
> Is there a better way?

Yes. In the configuration for the service, use an entry like this (note that the directive is host_name, and that inline comments in object definitions start with a semicolon):

define service{
    use             generic-service   ; From a template, or add other options
    hostgroup_name  mygroup           ; Group of 20 machines
    host_name       !nothisone        ; Machine not to be tested (note the !)
    contact_groups  us
    }

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
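The effect of combining a hostgroup with a negated host can be modelled in a few lines. This is only an illustration of the behaviour described in the post, not Nagios's own parser; the function name `resolve_hosts` and the host names `host01` to `host19` are hypothetical.

```python
def resolve_hosts(hostgroup_members, host_name_entries):
    """Return the hosts a service applies to.

    Starts from the hostgroup members, adds plainly named hosts, and
    removes any host listed with a leading '!'.
    """
    included = set(hostgroup_members)
    excluded = set()
    for entry in host_name_entries:
        entry = entry.strip()
        if entry.startswith("!"):
            excluded.add(entry[1:])
        elif entry:
            included.add(entry)
    return sorted(included - excluded)


# Hypothetical group of 20 machines, one of which must not get the service.
MYGROUP = ["nothisone"] + [f"host{n:02d}" for n in range(1, 20)]

if __name__ == "__main__":
    targets = resolve_hosts(MYGROUP, ["!nothisone"])
    print(len(targets), "hosts get the 11th service")
```

The same model covers several exclusions at once, for example `resolve_hosts(MYGROUP, ["!host01", "!host02"])`.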
[Nagios-users] FW: Problems with distributed setup, master overload?
-Original Message-
From: nagios-users On Behalf Of Jeffrey Lensen
Sent: 10 June 2007 08:28

> I recently extended our distributed Nagios setup of 1 master and 2
> distributed slaves (in which the master also had a lot of checks
> running), to 1 master and 5 distributed slaves (in which the master
> does no checking at all, except for host checks). This setup has 556
> hosts and roughly 7000 service checks. Ever since I modified this
> setup, the Nagios master host has been giving me problems. The symptoms:
>
> - When starting both Nagios and NSCA, I see NSCA accepting checks in my
>   logfiles, but none get processed by Nagios.
> - After a few minutes NSCA processes start to build up, increasing by
>   5-10 processes per second. In a few minutes it reaches a few thousand
>   processes and the machine starts hanging.
> - Sometimes the number of Nagios processes starts increasing, instead
>   of the NSCA processes. Same result, the machine starts hanging.

I have seen similar problems, though in my case (1 master, 2 slaves, 824 hosts, 16000+ services) the queued NSCA processes are eventually flushed. However, the Nagios master server also suffers from memory leaks; eventually (after a period of 1 - 5 days) it either crashes with a kernel panic because there is no free memory, or reaches a state where the kernel has killed all useful processes (e.g. nagios, nsca, sshd, ntpd, etc.) in an attempt to cure OOM (Out Of Memory) problems.

Interestingly, trying to strace the first daughter nsca process seems to bring everything to life, and the queue of NSCA processes quickly flushes.

I have tried running nagios with option -s to get configuration recommendations, and nagiostats to get usage information, on both master and slave servers, but they do not reveal anything useful. My current plan is to introduce 3 more slave servers, as I have heard that this helps. Any comments would be helpful to me as well.
Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
[Nagios-users] Service check timing out
I have a service check that takes more than 60 seconds to run. Despite calling check_nrpe with option -t 120, the check times out with the message "NRPE: Command timed out after 60 seconds". The parameter service_check_timeout in nagios.cfg is set to 120 seconds as well. Any ideas? Is there a maximum timeout in check_nrpe?

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
[Nagios-users] FW: Question about Freshness Checking
There are several things that I do in situations like this, usually on the master server:

a) Acknowledge the service or host problem, which will prevent notifications.
b) Change the configuration to suppress the service check for this host, or remove the host from the configuration, and restart Nagios on both master and slave (distributed) servers.
c) I believe that you can also schedule downtime for either the host or a specific service.

Of course, in each situation above you have to remember to reverse the change once the service/host is available again.

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory

-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Jeff Shumard - DefenseWeb Technologies
Sent: 22 May 2007 16:17
To: nagios-users@lists.sourceforge.net
Subject: [Nagios-users] Question about Freshness Checking

I didn't hear anything back on my issue or question. If anyone has information on this I would appreciate it. Thank you

-Original Message-
From: [EMAIL PROTECTED]

I am running a distributed Nagios configuration. On each of my passive service checks I am also doing freshness checks, just in case the distributed host goes down and can't run the check. I am able to log into the distributed host's web interface and shut off active checks if I don't want to run checks for a while on a specific host and its services, with one click to disable active checks for all services. This works without any problems, but once my freshness threshold is hit the centralized Nagios host starts doing the active checks, because it doesn't receive an update from the distributed hosts. I am aware this is what should be happening, and it is working great.

Is there a way to disable the freshness check for all the services on a host, just like you can for active checks? I know that if I shut off receiving passive checks for one service, this disables the freshness checks. Has someone written a patch, or does anyone know how to activate this feature, to disable passive checks for all services on a host through the Nagios CGI?

Jeff
[Nagios-users] FW: wild cards with exceptions?
-Original Message-
From: nagios-users On Behalf Of dave stern - e-mail.pluribus.unum
Sent: 08 May 2007 16:12

> I'm trying to streamline my nagios config using wildcards.
> Unfortunately, not all the services I wish to define via wildcard follow
> a clean set of rules. Is it possible to define a service with a host
> list of something like
>
>   *,!linux1, !linux2
>
> I suspect the answer is no and what I'd need to do is use a combination
> of hostgroups and hosts, e.g.
>
>   define service { hostgroup_name unix, ultrix, sco service_description }
>   define service { host_name host1, host2, host3, host4 ... }
>
> Anyone find a way around this?

I have found that within the same service definition I can use both hostgroup_name and host_name directives. Specifically, I have definitions like:

define service {
    service_description ...
    hostgroup_name      unix, ultrix, sco
    host_name           !host1, !host2, !host3, etc
    }

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
[Nagios-users] Master and slave servers for Nagios
As I have reported in the past, I have 2 slave servers and a master server; all checks should be run from the slave servers and passed back to the master server. I have recently been trying to understand why the master server still has kernel out-of-memory problems, such that the kernel starts killing active processes and, in some cases, panics because there are no more processes to kill (this happens perhaps once or twice per week, usually around 4:50 - 5:10 in the morning).

As part of my investigations I have noticed that, for a typical host, 40% of tests are reported from the slave and 60% are run by the master. I can tell this because 40% of the messages for this typical host in /var/log/nagios on the master server begin "EXTERNAL COMMAND" and 60% begin "Warning:". My question is: why should this be?

Here is a copy of nagios.log from the master server for one test of one host for today (so far):

[1177369200] CURRENT SERVICE STATE: csflnx119;SPACE_TMP;OK;HARD;1;DISK OK - free space: /tmp 672 MB (70% inode=99%):
[1177369894] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 41 seconds (threshold=1817 seconds). I'm forcing an immediate check of the service.
[1177370925] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 672 MB (70% inode=99%):
[1177373014] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=2052 seconds). I'm forcing an immediate check of the service.
[1177374874] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=1816 seconds). I'm forcing an immediate check of the service.
[1177376734] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 41 seconds (threshold=1817 seconds). I'm forcing an immediate check of the service.
[1177377158] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 672 MB (70% inode=99%):
[1177379494] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 33 seconds (threshold=2305 seconds). I'm forcing an immediate check of the service.
[1177381354] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 39 seconds (threshold=1818 seconds). I'm forcing an immediate check of the service.
[1177383214] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=1816 seconds). I'm forcing an immediate check of the service.
[1177387073] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177389102] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 13 seconds (threshold=5089 seconds). I'm forcing an immediate check of the service.
[1177390507] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177392635] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 11 seconds (threshold=2118 seconds). I'm forcing an immediate check of the service.
[1177394495] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 39 seconds (threshold=1818 seconds). I'm forcing an immediate check of the service.
[1177396362] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 36 seconds (threshold=1823 seconds). I'm forcing an immediate check of the service.
[1177397210] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177399813] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 47 seconds (threshold=2562 seconds). I'm forcing an immediate check of the service.
[1177401674] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 40 seconds (threshold=1818 seconds). I'm forcing an immediate check of the service.
[1177403749] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 28 seconds (threshold=1931 seconds). I'm forcing an immediate check of the service.
[1177404093] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177406037] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 42 seconds (threshold=1902 seconds). I'm forcing an immediate check of the service.
[1177410112] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 184 seconds (threshold=2853 seconds). I'm forcing an immediate check of the service.
[1177410863] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 660 MB (68% inode=99%):
[1177413485] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 30 seconds (threshold=2579 seconds). I'm forcing an immediate check of the service.
[1177415948] Warning: The results of service
Re: [Nagios-users] Master and slave servers for Nagios
-Original Message-
From: Jason Qualkenbush [mailto:[EMAIL PROTECTED]
Sent: 25 April 2007 11:55

> Wheeler, JF (Jonathan) wrote:
>> As I have reported in the past, I have 2 slave servers and a master
>> server; all checks should be run from the slave servers and passed
>> back to the master server. I have recently been trying to understand
>> why the master server still has kernel out-of-memory problems, such
>> that the kernel starts killing active processes and, in some cases,
>> panics because there are no more processes to kill (this happens
>> perhaps once or twice per week, usually around 4:50 - 5:10 in the
>> morning). As part of my investigations I have noticed that, for a
>> typical host, 40% of tests are reported from the slave and 60% are run
>> by the master. I can tell this because 40% of the messages for this
>> typical host in /var/log/nagios on the master server begin
>> "EXTERNAL COMMAND" and 60% begin "Warning:". My question is: why
>> should this be? Here is a copy of nagios.log from the master server
>> for one test of one host for today (so far):
>
> Sounds like this has more to do with the freshness of the passive
> check. If the master server thinks the check isn't fresh, it will then
> run an active check to see for itself. I'd tune the freshness, and keep
> in mind the scheduling of the checks. If you configure your freshness
> to expire at five minutes, and the slave server schedules that check
> for once every six minutes, you are going to get behaviour like you
> mentioned.

Thanks for your reply. However, the tests are scheduled to run every 30 minutes on both master and slave servers (confirmed by checking the retention.dat file). If you look at the original message you will see that the master server is correctly running the command through freshness checking (Warning messages) every 30 minutes, but the slave results (EXTERNAL COMMAND messages) arrive at longer intervals, though roughly at some multiple of 30 minutes.

What are the possibilities for results from commands issued by the slave getting lost? Why are OK results not recorded in the slave server logs?

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
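The 40%/60% split discussed in this thread can be tallied directly from nagios.log instead of being estimated by eye. The sketch below assumes the two message formats seen in the log excerpt from the original post (passive results logged as "EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT" and forced checks logged as "Warning: The results of service ... are stale"); the function name `tally` is made up for this example.

```python
import re
from collections import Counter

# Passive result submitted by a slave: host comes first, then service.
PASSIVE = re.compile(
    r"^\[\d+\] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;([^;]+);([^;]+);"
)
# Freshness failure on the master: service comes first, then host.
STALE = re.compile(
    r"^\[\d+\] Warning: The results of service '([^']+)' on host '([^']+)' are stale"
)


def tally(lines, host, service):
    """Count passive results versus stale-forced checks for one service."""
    counts = Counter()
    for line in lines:
        match = PASSIVE.match(line)
        if match and (match.group(1), match.group(2)) == (host, service):
            counts["passive"] += 1
            continue
        match = STALE.match(line)
        if match and (match.group(2), match.group(1)) == (host, service):
            counts["stale"] += 1
    return counts


# The first few entries from the log excerpt posted in this thread.
SAMPLE = [
    "[1177369200] CURRENT SERVICE STATE: csflnx119;SPACE_TMP;OK;HARD;1;DISK OK - free space: /tmp 672 MB (70% inode=99%):",
    "[1177369894] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 41 seconds (threshold=1817 seconds). I'm forcing an immediate check of the service.",
    "[1177370925] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 672 MB (70% inode=99%):",
    "[1177373014] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=2052 seconds). I'm forcing an immediate check of the service.",
    "[1177374874] Warning: The results of service 'SPACE_TMP' on host 'csflnx119' are stale by 43 seconds (threshold=1816 seconds). I'm forcing an immediate check of the service.",
    "[1177377158] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;csflnx119;SPACE_TMP;0;DISK OK - free space: /tmp 672 MB (70% inode=99%):",
]

if __name__ == "__main__":
    counts = tally(SAMPLE, "csflnx119", "SPACE_TMP")
    total = sum(counts.values())
    for kind, number in counts.items():
        print(f"{kind}: {number} ({100 * number / total:.0f}%)")
```

Feeding the whole of nagios.log through `tally` for several services would show whether the slave results go missing uniformly or only for particular hosts.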
Re: [Nagios-users] no output returned from plugin
-Original Message-
From: nagios-users On Behalf Of Valdinger, Stephen (DOV, MSX)
Sent: 25 April 2007 14:21

> Any ideas as to what could be causing this???

Usually it is because the plugin has returned nothing on STDOUT. So: has the plugin worked before? Does it work if run by hand, as user nagios, on the system being tested? Is the test failing in an unusual way?

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
Re: [Nagios-users] Option append_to_file in nsca.cfg
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Marc Powell
Sent: 13 March 2007 16:17

>> -Original Message-
>> From: nagios-users On Behalf Of Wheeler, JF (Jonathan)
>> Sent: Tuesday, March 13, 2007 9:59 AM
>> To: nagios-users@lists.sourceforge.net
>> Subject: [Nagios-users] Option append_to_file in nsca.cfg
>>
>> As I have said before, my configuration consists of 1 master server
>> and 2 slaves with about 700 hosts and 16000 checks. In the file
>> nsca.cfg, which configures the nsca daemon, there is an append_to_file
>> option which is (by default) set to 0 for writing to the command file
>> rather than 1 for appending to it. Please would someone explain why
>> appending to the command file is deprecated. I ask because I can have
>> several
>
> Semi-educated commentary follows --
>
> The 'command file' is more properly named a 'command pipe'. It's not a
> real file and therefore appending to it makes no sense. A pipe is
> essentially a FIFO buffer. Data is written to it by one process and
> read by another in a sequential fashion. If the reading process can't
> keep up with the writing process, your kernel will buffer the writes up
> to a point depending on the OS. For Linux kernels before 2.6.11 the
> buffer was 4096 bytes. From 2.6.11 onward, the buffer is 65536 bytes.
>
> Nagios also has its own internal buffers to help process the pipe
> faster. With nagios-2.7, these are controlled by the
> external_command_buffer_slots option in nagios.cfg. You can also
> control how often nagios checks for data in the pipe with the
> command_check_interval setting. You certainly want that to be -1 and
> not every 4 seconds. -1 tells nagios to check as often as possible.
>
> Depending on your check frequency, it sounds like nagios isn't able to
> keep up with your check submissions, almost certainly related to your
> checking the pipe every 4 seconds only. At ~100 bytes per check, you
> could only accept 40 results in 4 seconds before dropping. If you're
> doing 16,000 checks every 5 minutes that's ~213 check results every 4
> seconds. You can do the math based on your actual sizes/intervals...
>
> Verify that you have a good amount of buffer slots (use nagiostats to
> see current utilization) and that you're checking external commands as
> fast as possible. I'm only doing 1/4 of the passive checks you are, so
> you may be hitting limits that I haven't experienced yet, but it
> doesn't appear so at this point.

Sorry, I think that I have confused the discussion by not appreciating the difference between service_reaper_frequency (which is 4 seconds) and command_check_interval (which is -1).

After restarting nagios this morning (Thur 15/03; the master server had panicked due to lack of memory), I issued the command wc -l /var/log/nagios/rw/nagios.cmd and got the answer 1003 (this is the command pipe). If I understand you correctly, there were 1003 commands in the pipe waiting to be processed by the server (understandable, as the master server had just restarted and the slaves had plenty of commands waiting to be processed), but running wc actually discarded these commands by reading through the pipe.

At present I am running nagios 2.6. I want to upgrade to nagios 2.8, but as I also use ndoutils I need to compile the latest version of that and update the SQL tables that it writes. If the problem does not go away with the latest version of the server, I will raise the issues again, but any other comments would be much appreciated.

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
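The back-of-envelope arithmetic in the reply above can be redone for other sizes and intervals. This is just the calculation from the post written out; the figure of 100 bytes per result is the rough estimate quoted there, and the function names are invented for this sketch.

```python
def results_per_window(total_checks: int, check_period_s: float, window_s: float) -> float:
    """Average number of check results arriving during one polling window."""
    return total_checks / check_period_s * window_s


def results_that_fit(buffer_bytes: int, bytes_per_result: int) -> int:
    """How many whole results the kernel pipe buffer can hold at once."""
    return buffer_bytes // bytes_per_result


if __name__ == "__main__":
    # 16,000 checks every 5 minutes, pipe read once every 4 seconds.
    arriving = results_per_window(16000, 5 * 60, 4)
    print(f"results arriving per 4 s window: {arriving:.0f}")

    # Pipe capacity before and since Linux 2.6.11, at ~100 bytes per result.
    for capacity in (4096, 65536):
        fit = results_that_fit(capacity, 100)
        print(f"{capacity}-byte pipe holds about {fit} results")
```

With the numbers from the thread, a 4096-byte pipe read only every 4 seconds holds far fewer results than arrive in that window, which is the point being made about polling the pipe as often as possible.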
[Nagios-users] Option append_to_file in nsca.cfg
As I have said before, my configuration consists of 1 master server and 2 slaves with about 700 hosts and 16000 checks. In the file nsca.cfg, which configures the nsca daemon, there is an append_to_file option which is (by default) set to 0 for writing to the command file rather than 1 for appending to it. Please would someone explain why appending to the command file is deprecated.

I ask because I can have several nsca processes running every second; if each of them writes to the command file, the output from previous nsca processes will have been lost. This would explain why my master server issues so many tests itself because test results become stale. I should add that the command file is a pipe and it is reaped every 4 seconds.

Perhaps I am misunderstanding something about the nature of a pipe? Or perhaps the documentation in the configuration file is misleading/wrong?

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
[Nagios-users] Out of memory failures on Nagios master server
My configuration has a master server and 2 slave servers with about 730 hosts and 16000 service checks. All our systems are running Linux. For some time now the master server has been running out of memory between 4:50 and 5:00, such that the server either kernel panics (rarely) or kills all useful processes.

To investigate the problem I have been using at jobs to run "vmstat 15 160" and "date; ps -ef; sleep 15" (160 times) to record system activity at 15 second intervals for 40 minutes, i.e. from 4:30 until 5:10. This has revealed that the problem is caused by:

a) nsca processes starting and not completing (today's maximum count was 4447) until they all suddenly complete at about 4:50. During this time vmstat shows that memory usage increases slowly, but it is all released when the nsca processes run.
b) About 10 minutes later there are many separate nagios processes which do not complete (183); as the nagios process is quite large, this fills system memory and swap space, which effectively kills the system.

You might think, given the time at which this is happening, that it is affected by cron, but for this morning I had retimed cron.daily to run at 10:02 rather than 4:02.

Has anyone seen anything like this? I can say from the master server logs that no tests seem to be recorded from about 4:00 onwards; if the system survives, they start again after that. The server is a blade server with a single CPU, but it is running with hyper-threading on (if that makes a difference); the kernel is 2.6.0-42.0.8. Any suggestions would be appreciated.

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
Re: [Nagios-users] FW: Reports using NDOutils?
-Original Message-
From: nagios-users On Behalf Of Marcel Mitsuto Fucatu Sugano
Sent: 15 February 2007 14:58

> On Thu, 2007-02-15 at 10:17 +, Wheeler, JF (Jonathan) wrote:
>> -Original Message-
>> From: nagios-users On Behalf Of Marcel Mitsuto Fucatu Sugano
>> Sent: 14 February 2007 19:27
>>
>> (big snip)
>>
>>>> 2) Is it crazy to think I can keep *all* the NDO data forever?
>>>> (~500 hosts / 6000+ srvcs)
>>>
>>> Well, considering that only state changes matter, it isn't that crazy.
>>
>> The only place where I have had to do anything is with the logentries
>> table which (in our case) has written more records than is allowed by
>> MySQL and sometimes generates MySQL errors. Deleting old entries
>> solves the problem (I have a script that deletes entries more than 6
>> weeks old).
>
> Doesn't deleting old entries whack the historical state change data?

I do not see a need to keep the log data for more than six weeks in the SQL tables; these are separate from the log files on the Nagios server. Note that there is no cleanup of the Nagios logs (as far as I am aware), so these need to be cleared out every so often as well. I have a separate script which compresses all log files except the last six and only keeps 190 files (about 6 months of data) in the log archives directory.

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
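The clean-up script mentioned above was not posted, so the following is only a guess at its general shape. It uses an in-memory SQLite database as a stand-in for MySQL so that it runs anywhere, and it assumes a table called `nagios_logentries` with a `logentry_time` column; check those names against the NDOUtils schema actually installed. With a MySQL driver the `?` placeholder would normally be `%s`.

```python
import sqlite3
from datetime import datetime, timedelta


def purge_old_logentries(conn, now, keep_weeks=6):
    """Delete log entries older than keep_weeks; return the number removed."""
    cutoff = (now - timedelta(weeks=keep_weeks)).strftime("%Y-%m-%d %H:%M:%S")
    cursor = conn.execute(
        # Table and column names are assumptions, see the note above.
        "DELETE FROM nagios_logentries WHERE logentry_time < ?",
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount


def demo():
    """Build a tiny stand-in table, purge it, and report (deleted, remaining)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nagios_logentries (logentry_time TEXT, logentry_data TEXT)"
    )
    conn.executemany(
        "INSERT INTO nagios_logentries VALUES (?, ?)",
        [
            ("2007-01-01 04:50:00", "old entry"),
            ("2007-02-01 04:50:00", "recent entry"),
            ("2007-02-14 19:27:00", "newest entry"),
        ],
    )
    deleted = purge_old_logentries(conn, now=datetime(2007, 2, 15))
    remaining = conn.execute("SELECT COUNT(*) FROM nagios_logentries").fetchone()[0]
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

Only the logentries table is touched here, which matches the point made in the reply: the state history lives elsewhere and the raw log lines also remain in the Nagios log files.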
[Nagios-users] FW: Reports using NDOutils?
-Original Message-
From: nagios-users On Behalf Of Marcel Mitsuto Fucatu Sugano
Sent: 14 February 2007 19:27

> On Wed, 2007-02-14 at 09:58 -0800, Trask wrote:
>> Are there any projects, addons, or home-made scripts out there that
>> people are using that pull data from the NDO output and create
>> reports? I've done a good bit of searching and haven't found anything,
>> but it seems like such a logical thing to have that I figure someone
>> has done this already.
>
> I am waiting for the same sort of thing as well. I am doing some
> research around NDOUtils too, 'cause I'll need Nagios to watch over
> Service Level Agreement thresholds.

I have a PHP script which gets information from the NDOUtils MySQL tables to display machine status on a web page (we have a home-grown script which provides a single-page display of our farm of 800+ servers). I also plan to write scripts that will get plain-text output from the MySQL tables for use when administrators do not have access to the Nagios web pages.

(snip)

>> 2) Is it crazy to think I can keep *all* the NDO data forever?
>> (~500 hosts / 6000+ srvcs)
>
> Well, considering that only state changes matter, it isn't that crazy.

The only place where I have had to do anything is with the logentries table which (in our case) has written more records than is allowed by MySQL and sometimes generates MySQL errors. Deleting old entries solves the problem (I have a script that deletes entries more than 6 weeks old).

(snip)

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
[Nagios-users] FW: Memory leaks
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Andreas Ericsson
Sent: 24 January 2007 11:15

> Tobias Klausmann wrote:
>
> (big snip)
>
>> For vanilla Nagios, at least it's clear that in whatever way memory is
>> wasted, it also slows Nagios down - a possibility would be a linked
>> list that is walked and gets appended over and over. But I guess those
>> with knowledge of the inner workings of Nagios have more clue about
>> this than I do.
>
> Anyone wanting to look into it should probably take a look at the event
> scheduling queue.

I have also been experiencing memory leaks, such that the kernel has been taking drastic action by killing processes, starting with nagios and often including httpd, sshd etc. This all seems to happen at about 4:45 every morning. A reboot solves the problem and everything starts up again, but yesterday I decided to reboot using the single-processor kernel (most of our nodes are dual processor, some are dual core as well) and there is no sign of a memory leak today! Does that give anyone any clues?

Jonathan Wheeler e-Science Centre Rutherford Appleton Laboratory
[Nagios-users] FW: Memory leaks
-----Original Message-----
From: nagios-users On Behalf Of Tobias Klausmann
Sent: 23 January 2007 15:32

> Nagios 2.6 and 2.5 have memory leaks. They are not that big that within hours your machine will be swapping, but they degrade performance in other ways.

I have also had problems with memory leaks, such that the kernel (2.6.0-42.0.3) reaches the stage of killing processes to try to preserve the system. In my experience the first processes killed are nagios and nsca.

Our configuration is relatively large, with just under 16,000 services and 750 hosts. As a consequence we run two slave servers which run the checks and report to the master; on the master all checks are passive except local checks. We have only seen the out-of-memory problems on the master.

I had thought that the problems were caused by NagiosGrapher, which we were running but not using; certainly the problem was reduced by removing that process from the mix. For us the problem seems to start (according to the message log) at about 4:45 in the morning, so perhaps there is another factor as well (cron jobs?).

Any input would be welcome, though I will continue to investigate as I have time.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: log rotation - how many files are kept?
-----Original Message-----
From: nagios-users On Behalf Of Stijn Gruwier
Sent: 18 January 2007 07:37

> I'm aware that nagios is able to copy the nagios.log file to the archives directory on an hourly/daily/weekly/monthly basis. It seems that nagios keeps those files forever, since I've got 11 archived weekly logs. But the word 'rotation' suggests that at some time the old ones are removed and replaced by newer logs. Is this the case? I searched the mailing list and the documentation but I couldn't find the answer.

I also came across this problem and have written a script to organise our archived logs. At present I run it manually, but it could be a cron script. What it does is keep up to 180 logs (1 per day for 6 months), with all but the most recent 5 zipped. Both numbers are parameters at the head of the script. Unfortunately the form of the name of the archived logs is not suitable for processing with standard logrotate.

I am happy to let others use this script if a) no one has anything better, and b) someone can tell me where to submit it.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
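The retention policy described above (keep up to 180 archives, leave the newest 5 uncompressed) might be planned along these lines. This is a hedged sketch and not the script offered in the message: it assumes archive files are named `nagios-MM-DD-YYYY-HH.log`, optionally with a `.gz` suffix once compressed, and the function names `archive_time` and `plan` are made up for illustration. It only decides what to compress and delete; it does not touch any files.

```python
import re
from datetime import datetime

# Assumed archive naming: nagios-MM-DD-YYYY-HH.log, optionally followed by .gz
ARCHIVE_RE = re.compile(r"^nagios-(\d{2})-(\d{2})-(\d{4})-(\d{2})\.log(\.gz)?$")


def archive_time(name):
    """Return the datetime encoded in an archive name, or None if no match."""
    m = ARCHIVE_RE.match(name)
    if m is None:
        return None
    month, day, year, hour = (int(g) for g in m.groups()[:4])
    return datetime(year, month, day, hour)


def plan(names, keep=180, uncompressed=5):
    """Return (to_compress, to_delete) for a list of archive file names.

    The newest `uncompressed` archives are left alone, the rest of the newest
    `keep` are compressed if they are not already, and anything older is
    deleted. Names that do not look like archives are ignored.
    """
    dated = []
    for name in names:
        when = archive_time(name)
        if when is not None:
            dated.append((when, name))
    dated.sort(reverse=True)  # newest first, by parsed date rather than by name
    ordered = [name for _, name in dated]
    to_compress = [n for n in ordered[uncompressed:keep] if not n.endswith(".gz")]
    to_delete = ordered[keep:]
    return to_compress, to_delete
```

Sorting on the parsed date rather than the file name matters because a month-first name does not sort chronologically across a year boundary, which is presumably also part of why a stock logrotate configuration struggles with these files.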
[Nagios-users] FW: NDOUtils and NDO2BD
-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Jeff Sullivan
Sent: 08 December 2006 19:22

> Has anyone created an interface that uses data from the NDOUtils package? I have it all set up and logging to MySQL. I am in need of a simple interface for tier 1 support personnel.

If you are talking about an HEP Tier1 site, then we are also one. I have adapted a script that we already had to issue some MySQL queries to the NDOUtils tables. I would be happy for you to see them if that would help.

Jonathan Wheeler
Tier1A Service Team
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] FW: FW: NDOUtils and NDO2BD
-----Original Message-----
From: Jeff Sullivan [mailto:[EMAIL PROTECTED]]
Sent: 12 December 2006 14:01
To: Wheeler, JF (Jonathan)
Subject: Re: [Nagios-users] FW: NDOUtils and NDO2BD

> That would be great.. Those darn NDO2DB tables really need a dictionary.map..

These queries are extracts from a PHP script; note that the database name nagios is included because there are other queries in the script using a different database.

The first query checks the status of a host (host name in $SHORT[$n]):

    $got = mysql_query("select last_state_change from nagios.ndo_hoststatus, " .
        "nagios.ndo_objects where host_object_id=object_id and " .
        "name1='" . $SHORT[$n] . "' and nagios.ndo_hoststatus.current_state=1");
    if ($got and mysql_num_rows($got)) {
        $st = "down";
        $txt = "$node ($st - Nagios)";
    }

All it is doing is checking whether there is a record in ndo_hoststatus for the host where current_state is 1 (down); host_object_id is a field in ndo_hoststatus which matches object_id in ndo_objects; name1 is the name of the host from ndo_objects.

The second query does something similar for alarms on hosts which are not down:

    # Check for Nagios alarms if system is not down
    if (strncmp($st, "down", 4) != 0) {
        $got = mysql_query("select output from nagios.ndo_servicestatus, " .
            "nagios.ndo_objects where objecttype_id=2 and name1='" . $SHORT[$n] . "' " .
            "and current_state=2 and service_object_id=object_id");
        if ($got and mysql_num_rows($got)) {
            $st .= "_a";
        }
    }

In this query object_id, objecttype_id and name1 are fields in ndo_objects (you need both objecttype_id and name1 because there is a multi-field index built on objecttype_id and name1, in that order); current_state and service_object_id are fields in ndo_servicestatus.

The third query extracts all the alarms for a host (this is a different script, so $SHORT is not an array here):

    $got = mysql_query("select current_state, output, unix_timestamp() - " .
        "unix_timestamp(last_hard_state_change) from nagios.ndo_objects, " .
        "nagios.ndo_servicestatus where current_state!=0 and " .
        "service_object_id=object_id and name1='" . $SHORT . "'");
    if ($got and mysql_num_rows($got)) {
        print "<div class=\"sub\">Alarms for " . htmlspecialchars($NODE) . "</div>\n";
        $warns = $crits = $unkns = "";
        while ($r = mysql_fetch_row($got)) {
            # $r[0] = state, $r[1] = plugin output, $r[2] = seconds in this state
            $txt1 = "<tr valign=\"top\"><td><span class=";
            $txt2 = "</span></td> <td>$r[1]</td> <td nowrap><span class=\"time\">" .
                prettytime($r[2]) . "</span></td></tr>\n";
            switch ($r[0]) {
            case 1: $warns .= $txt1 . "\"warn\">WARNING" . $txt2; break;
            case 2: $crits .= $txt1 . "\"crit\">CRITICAL" . $txt2; break;
            case 3: $unkns .= $txt1 . "\"unkn\">UNKNOWN" . $txt2; break;
            default: $unkns .= $txt1 . "\"unkn\">UNKNOWN (bad type)" . $txt2;
            }
        }
        print "<div><table border=\"0\" cellpadding=\"0\" cellspacing=\"2\" " .
            "width=\"100%\">\n$crits$warns$unkns</table></div>\n";
    }

I hope that this helps. Please ask for more explanations if required.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
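To make the join between ndo_objects and the status tables easier to experiment with, here is a small self-contained sketch. It is not the PHP script above: it swaps MySQL for an in-memory SQLite database, uses a cut-down schema containing only the fields named in the message (plus an assumed `name2` column for the service description), and invents the sample rows and the function names. It also binds the host name as a query parameter instead of concatenating it into the SQL string, which is a design choice of this sketch.

```python
import sqlite3


def demo_db():
    """Create an in-memory database with a minimal, assumed NDO-like schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ndo_objects (object_id INTEGER, objecttype_id INTEGER,
                                  name1 TEXT, name2 TEXT);
        CREATE TABLE ndo_hoststatus (host_object_id INTEGER,
                                     current_state INTEGER,
                                     last_state_change TEXT);
        CREATE TABLE ndo_servicestatus (service_object_id INTEGER,
                                        current_state INTEGER, output TEXT);
    """)
    # Invented sample data: objecttype_id 1 = host, 2 = service.
    conn.executemany("INSERT INTO ndo_objects VALUES (?, ?, ?, ?)", [
        (1, 1, "web01", None), (2, 2, "web01", "HTTP"),
        (3, 2, "web01", "PING"), (4, 1, "db01", None),
    ])
    conn.executemany("INSERT INTO ndo_hoststatus VALUES (?, ?, ?)", [
        (1, 0, "2006-12-12 09:00:00"), (4, 1, "2006-12-12 13:30:00"),
    ])
    conn.executemany("INSERT INTO ndo_servicestatus VALUES (?, ?, ?)", [
        (2, 2, "HTTP CRITICAL"), (3, 0, "PING OK"),
    ])
    return conn


def host_is_down(conn, host):
    """True if the host has a hoststatus row with current_state = 1."""
    row = conn.execute(
        "SELECT last_state_change FROM ndo_hoststatus, ndo_objects "
        "WHERE host_object_id = object_id AND name1 = ? "
        "AND ndo_hoststatus.current_state = 1", (host,)).fetchone()
    return row is not None


def critical_outputs(conn, host):
    """Plugin output of every service on the host with current_state = 2."""
    rows = conn.execute(
        "SELECT output FROM ndo_servicestatus, ndo_objects "
        "WHERE objecttype_id = 2 AND name1 = ? AND current_state = 2 "
        "AND service_object_id = object_id", (host,)).fetchall()
    return [r[0] for r in rows]
```

With the sample rows, `host_is_down(conn, "db01")` is true and `critical_outputs(conn, "web01")` returns only the HTTP alarm.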
[Nagios-users] Viewing a subset of systems checked by Nagios
I have a Nagios installation with ~700 hosts and ~11000 services. What we would like to do is to allow some users to view (via the web interface) only a subset of the systems being monitored; in the current instance this is just one host, but there could be other instances requiring a number of hosts. Is this possible? I suspect not, but any comments would be useful. I am aware that most users who have access to the web view should not have the ability to run commands.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Archive logs
What do people do about their archive logs? I am running a configuration on Scientific Linux with nearly 600 hosts and 13000+ services which generates quite large log files. As far as I can tell the logs are moved to the archive and retained there indefinitely; my /var partition is now getting quite full. I have tried using logrotate, but the log file names do not seem to allow logrotate to work correctly. A browse through the mailing list archives does not show anyone else asking about this problem. Any suggestions?

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
Re: [Nagios-users] Question about NRPE configuration file
-----Original Message-----
From: Jason Martin [mailto:[EMAIL PROTECTED]]
Sent: 14 August 2006 17:11

> On Mon, Aug 14, 2006 at 11:25:40AM +0100, Wheeler, JF (Jonathan) wrote:
>> I am in the process of migrating from a configuration with a single Nagios server to one with a master and a slave server. As part of this migration I have updated the NRPE configuration that is installed on the clients to include both hosts as allowed_hosts for NRPE calls. However I noticed that at NRPE restart, the following messages are issued:
>>
>>     Aug 14 09:24:15 NODENAME nrpe[2592]: Unknown option specified in config file '/etc/nagios/nrpe.cfg' - Line 41
>>     Aug 14 09:24:15 NODENAME nrpe: nrpe startup succeeded
>>     Aug 14 09:24:15 NODENAME nrpe[2593]: Starting up daemon
>>     Aug 14 09:24:15 NODENAME nrpe[2593]: Warning: Daemon is configured to accept command arguments from clients!
>>
>> Line 41 of /etc/nagios/nrpe.cfg is the allowed_hosts line which reads
>>
>>     allowed_hosts=III.III.III.111, III.III.III.222
>
> Try removing the space after the comma.
>
> -Jason Martin

Thanks for the suggestion. I tried removing the space, but the message remains. However it is clear from testing the clients that both configurations work anyway! You may (all) gather that I am just starting with distributed monitoring, so I am learning as I go.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
[Nagios-users] Question about NRPE configuration file
I am in the process of migrating from a configuration with a single Nagios server to one with a master and a slave server. As part of this migration I have updated the NRPE configuration that is installed on the clients to include both hosts as allowed_hosts for NRPE calls. However I noticed that at NRPE restart, the following messages are issued:

    Aug 14 09:24:15 NODENAME nrpe[2592]: Unknown option specified in config file '/etc/nagios/nrpe.cfg' - Line 41
    Aug 14 09:24:15 NODENAME nrpe: nrpe startup succeeded
    Aug 14 09:24:15 NODENAME nrpe[2593]: Starting up daemon
    Aug 14 09:24:15 NODENAME nrpe[2593]: Warning: Daemon is configured to accept command arguments from clients!

Line 41 of /etc/nagios/nrpe.cfg is the allowed_hosts line which reads (with context):

    # ALLOWED HOST ADDRESSES
    # This is a comma-delimited list of IP address of hosts that are allowed
    # to talk to the NRPE daemon.
    #
    # NOTE: The daemon only does rudimentary checking of the client's IP
    #       address. I would highly recommend adding entries in your
    #       /etc/hosts.allow file to allow only the specified host to connect
    #       to the port you are running this daemon on.
    #
    # NOTE: This option is ignored if NRPE is running under either inetd or xinetd

    allowed_hosts=III.III.III.111, III.III.III.222

where III.III.III.111 and III.III.III.222 are the IP addresses of the Nagios servers.

Is the error message (from /var/log/messages) misleading? Or is there an error in the configuration? Any help would be appreciated.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
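When a daemon reports a problem by line number, one low-risk first step is to confirm exactly what is on that line, including any characters that are invisible in an editor. The sketch below is a generic illustration of that check, not a diagnosis of the message above and not part of NRPE; the function name `directive_at` is made up, and the simple `key=value` parsing is an assumption about the file format.

```python
def directive_at(config_text, line_number):
    """Inspect one line of a key=value style config file.

    Returns (raw_repr, key, values): raw_repr is repr() of the line, so tabs,
    trailing spaces and stray carriage returns become visible; key and values
    are None for comments, blank lines, or lines without '='.
    line_number is 1-based, matching how the log message counts lines.
    """
    lines = config_text.split("\n")
    raw = lines[line_number - 1]
    key, sep, value = raw.partition("=")
    if not sep or raw.lstrip().startswith("#"):
        return repr(raw), None, None
    return repr(raw), key.strip(), [v.strip() for v in value.split(",")]
```

For the case in this thread it could be called with the contents of /etc/nagios/nrpe.cfg and line number 41, to check that the line the daemon complains about really is the allowed_hosts entry and that it carries no hidden characters.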