RE: Starting
> Any good reference on the web interface? (The one from the site, mon.lycos.com, is dead.)

I believe the most commonly used interface is mon.cgi, maintained by Ryan Clark (http://moncgi.sourceforge.net/). Ryan also has a website (http://www.ryanclark.org). He started working on a newer version of the mon client a while ago as well.

Tim
___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
RE: Unable to pass options in config file.
Here's more information on the problem.

We can tell that mon automatically passes its standard options to the alert: a getopts call within the alert that looks for -s and a data field returns the name of the service that failed. So in the example below, the alert, instead of taking -s primary and assigning "primary" to that flag, will use the service name, DHCP. That sounds like expected behavior to me (and is confirmed by mon's man page).

If, however, we change the line below to read:

    alert DHCPMonitor.alert -q primary

nothing is registered for the -q option by getopts. We've tried -x primary as well. If we run the alert manually with that option, we do get something in the -q flag.

BTW, the _STANDARD_TIME_ reference below is a macro we're using with M4. That part works OK. =)

Any thoughts?

Thanks,
Tim

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chris Stringer
Sent: Tuesday, September 05, 2006 8:07 AM
To: mon@linux.kernel.org
Subject: FW: Unable to pass options in config file.

Good morning, all.

I was curious if anyone had any thoughts on the matter below? Again, to summarize, we are unable to pass options in the mon configuration file to alerts. I do appreciate your time on the matter.

Chris Stringer

From: Chris Stringer
Sent: Tue 8/29/2006 1:18 PM
To: mon@linux.kernel.org
Subject: Unable to pass options in config file.

Hello, all.

We have multiple locations with server pairs that check on themselves as well as each other, and also a virtual primary system. The scripts that we have written request that the server to be checked be identified via an option, such as ./DHCPMonitor.alert -q primary. Is it possible to pass options to a called alert within the mon.m4 configuration? An example of what we're seeing is as follows (from the mon.m4 file):

    service DHCP
        interval 1m
        monitor DHCPMonitor.monitor -s primary
        description Is the DHCP service running on the primary server?
        period _STANDARD_TIME_
            startupalert trap.alert mainmonitor
            alert trap.alert mainmonitor
            alert DHCPMonitor.alert -q primary
            upalert trap.alert mainmonitor

When options are passed to the monitor, everything works as intended. The problem is limited to passing options to the alert.

I'd appreciate your time and thoughts on the matter.

Chris Stringer
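One way to narrow this down is to temporarily replace the alert with a script that only records what it was called with, so the arguments mon really passes can be compared with a manual run. The following is a minimal sketch in Python, not part of mon; the function names and the log path /tmp/alert-args.log are assumptions made for illustration.

```python
import shlex
import sys
import time


def describe_invocation(argv, stdin_text=""):
    """Return a log record listing every argument in order, plus any stdin text."""
    # shlex.quote makes arguments containing spaces visible as single items.
    args = " ".join(shlex.quote(arg) for arg in argv[1:])
    return f"argv: {args}\nstdin: {stdin_text.strip()}\n"


def main(log_path="/tmp/alert-args.log"):  # assumed path, pick your own
    """Append one timestamped record per invocation; call this from the alert script."""
    record = describe_invocation(sys.argv, sys.stdin.read())
    with open(log_path, "a") as log:
        log.write(time.strftime("%Y-%m-%d %H:%M:%S\n") + record)
```

Calling main() from a script installed in place of DHCPMonitor.alert and then forcing a failure should show whether -q primary arrives at all, and where it sits relative to the options mon adds itself.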
Question on Redistribute
Hi, folks.

We're going to be running mon on over 1,000 servers (each one is monitoring things at a remote site). Each of these servers/sites is reporting in (via the redistribute command) to a corporate/main monitoring server so we can be aware of a failure out in the remote site. This corporate site will expect alerts from each server monitor check (via the traptimeout command). All this is currently working correctly.

The problem is that we're going to need to turn the monitoring frequency for several of the remote site monitors in each location way up, like checking every 10 seconds (i.e., interval 10s). That means we're going to see a huge increase in the number of traps arriving at the corporate site.

Is there some way to redistribute alerts from the remote servers only every 60 seconds, or perhaps another approach to the problem, like not using redistribute?

Remote site configuration example:

    watch BRANCH_SERVER
        service DRBD
            interval 1m
            monitor DRBDCheck.monitor -s me
            description Is my DRBD working?
            redistribute trap.alert mainmonitor
            period wd {Mon-Sun}
                alert trap.alert mainmonitor
                upalert trap.alert mainmonitor
        service My_HB
            interval 1m
            monitor HACheck.monitor -s me
            description Is my heartbeat active?
            redistribute trap.alert mainmonitor
            period wd {Mon-Sun}
                alert trap.alert mainmonitor
                upalert trap.alert mainmonitor

Corporate site configuration example:

    service DRBD
        description Is my DRBD working?
        traptimeout 2m
    service My_HB
        description Is my heartbeat active?
        traptimeout 2m

Thanks,
Tim
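The "only every 60 seconds" idea can be sketched as a small rate limiter that sits in front of whatever sends the trap. This is a hypothetical illustration in Python, not an existing mon option: the class name TrapThrottle and its methods are invented here. It forwards a status immediately when it changes, and otherwise at most once per min_gap seconds.

```python
import time


class TrapThrottle:
    """Forward a status at most once per min_gap seconds, but always on a state change."""

    def __init__(self, min_gap=60.0, clock=time.monotonic):
        self.min_gap = min_gap
        self.clock = clock
        self.last = {}  # (group, service) -> (status, time last forwarded)

    def should_forward(self, group, service, status):
        now = self.clock()
        key = (group, service)
        previous = self.last.get(key)
        if previous is not None:
            last_status, last_sent = previous
            if status == last_status and now - last_sent < self.min_gap:
                return False  # same state reported too recently: suppress
        self.last[key] = (status, now)
        return True
```

With min_gap at 60 seconds, the corporate side would still see at least one update inside the 2m traptimeout window from the configuration above, while a failure or recovery is passed on without waiting.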
RE: Question on Redistribute
Thanks for all the interest and quick replies here. A bit more about what we're trying to do...

Remote site stats:
- There are going to be roughly 1200 remote locations (which could grow).
- Each location has a pair of servers in it.
- We have 2 servers in the remote site as we're doing a high-availability failover model.
- We're running mon on both servers in the remote site, as each server is set up to watch key services on itself and the other server. If a failure in a critical service/software component is detected, and the server is the primary one, mon will send an alert to that server to kill itself and the second one will kick in.
- Each server's mon instance is watching about 25 things, checking each one every minute.
- Every time any check is run on either mon server, we're redistributing it back to the corporate monitoring server.
- We need to tune the failover model to detect things quicker, so about 15 of the traps need to be run every 10 seconds.
- (15 traps * 6 per minute) + 10 other traps per minute = 100 traps per server per minute; * 2 servers per location = 200 traps per minute per location sent to the corporate monitoring server.

Corporate site stats:
- Failures are not expected very often - maybe 1-2 server problems per day.
- In the case of a large network outage, we might see up to a couple of hundred servers go away. That would actually reduce load on the server as we won't be seeing traps from those locations.
- There shouldn't be a lot of load generated from the corporate server needing to do something about alerts. If a failure is detected for any monitor for any site, we're going to send an email once every 4 hours.

Load thoughts:
- From a network perspective, load would be somewhat significant: (200 bytes per trap * 200 traps/minute * 1200 sites * 8 bits/byte / 60 seconds) = 6.4 Mbps.
- The server would be responding to (200 traps/min * 1200 sites / 60 seconds) = 4000 traps/second. That sounds like a whole lot to me.

Corporate requirements:
- Be aware that some service failed out in the remote site.
- Be aware of a network outage (i.e., no traps received within x minutes).
- A centralized view of uptime, what's currently down, and outage history.
- Outage history has to be down to the individual service/server level.

Thanks,
Tim

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Trocki
Sent: Thursday, August 24, 2006 9:15 AM
To: David Nolan
Cc: mon@linux.kernel.org
Subject: Re: Question on Redistribute

On Thu, 24 Aug 2006, David Nolan wrote:

> --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr [EMAIL PROTECTED] wrote:
>
> > The problem is that we're going to need to turn the monitoring period for several of the remote site monitors in each location way up - like checking every 10 seconds (i.e., interval 10s). That means we're going to see a huge increase in the number of traps we're seeing at the corporate site.
>
> Or we could implement a redistributeevery option, similar to alertevery. That wouldn't be too hard, but would take a little work.

yeah the issue here is the processing and communication overhead of dealing with the traps sent remotely. it would make sense to batch up the 10s traps from the remote systems and send them out in a bundle say, once every minute, and that would, you know, save you 6x the processing overhead on the remote mon server, or at least give you a way to control the processing overhead to suit your needs.

this use case might mean that it would make sense to move the remote trap stuff into the mon server itself, rather than implement it with the trap alert. the trap alert is a nice simple abstraction that works well for the simpler cases, and an elegant way of extending the functionality of mon without having to change the server code, but at the cost of efficiency.

you would really want the ability to batch up only the trap transmissions rather than all alerts. for example, schedule a trap queue flush every minute performed by the mon server rather than in the trap alert.

then this brings up the issue of trap processing overhead on the rx end. i wonder if the behavior would be acceptable by just processing the trap receptions serially, the way it is done now, or if it would require a change in processing method to scale it up efficiently.

this probably requires much more thought and a better understanding of the usage scenario.
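The "trap queue flush every minute" idea can be sketched roughly as follows. This is only an illustration in Python of the batching behavior described above, not code from mon; the class TrapQueue, its method names, and the tuple layout are assumptions. Status updates are queued as they arrive and handed over as one bundle once the flush interval has passed.

```python
class TrapQueue:
    """Collect status updates and release them as one bundle per flush interval."""

    def __init__(self, flush_interval=60.0):
        self.flush_interval = flush_interval
        self.queue = []        # pending (time, group, service, status) records
        self.last_flush = 0.0  # time of the previous flush

    def enqueue(self, now, group, service, status):
        self.queue.append((now, group, service, status))

    def flush_if_due(self, now):
        """Return the pending bundle if the interval has elapsed, else None."""
        if not self.queue or now - self.last_flush < self.flush_interval:
            return None
        bundle, self.queue = self.queue, []
        self.last_flush = now
        return bundle


q = TrapQueue(flush_interval=60.0)
for t in range(0, 60, 10):  # six 10-second checks
    q.enqueue(t, "BRANCH_SERVER", "DRBD", "ok")
bundle = q.flush_if_due(60)
print(len(bundle))  # six updates leave in a single transmission
```

Keeping every record in the bundle, rather than only the latest status per service, means a failure that recovers within the minute still reaches the receiving side; whether that matters depends on the usage scenario.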
RE: Question on Redistribute
Very true - still way too much traffic. Maybe do something with the dependencies options stuff:
- Would it be possible to just send one "everything is OK" trap for a new overall check? Maybe a new monitor script that queries itself to see if there are any existing problems and will alert based off that?
- I'd also continue to send an alert per service if a new service problem is detected.
- On the corporate server, I'd set up only one service per store entry that would have the traptimeout monitor (to watch for the network outages), but still have a service entry for each server to catch any of the specific service outage traps that would be received.

That would drop us down to 2400 traps/minute (64 kb/sec) plus any outage traps. Make sense?

Thanks,
Tim

-----Original Message-----
From: David Nolan [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 24, 2006 2:40 PM
To: Tim Carr; Jim Trocki
Cc: mon@linux.kernel.org
Subject: RE: Question on Redistribute

--On Thursday, August 24, 2006 10:18:56 -0500 Tim Carr [EMAIL PROTECTED] wrote:

> 4000 traps/second. That sounds like a whole lot to me.

Holy cr**, that's a lot of traps. Wow, the interesting ways that mon gets deployed continue to amaze me...

Even if you were only sending one trap per minute per service you would have:

    25 services * 1 trap/minute * 2 servers * 1200 sites = 60,000 traps/minute, or 1000 traps per second.

That's still a *lot* of traps. Doing your bandwidth math shows that it's still 1.6 Mbps of trap traffic.

I think you might want to make your mon setup more structured, with intermediate collection points that pass status changes only to your final collection point.

-David
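The load figures quoted in this thread can be rechecked with a few lines of Python. This is just the arithmetic from the messages above written out; the 200 bytes per trap, 2 servers per site, and 1200 sites are the estimates given in the thread, and the function names are made up for the sketch.

```python
def traps_per_second(traps_per_minute_per_server, servers_per_site=2, sites=1200):
    """Aggregate trap rate arriving at the corporate server."""
    return traps_per_minute_per_server * servers_per_site * sites / 60


def megabits_per_second(traps_per_sec, bytes_per_trap=200):
    """Network load, using the thread's estimate of 200 bytes per trap."""
    return traps_per_sec * bytes_per_trap * 8 / 1_000_000


scenarios = {
    "15 checks every 10s + 10 every minute": 15 * 6 + 10,  # 100 traps/min/server
    "25 checks every minute": 25,
    "one summary trap per minute": 1,
}

for name, rate in scenarios.items():
    tps = traps_per_second(rate)
    print(f"{name}: {tps:.0f} traps/s, {megabits_per_second(tps):.3f} Mbps")
```

The three scenarios correspond to the 4000 traps/second (6.4 Mbps), 1000 traps/second (1.6 Mbps), and 2400 traps/minute (64 kb/sec) figures discussed above.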
RE: Getting 20 instead of spaces
Thanks, folks - that cleared up the problem.

Tim

-----Original Message-----
From: Jim Trocki [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 27, 2006 8:05 AM
To: Tim Carr
Cc: Ed Ravin; mon@linux.kernel.org
Subject: RE: Getting 20 instead of spaces

On Wed, 26 Jul 2006, Tim Carr wrote:

> Is there a later version somewhere? In looking through the mailing lists, there are several proposed patches after that date, but I don't see anything that relates to this problem.

use this one:

ftp://ftp.kernel.org/pub/software/admin/mon/devel/mon-client-1.0.0pre2.tar.gz
Getting 20 instead of spaces
Has anyone run into a problem where mon, in all its outputs (monshow, monfailures, mon.cgi), returns anything generated by a config file (e.g., the description) or by a monitor's output (e.g., tcp.monitor) with a 20 instead of a space?

For example, if my config file has a description in it like:

    description This is my description

then it shows up via monshow and mon.cgi (dumped directly to a file, not through a browser) as:

    This20is20my20description

But if I write the config file as:

    description This\ is\ my\ description

then I get my normal description:

    This is my description

That trick doesn't work for the alerts file, though. Text with escaped spaces still has the 20s in it.

Any thoughts? I'm running mon v 1.22 (7/13/2006) and monshow v 1.1 (6/29/2006).

Thanks,
Tim
Nagios and MON integration
Has anyone got something like this going...

- We're going to have an environment where we've got lots of servers out in the field (i.e., at over a thousand locations, two servers per location) that are running mon to watch both themselves and their local standby server. If something goes awry out in the field, the remote servers are going to alert the centralized monitoring console back at the corporate office.
- The corporate office server's purpose is to make sure that everyone out in the field is working correctly, track uptime statistics, and alert the IT staff if/when something goes wrong out in the field.

I'm thinking about setting up the corporate monitoring server to just expect mon traps every x minutes from each of the servers, and if it doesn't get a timely response, alert on that as well.

What I'm trying to get running back at corporate is:

- A good-looking display. The mon.cgi web page is perfect for each of the branch servers (we're going to be running it out on each of the remote servers), but I don't think it's going to scale well back at corporate trying to show a picture of all the checks it is expecting from 2000+ different servers.
- A way to perform checks on demand, without trying to be a full-fledged polling system. Basically, expect reports in from the servers and allow an ad-hoc request.
- Good roll-up views (i.e., geography based or company division based).
- Statistics available for each branch, the environment as a whole, and hopefully a division/geography based breakout as well.

Does anyone have anything like Nagios tied into mon servers for that kind of centralized console? Or is there perhaps another way of doing this? Suggestions are welcome.

Thanks,
Tim
RE: Problem getting traps to work correctly
That did the trick. Thanks for your help.

Thanks,
Tim

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Nolan
Sent: Thursday, July 13, 2006 5:03 PM
To: Tim Carr
Cc: mon@linux.kernel.org
Subject: Re: Problem getting traps to work correctly

On 7/13/06, Tim Carr [EMAIL PROTECTED] wrote:

> Here's a bit more information on it. I've got the slave server configured for multiple services, each of them using the redistribute option:
>
>     redistribute alert trap.alert mainmonitor

If that's an exact quote, you've got the option wrong. It's just:

    redistribute trap.alert mainmonitor

> On the master server, once I've reset it, none of those servers will ever go green/good in mon.cgi - they stay in blue/unchecked status.

That sounds like you've still got the period-based trap configuration in place (which would match the above typo). If that's not true, and the line above was a typo in the email rather than in the configuration, then maybe the redistribute code in CVS is broken. Before I go investigate that possibility, please confirm whether the line above was an exact quote from your config file.

> In the slave server, the history file shows this for an outage event:
>
>     alert Store13-2 DRBD_Status 1152819579 /opt/mon/alert.d/trap.alert (mainmonitor) DRBD_Not_Running
>     upalert Store13-2 DRBD_Status 1152819594 /opt/mon/alert.d/trap.alert (mainmonitor) DRBD_Not_Running

This also indicates to me that your old alert/upalert configuration is still in place. Redistribute does not generate history entries, because doing so would bloat the history file on the slave server.

-David
RE: Problem getting traps to work correctly
Gotcha. I threw that in, and it seems to work correctly, except I can't tell if it is or not. I'm watching the log file, and it shows alerts being sent on an up/down event, but I'm not seeing alerts every 15s showing up when things are working correctly. Is that expected behavior?

Thanks,
Tim

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Nolan
Sent: Thursday, July 13, 2006 7:06 AM
To: mon@linux.kernel.org
Subject: Re: Problem getting traps to work correctly

--On Wednesday, July 12, 2006 16:30:08 -0500 Tim Carr [EMAIL PROTECTED] wrote:

> When mainmonitor gets one of the traps, I'll see this in /var/log/messages:
>
>     Jul 11 16:20:04 monitor mon[2017]: trap received for undefined service type default/DRBD_Status
>
> ...but nothing will actually get kicked off and no mail is sent. Also, the mon.cgi program (running on mainmonitor) will stay in the blue/unchecked status.

Looks like you've found a logic bug. The code in handle_trap that sets the group and service to default/default has an error which causes it to set the group but never the service. I just committed a fix for this to CVS.

>     Jul 11 16:24:49 monitor mon[2017]: trap trap 0 from grp=default svc=DRBD_Status, sta=0

In this case you're not getting an alert because the status bit of the trap is set to 0, which is the OK status. It looks like the remote.alert in CVS was never updated when mon started using that field. trap.alert was rewritten... Since these two alerts serve the same purpose, I'm going to remove remote.alert from CVS.

> Any thoughts as to what's going on here? I'm trying to get this working:
> - An alert getting kicked off by the mainmonitor system when it receives a trap; and
> - The mon.cgi program on mainmonitor showing an alert status once it's received that trap.

BTW, you might want to use the 'redistribute' config parameter for your traps. That will cause all status updates to propagate to your main mon server, so you can see when the last test occurred at all times.

From the current mon manpage (in CVS):

    redistribute alert [arg...]
        A service may have one redistribute option, which is a special
        form of an alert definition. This alert will be called on every
        service status update, even sequential success status updates.
        This can be used to integrate Mon with another monitoring system,
        or to link together multiple Mon servers via an alert script that
        generates Mon traps.

-David
RE: Problem getting traps to work correctly
Here's a bit more information on it. I've got the slave server configured for multiple services, each of them using the redistribute option:

    redistribute alert trap.alert mainmonitor

On the master server, once I've reset it, none of those servers will ever go green/good in mon.cgi - they stay in blue/unchecked status. If I force a failure on one of those services, the master server will show that service going red. Once I re-enable the service, it will then show green. But the other services for that slave server will never change from the unchecked state on the master server's mon.cgi. Also, if I then restart the mon process on the master server, all items will go back to blue and will not change again unless I force a failure.

Also, I'm logging all output to a file on both the master and slave server via these settings:

    logdir = /var/log/mon
    historicfile = /var/log/mon/history

In the slave server, the history file shows this for an outage event:

    alert Store13-2 DRBD_Status 1152819579 /opt/mon/alert.d/trap.alert (mainmonitor) DRBD_Not_Running
    upalert Store13-2 DRBD_Status 1152819594 /opt/mon/alert.d/trap.alert (mainmonitor) DRBD_Not_Running

On the master server, it will only log this for that same event:

    trapalert Store13-2 DRBD_Status 1152819578 /opt/mon/alert.d/mail.alert ([EMAIL PROTECTED]) DRBD_Not_Running

Thanks,
Tim

-----Original Message-----
From: David Nolan [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 13, 2006 2:32 PM
To: Tim Carr; mon@linux.kernel.org
Subject: RE: Problem getting traps to work correctly

--On Thursday, July 13, 2006 14:20:38 -0500 Tim Carr [EMAIL PROTECTED] wrote:

> Gotcha. I threw that in, and it seems to work correctly, except I can't tell if it is or not. I'm watching the log file, and it shows alerts being sent on an up/down event, but I'm not seeing alerts every 15s showing up when things are working correctly. Is that expected behavior?
>
> Thanks,
> Tim

I refer to the server that sends the traps as a slave server, and the server collecting the traps as the master server.

Your master server should receive a trap on every status update on the slave server, i.e. a trap every 15s in your example. The master should only alert based on its own alert behavior. This makes receiving updates via traps almost functionally equivalent to other monitor tests that you run on your master server.

If that's not the behavior you're seeing, please let me know.

-David
Problem getting traps to work correctly
Greetings, all.

I'm running on the latest files from CVS. I'm trying to set up a test environment where one server (named branch-1) will alert a master server (mainmonitor) in the event of a problem. I can get the branch-1 server to recognize a problem and send an alert to the master server, and the master server will show that trap being received (in /var/log/messages), but I can't get the mainmonitor server to actually then kick off an alert action.

Here's my pertinent configuration for the lower server (from mon.cf):

    watch OtherServer
        service DRBD_Status
            interval 15s
            monitor DRBDCheck.monitor -s you
            description Is\ DRBD\ working\ there?
            period wd {Mon-Sun}
                alert remote.alert -H mainmonitor

I can see that trap being sent via that server's /var/log/messages:

    Jul 12 16:17:05 branch-1 mon[2649]: failure for OtherServer DRBD_Status 1152739025 DRBD_Not_Running
    Jul 12 16:17:05 branch-1 mon[2649]: calling alert remote.alert for OtherServer/DRBD_Status (/opt/mon/alert.d/remote.alert,-H mainmonitor) DRBD_Not_Running

On the mainmonitor server, my mon.cf config is:

    watch default
        service default
            description Default trap service
            period wd {Mon-Sun}
                alert mail.alert [EMAIL PROTECTED]

I've got my auth.cf set to receive traps from anyone (* * *). When mainmonitor gets one of the traps, I'll see this in /var/log/messages:

    Jul 11 16:20:04 monitor mon[2017]: trap received for undefined service type default/DRBD_Status

...but nothing will actually get kicked off and no mail is sent. Also, the mon.cgi program (running on mainmonitor) will stay in the blue/unchecked status.

Interestingly enough, if I change my mainmonitor server's mon.cf to be:

    watch default
        service DRBD_Status
            description Default trap service
            period wd {Mon-Sun}
                alert mail.alert [EMAIL PROTECTED]

I'll get this in /var/log/messages:

    Jul 11 16:24:49 monitor mon[2017]: trap trap 0 from grp=default svc=DRBD_Status, sta=0

...but there is still nothing kicked off and no mail gets sent. But... the mon.cgi program then shows the default service group to be in the green/good status.

Any thoughts as to what's going on here? I'm trying to get this working:
- An alert getting kicked off by the mainmonitor system when it receives a trap; and
- The mon.cgi program on mainmonitor showing an alert status once it's received that trap.

Thanks,
Tim