Re: Putting a server into maintance
Ben,

Did you also change the list_state code in Mon::Client? I can see where mon.cgi checks $scheduler_status[2], but I don't see anywhere that you set it or make list_state return it.

--Augie

On Tue, Mar 18, 2008 at 5:20 PM, Ben Ragg [EMAIL PROTECTED] wrote:
> Augie Schwer wrote:
> > The hold feature sounds pretty interesting. At my site I find people
> > putting things into disabled and then forgetting all about them, which
> > is dangerous and annoying. Care to share that code?
>
> Always happy to share :)
>
> Version we're using is...
>
>     mon,v 1.22 2006/07/13 12:03:39 vitroth Exp $
>
> ...it's been run through perl tidy to clean it up a little, so the line
> numbers won't match up (hence I won't even bother ;)
>
> Attached is our copy of the main mon program and mon.cgi.
>
> Changes to mon...
>
> Under the global definitions, add a new array...
>
>     my @HOLD_ALERTS;    # don't send alerts, 0) end, 1) start, 2) by, 3) reason
>
> In the main monitoring loop, for ( ; ; ) {, add a check near the top for
> an expired hold timer...
>
>     for ( ; ; ) {
>         debug( 1, "$i" . ( $STOPPED ? " (stopped)" : "" ) . "\n" );
>         $i++;
>         $tm = time;
>
>         # Check if the Hold Timer has ended
>         @HOLD_ALERTS = ()
>           if ( defined( $HOLD_ALERTS[0] ) && $HOLD_ALERTS[0] < $tm );
>
> In sub doalert, add a branch to the if ($STOPPED) check...
>
>     if ($STOPPED) {
>         syslog( "notice",
>             "ignoring alert for $group,$service because the mon scheduler is stopped" );
>         return;
>     }
>     elsif (@HOLD_ALERTS) {
>         syslog( "notice",
>             "ignoring alert for $group,$service because alerts are held until "
>               . localtime( $HOLD_ALERTS[0] ) );
>         return;
>     }
>
> In sub client_command, add hold to the list of acceptable commands...
>
>     if ( $l !~ /^(dump|login|disable|enable|quit|list|set|get|setview|getview|
>                 stop|start|loadstate|savestate|reset|clear|checkauth|
>                 reload|term|test|servertime|ack|version|protid|hold)(\s+(.*))?$/ix )
>     {
>         sock_write( $fh, "520 invalid command\n" );
>         return;
>     }
>
> ...and also within sub client_command, define what the hold command
> actually does...
>
>     #
>     # hold
>     #
>     } elsif ( $cmd eq "hold" ) {
>         my ( $period, $reason ) = split( /\s+/, $args, 2 );
>         $period = 180 if ( $period > 180 );
>         $HOLD_ALERTS[1] = time;
>         $HOLD_ALERTS[0] = $HOLD_ALERTS[1] + ( $period * 60 );
>         $HOLD_ALERTS[2] = $clients{$cl}->{user};
>         $HOLD_ALERTS[3] = $reason;
>         sock_write( $fh,
>             "220 Alerts on hold until " . localtime( $HOLD_ALERTS[0] ) . "\n" );
>
> ...that should be all the changes in the mon file itself.
>
> You also need to update mon.cgi to make use of it...
>
> Under sub query_opstatus, add another scheduler_status branch...
>
>     if ( $scheduler_status[0] == 0 ) {
>         $webpage->print(
>             "The scheduler on <b>$monhost:$monport</b> is currently <font color=$greenlight_color>running</font>." );
>     }
>     elsif ( $scheduler_status[0] == 1 ) {
>         my $pretty_sched_down_time = strftime "%H:%M:%S %d-%b-%Y",
>           localtime( $scheduler_status[1] );
>         $webpage->print(
>             "<br>The scheduler has been <font color=$redlight_color>stopped</font> since $pretty_sched_down_time.<br>\n" );
>     }
>     elsif ( $scheduler_status[0] == 2 ) {
>         my $pretty_sched_down_time = strftime "%H:%M:%S %d-%b-%Y",
>           localtime( $scheduler_status[1] );
>         my $pretty_sched_up_time = strftime "%H:%M:%S %d-%b-%Y",
>           localtime( $scheduler_status[4] );
>         $webpage->print(
>             "<br>The scheduler is running, but alerts have been <font color=$yellowlight_color>held</font> since $pretty_sched_down_time.<br>\n" );
>         $webpage->print(
>             "Mon will return to normal operation at $pretty_sched_up_time<br>" );
>         $webpage->print(
>             "This hold was set by $scheduler_status[2] due to \"$scheduler_status[3]\"<p>" );
>     }
>     else {    # value is undef, scheduler cannot be contacted (or auth failure)
>         $webpage->print(
>             "<br><font color=$redlight_color>The scheduler cannot be contacted at this time.</font><br>\n" );
>     }
>
> And a new subroutine...
>
>     sub mon_hold {
>         my ($args) = @_;
>         my $retval;
>         my $conn = &mon_connect;
>         return 0 if $conn == 0;
>         $retval = $c->hold( $args, ${ackcomment} );
>         return $retval;
>     }
>
> And under the main program's long list of elsif commands...
>
>     elsif ( $command eq "mon_hold" ) {
>         setup_page("Alerts Hold");
>         mon_hold($args);
>         sleep 1;
>         query_opstatus("summary");
>     }
>
> And in sub moncgi_custom_print_bar...
>
>     my $fubar = <<EOF;
>     <tr>
>       <td colspan=3 align=center>
>         <form method="post" action="$url" enctype="application/x-www-form-urlencoded">
>           <input name="command" value="mon_hold" type="hidden">
>           Hold alerts for <input name="args" size="2" type="text"> minutes,
>           reason: <input name="ackcomment" size="20" type="text">
>           <input value="Hold" type="submit">
>         </form>
>       </td>
>       <td colspan=2 align=center>
>         <form method="post" action="$url" enctype="application/x-www-form-urlencoded">
>           <input name="command" value="mon_hold" type="hidden">
>           <input name="args" value="0" type="hidden">
>           <input value="Remove Hold" type="submit">
>         </form>
>       </td>
>     </tr>
>     EOF
>     $webpage->print($fubar);
>
> Any questions, feel free to sing out.
>
> Cheers,
> Ben
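The hold behaviour described in this thread (a capped timer that records who set it and why, and that expires on its own) can be tried outside mon with a small standalone sketch. This is not the attached patch: the sub names set_hold, expire_hold and alerts_held are hypothetical, and only the 180-minute cap and the slot layout of @HOLD_ALERTS are taken from the message above.

```perl
#!/usr/bin/perl
# Standalone sketch of the hold timer logic; sub names are hypothetical.
use strict;
use warnings;

my $MAX_HOLD_MINUTES = 180;    # cap quoted in the thread
my @HOLD_ALERTS;               # 0) end, 1) start, 2) by, 3) reason

# Start a hold of $period minutes (capped), recording who set it and why.
# $now is optional and exists only to make the sketch easy to test.
sub set_hold {
    my ( $period, $user, $reason, $now ) = @_;
    $now = time unless defined $now;
    $period = $MAX_HOLD_MINUTES if $period > $MAX_HOLD_MINUTES;
    @HOLD_ALERTS = ( $now + $period * 60, $now, $user, $reason );
    return $HOLD_ALERTS[0];
}

# Meant to run once per pass of a main loop: clear the hold once its
# end time has passed.
sub expire_hold {
    my ($now) = @_;
    $now = time unless defined $now;
    @HOLD_ALERTS = () if defined $HOLD_ALERTS[0] && $HOLD_ALERTS[0] < $now;
    return;
}

# True while alerts should be suppressed.
sub alerts_held {
    return @HOLD_ALERTS ? 1 : 0;
}

# Usage: ask for 240 minutes at t=1000; the cap reduces it to 180.
set_hold( 240, "ben", "router upgrade", 1_000 );
print "held until $HOLD_ALERTS[0]\n";              # held until 11800
expire_hold(11_800);
print alerts_held() ? "held\n" : "released\n";     # held
expire_hold(11_801);
print alerts_held() ? "held\n" : "released\n";     # released
```

Because the comparison is a strict less-than, a hold of 0 minutes (what the "Remove Hold" button sends) stays in place until the clock moves past the second in which it was set.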
Re: Putting a server into maintance
On Wed, Mar 12, 2008 at 12:07:38PM -0400, Ed Ravin [EMAIL PROTECTED] wrote a message of 23 lines which said:
> In most cases, our engineers log into Mon and use "host disable" or
> "service disable" to stop monitoring the stuff that's about to go down,
> and re-enable them when the maintenance is over. Sometimes, we just
> ACK whatever's broken when Mon starts alarming.

The good thing about doing nothing during a planned maintenance is that it lets you test that the monitoring actually works. Several times I have had the bad experience of an undetected failure because the monitoring had a hidden problem.

___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
Re: Putting a server into maintance
Stephane Bortzmeyer wrote:
> On Wed, Mar 12, 2008 at 12:07:38PM -0400, Ed Ravin [EMAIL PROTECTED] wrote a message of 23 lines which said:
> > In most cases, our engineers log into Mon and use "host disable" or
> > "service disable" to stop monitoring the stuff that's about to go down,
> > and re-enable them when the maintenance is over. Sometimes, we just
> > ACK whatever's broken when Mon starts alarming.
>
> The good thing about doing nothing when there is a planned maintenance
> is that it allows you to test that monitoring indeed works. I had
> several times the bad experience of an undetected failure because the
> monitoring had a hidden problem.

mon is just so quiet and minimal when things are running alright. 8-)

Sometimes I feel a need to go look, or even to kick it, to reassure myself that it is alright itself.

At some point I plan to implement a backup server in another department and have the backups back up each other and the mons mon each other. Then I could maybe have mon issue a "Good morning, Sysadmins!" with a summary of things that have been checked and are running alright. It would come on right before NPR's Morning Edition (in the US; for other locations, substitute the appropriate national/regional/local morning news).

If it could query the coffee maker as well, then we'd be all set. ;-)

---
Chris Hoogendyk
-
   O__  ---- Systems Administrator
  c/ /'_ --- Biology & Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst

[EMAIL PROTECTED]

---
Erdös 4
Re: Putting a server into maintance
Over time we've been slowly modifying the code a little and adding our own features.

Two we've found really useful are "Ack All", which acks everything in the current view, and a hold feature, which lets us stop alerts going out for up to 180 minutes while still seeing what has failed. The hold feature records who put Mon into hold and their reason. At the end of the 180 minutes (or a shorter period, if one was specified) Mon automatically comes out of hold and alerts resume, so nobody can accidentally leave it on hold the way we could when we stopped the scheduler (which also had the disadvantage that we couldn't see what was down).

Stephane Bortzmeyer wrote:
> On Wed, Mar 12, 2008 at 12:07:38PM -0400, Ed Ravin [EMAIL PROTECTED] wrote a message of 23 lines which said:
> > In most cases, our engineers log into Mon and use "host disable" or
> > "service disable" to stop monitoring the stuff that's about to go down,
> > and re-enable them when the maintenance is over. Sometimes, we just
> > ACK whatever's broken when Mon starts alarming.
>
> The good thing about doing nothing when there is a planned maintenance
> is that it allows you to test that monitoring indeed works. I had
> several times the bad experience of an undetected failure because the
> monitoring had a hidden problem.

--
Ben Ragg - Internode - Network Operations
150 Grenfell Street, Adelaide, SA, 5000
Phone: 13NODE
Web: http://www.on.net
Re: Putting a server into maintance
Darn auto-complete.

-- Forwarded message --
From: Augie Schwer [EMAIL PROTECTED]
Date: Tue, Mar 18, 2008 at 4:44 PM
Subject: Re: Putting a server into maintance
To: [EMAIL PROTECTED]

On Tue, Mar 18, 2008 at 12:22 PM, Chris Hoogendyk [EMAIL PROTECTED] wrote:
> at some point I plan to implement a backup server in another department
> and have the backups backup each other and the mons mon each other.
> Then I could maybe have mon issue a "Good morning, Sysadmins!" with a
> summary of things that have been checked and are running alright. It
> would come on right before NPR's Morning Edition (in the US -- for
> other locations, substitute appropriate national/regional/local morning
> news).
>
> If it could query the coffee maker as well, then we'd be all set. ;-)

You should definitely have a second box monitoring mon; if your main mon box dies in the middle of the night and you don't know about it, then you also won't know about all the other stuff that may have just failed too.

--
Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072

--
Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
Re: Putting a server into maintance
On Tue, Mar 18, 2008 at 2:06 PM, Ben Ragg [EMAIL PROTECTED] wrote:
> Two we've found really useful... Ack All to ack everything in the
> current view and a hold feature... so we can stop alerts going out for
> up to 180 mins (but still see what's failed). The hold feature includes
> who put Mon in to hold and their reason. At the end of the 180 mins (or
> timeframe specified less than that) Mon automatically comes out of hold
> and the alerts automatically resume, so someone can't accidentally
> leave it on hold like we could when we stopped the scheduler (which had
> the disadvantage of not knowing what was down).

The hold feature sounds pretty interesting. At my site I find people putting things into disabled and then forgetting all about them, which is dangerous and annoying. Care to share that code?

--
Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
Re: Putting a server into maintance
On Wed, Mar 12, 2008 at 09:44:59AM -0600, Michael Osburn wrote:
> How are most of you managing planned system downtime?

In most cases, our engineers log into Mon and use "host disable" or "service disable" to stop monitoring the stuff that's about to go down, and re-enable them when the maintenance is over. Sometimes, we just ACK whatever's broken when Mon starts alarming.

> If I had a really big planned outage I would comment out big chunks of
> the config file and restore it after the window. Or am I missing a
> feature to have mon check its configuration file and reload if it
> changes?

You are. Look up the reset command in the Mon CGI or the API. You can also send Mon a HUP signal to make it reload its config.
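For the HUP route mentioned above, a minimal sketch in Perl might look like the following. It assumes the daemon's process ID is available in a PID file whose path you pass on the command line; the path and the helper name send_hup are assumptions for illustration, not something taken from the mon distribution.

```perl
#!/usr/bin/perl
# Sketch: ask a running daemon to reload its config by sending SIGHUP.
# The helper name and the PID-file approach are assumptions.
use strict;
use warnings;

# Read a PID from the given file and send that process a HUP signal.
# Returns the number of processes signalled (0 if the signal failed).
sub send_hup {
    my ($pidfile) = @_;
    open( my $fh, '<', $pidfile ) or die "cannot open $pidfile: $!";
    my $line = <$fh>;
    close($fh);
    die "no PID found in $pidfile"
      unless defined $line && $line =~ /^\s*(\d+)/;
    return kill( 'HUP', $1 );
}

# Usage: perl send_hup.pl /path/to/pidfile   (path is site-specific)
if ( my $pidfile = shift @ARGV ) {
    print send_hup($pidfile) ? "HUP sent\n" : "could not signal: $!\n";
}
```

Sending the signal normally requires running as the same user as the daemon, or as root.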