Re: Putting a server into maintenance

2008-03-21 Thread Augie Schwer
Ben,

Did you also change the list_state code in the Mon::Client code? I see
where in the mon.cgi you check $scheduler_status[2] but I don't see
anywhere where you set that or get list_state to return that.

--Augie

On Tue, Mar 18, 2008 at 5:20 PM, Ben Ragg [EMAIL PROTECTED] wrote:
 Augie Schwer wrote:
   The hold feature sounds pretty interesting. At my site I find people
   putting things into disabled and then forgetting all about them,
   which is dangerous and annoying.

   Care to share that code?

  Always happy to share :)

  Version we're using is...mon,v 1.22 2006/07/13 12:03:39 vitroth Exp $
  ...it's been run through perl tidy to clean it up a little, so the line
  numbers won't match up (hence I won't even bother ;)

  Attached our copy of the main mon program and mon.cgi


  Changes to mon...

  Under the global definitions, add a new array...

  my @HOLD_ALERTS;    # don't send alerts: 0) end, 1) start, 2) by, 3) reason


  In the main monitoring loop for ( ; ; ) { near the top add a check for
  an expired hold timer...

  for ( ; ; ) {
      debug( 1, $i . ( $STOPPED ? " (stopped)" : "" ) . "\n" );
      $i++;
      $tm = time;

      # Check if the Hold Timer has ended
      @HOLD_ALERTS = ()
        if ( defined( $HOLD_ALERTS[0] ) && $HOLD_ALERTS[0] < $tm );


  In sub doalert add a bit on to the if ($STOPPED)

   if ($STOPPED) {
       syslog( "notice",
           "ignoring alert for $group,$service because the mon scheduler is stopped" );
       return;
   } elsif (@HOLD_ALERTS) {
       syslog( "notice",
           "ignoring alert for $group,$service because alerts are held until "
           . localtime( $HOLD_ALERTS[0] ) );
       return;
   }


  In sub client_command add hold to the list of acceptable commands

   if (
       $l !~ /^(dump|login|disable|enable|quit|list|set|get|setview|getview|
               stop|start|loadstate|savestate|reset|clear|checkauth|
               reload|term|test|servertime|ack|version|protid|hold)(\s+(.*))?$/ix
     )
   {
       sock_write( $fh, "520 invalid command\n" );
       return;
   }


  and also within sub client_command, define what the command hold
  actually does...

       #
       # hold
       #
   } elsif ( $cmd eq "hold" ) {
       my ( $period, $reason ) = split( /\s+/, $args, 2 );
       $period = 180 if ( $period > 180 );    # cap the hold at 180 minutes
       $HOLD_ALERTS[1] = time;
       $HOLD_ALERTS[0] = $HOLD_ALERTS[1] + ( $period * 60 );
       $HOLD_ALERTS[2] = $clients{$cl}->{user};
       $HOLD_ALERTS[3] = $reason;
       sock_write( $fh,
           "220 Alerts on hold until " . localtime( $HOLD_ALERTS[0] ) . "\n" );

  ...that should be all the changes in the mon file itself.
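
The hold logic described above can be sketched as a small stand-alone Perl script. This is only an illustration under assumptions, not the code from the attached mon: the helper names set_hold, expire_hold and alerts_held are hypothetical, the state is kept in an array reference instead of the global @HOLD_ALERTS, and the current time is passed in as an argument so the behaviour can be checked without waiting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-alone model of the hold state. The layout follows the
# patch's @HOLD_ALERTS array: 0) end, 1) start, 2) by, 3) reason.
my $MAX_HOLD_MINUTES = 180;

# Start a hold, capping the period at 180 minutes as the patch does.
sub set_hold {
    my ( $minutes, $user, $reason, $now ) = @_;
    $minutes = $MAX_HOLD_MINUTES if $minutes > $MAX_HOLD_MINUTES;
    return [ $now + $minutes * 60, $now, $user, $reason ];
}

# Mirror of the main-loop check: drop the hold once its end time has passed.
sub expire_hold {
    my ( $hold, $now ) = @_;
    return [] if @$hold && $hold->[0] < $now;
    return $hold;
}

# Mirror of the doalert check: alerts are suppressed while a hold is set.
sub alerts_held {
    my ($hold) = @_;
    return @$hold ? 1 : 0;
}

my $t0   = 1_205_800_000;    # arbitrary epoch seconds, for illustration only
my $hold = set_hold( 500, "ben", "router upgrade", $t0 );
printf "held for %d minutes\n", ( $hold->[0] - $hold->[1] ) / 60;

$hold = expire_hold( $hold, $t0 + 60 );
print "after 1 minute: ", ( alerts_held($hold) ? "held" : "normal" ), "\n";

$hold = expire_hold( $hold, $t0 + 180 * 60 + 1 );
print "after 180 minutes: ", ( alerts_held($hold) ? "held" : "normal" ), "\n";
```

In this model a hold of 0 minutes ends at its own start time, so the expiry check clears it as soon as the clock has moved past that second.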



  Also need to update the mon.cgi to make use of it...

  Under sub query_opstatus add another scheduler_status...

  if ( $scheduler_status[0] == 0 ) {
      $webpage->print("The scheduler on <b>$monhost:$monport</b> is currently <font color=$greenlight_color>running</font>.");
  } elsif ( $scheduler_status[0] == 1 ) {
      my $pretty_sched_down_time = strftime "%H:%M:%S %d-%b-%Y", localtime( $scheduler_status[1] );
      $webpage->print("<br>The scheduler has been <font color=$redlight_color>stopped</font> since $pretty_sched_down_time.<br>\n");
  } elsif ( $scheduler_status[0] == 2 ) {
      my $pretty_sched_down_time = strftime "%H:%M:%S %d-%b-%Y", localtime( $scheduler_status[1] );
      my $pretty_sched_up_time = strftime "%H:%M:%S %d-%b-%Y", localtime( $scheduler_status[4] );

      $webpage->print("<br>The scheduler is running, but alerts have been <font color=$yellowlight_color>held</font> since $pretty_sched_down_time.<br>\n");
      $webpage->print("Mon will return to normal operation at $pretty_sched_up_time<br>");
      $webpage->print("This hold was set by $scheduler_status[2] due to \"$scheduler_status[3]\"<p>");
  } else {    # value is undef, scheduler cannot be contacted (or auth failure)
      $webpage->print("<br><font color=$redlight_color>The scheduler cannot be contacted at this time.</font><br>\n");
  }


  And a new subroutine...

  sub mon_hold {

      my ($args) = @_;

      my $retval;
      my $conn = &mon_connect;
      return 0 if $conn == 0;

      $retval = $c->hold( $args, "${ackcomment}" );

      return $retval;
  }

  And under the main programs long list of elsif commands...

  elsif ( $command eq "mon_hold" ) {
      setup_page("Alerts Hold");
      mon_hold($args);
      sleep 1;
      query_opstatus("summary");
  }


  And sub moncgi_custom_print_bar

  my $fubar = <<EOF;
  <tr>
   <td colspan="3" align="center">
    <form method="post" action="$url" enctype="application/x-www-form-urlencoded">
     <input name="command" value="mon_hold" type="hidden">
     Hold alerts for <input name="args" size="2" type="text"> minutes, reason:
     <input name="ackcomment" size="20" type="text">
     <input value="Hold" type="submit">
    </form>
   </td>
   <td colspan="2" align="center">
    <form method="post" action="$url" enctype="application/x-www-form-urlencoded">
     <input name="command" value="mon_hold" type="hidden">
     <input name="args" value="0" type="hidden">
     <input value="Remove Hold" type="submit">
    </form>
   </td>
  </tr>

  EOF

  $webpage->print($fubar);


  Any questions feel free to sing out.

  Cheers,
  Ben







Re: Putting a server into maintenance

2008-03-18 Thread Stephane Bortzmeyer
On Wed, Mar 12, 2008 at 12:07:38PM -0400,
 Ed Ravin [EMAIL PROTECTED] wrote 
 a message of 23 lines which said:

 In most cases, our engineers log into Mon and use the "host disable"
 or "service disable" to stop monitoring the stuff that's about to go
 down, and re-enable them when the maintenance is over.
 
 Sometimes, we just ACK whatever's broken when Mon starts alarming.

The good thing about doing nothing when there is a planned
maintenance is that it allows you to test that monitoring indeed
works.

Several times I have had the bad experience of an undetected failure
because the monitoring had a hidden problem.

___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon


Re: Putting a server into maintenance

2008-03-18 Thread Chris Hoogendyk


Stephane Bortzmeyer wrote:
 On Wed, Mar 12, 2008 at 12:07:38PM -0400,
  Ed Ravin [EMAIL PROTECTED] wrote 
  a message of 23 lines which said:
   
 In most cases, our engineers log into Mon and use the "host disable"
 or "service disable" to stop monitoring the stuff that's about to go
 down, and re-enable them when the maintenance is over.

 Sometimes, we just ACK whatever's broken when Mon starts alarming.
 

 The good thing about doing nothing when there is a planned
 maintenance is that it allows you to test that monitoring indeed
 works.

 Several times I have had the bad experience of an undetected failure
 because the monitoring had a hidden problem.

mon is just so quiet and minimal when things are running alright. 8-)

sometimes I feel a need to go look, or even to kick it, to reassure 
myself that it is alright itself.

at some point I plan to implement a backup server in another department 
and have the backups backup each other and the mons mon each other. Then 
I could maybe have mon issue a "Good morning, Sysadmins!" with a summary 
of things that have been checked and are running alright. It would come 
on right before NPR's Morning Edition (in the US -- for other locations, 
substitute appropriate national/regional/local morning news). If it 
could query the coffee maker as well, then we'd be all set. ;-)

---

Chris Hoogendyk

-
   O__   Systems Administrator
  c/ /'_ --- Biology & Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~ - University of Massachusetts, Amherst 

[EMAIL PROTECTED]

--- 

Erdös 4




Re: Putting a server into maintenance

2008-03-18 Thread Ben Ragg
Over time we've been slowly modifying the code a little and adding our 
own features.

Two we've found really useful... "Ack All" to ack everything in the 
current view and a "hold" feature... so we can stop alerts going out for 
up to 180 mins (but still see what's failed). The hold feature records 
who put Mon into hold and their reason. At the end of the 180 mins (or 
a shorter timeframe, if one was specified) Mon automatically comes out of 
hold and the alerts automatically resume, so someone can't accidentally 
leave it on hold like we could when we stopped the scheduler (which also 
had the disadvantage of not knowing what was down).

Stephane Bortzmeyer wrote:
 On Wed, Mar 12, 2008 at 12:07:38PM -0400,
  Ed Ravin [EMAIL PROTECTED] wrote 
  a message of 23 lines which said:

   
 In most cases, our engineers log into Mon and use the "host disable"
 or "service disable" to stop monitoring the stuff that's about to go
 down, and re-enable them when the maintenance is over.

 Sometimes, we just ACK whatever's broken when Mon starts alarming.
 

 The good thing about doing nothing when there is a planned
 maintenance is that it allows you to test that monitoring indeed
 works.

 Several times I have had the bad experience of an undetected failure
 because the monitoring had a hidden problem.

   


-- 
Ben Ragg - Internode - Network Operations
150 Grenfell Street, Adelaide, SA, 5000
Phone: 13NODE Web: http://www.on.net



Re: Putting a server into maintenance

2008-03-18 Thread Augie Schwer
Darn auto-complete.


-- Forwarded message --
From: Augie Schwer [EMAIL PROTECTED]
Date: Tue, Mar 18, 2008 at 4:44 PM
Subject: Re: Putting a server into maintenance
To: [EMAIL PROTECTED]


On Tue, Mar 18, 2008 at 12:22 PM, Chris Hoogendyk
 [EMAIL PROTECTED] wrote:
   at some point I plan to implement a backup server in another department
   and have the backups backup each other and the mons mon each other. Then
   I could maybe have mon issue a "Good morning, Sysadmins!" with a summary
   of things that have been checked and are running alright. It would come
   on right before NPR's Morning Edition (in the US -- for other locations,
   substitute appropriate national/regional/local morning news). If it
   could query the coffee maker as well, then we'd be all set. ;-)

 You should definitely have a second box monitoring mon; if your main
 mon box dies and you don't know about it in the middle of the night,
 then you also won't know about all the other stuff that may have just
 failed too.




 --
 Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
 Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072



-- 
Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072



Re: Putting a server into maintenance

2008-03-18 Thread Augie Schwer
On Tue, Mar 18, 2008 at 2:06 PM, Ben Ragg [EMAIL PROTECTED] wrote:
  Two we've found really useful... "Ack All" to ack everything in the
  current view and a "hold" feature... so we can stop alerts going out for
  up to 180 mins (but still see what's failed). The hold feature records
  who put Mon into hold and their reason. At the end of the 180 mins (or
  a shorter timeframe, if one was specified) Mon automatically comes out of
  hold and the alerts automatically resume, so someone can't accidentally
  leave it on hold like we could when we stopped the scheduler (which also
  had the disadvantage of not knowing what was down).

The hold feature sounds pretty interesting. At my site I find people
putting things into disabled and then forgetting all about them,
which is dangerous and annoying.

Care to share that code?


-- 
Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072



Re: Putting a server into maintenance

2008-03-12 Thread Ed Ravin
On Wed, Mar 12, 2008 at 09:44:59AM -0600, Michael Osburn wrote:
  How are most of you managing planned system 
  downtime?

In most cases, our engineers log into Mon and use the "host disable"
or "service disable" to stop monitoring the stuff that's about to go
down, and re-enable them when the maintenance is over.

Sometimes, we just ACK whatever's broken when Mon starts alarming.

If I had a really big planned outage I would comment out big chunks of
the config file and restore it after the window.

 Or am I missing a feature to have mon check its configuration 
 file and reload if it changes?

You are - look up reset Mon in the CGI or the API.  You can also
send Mon a HUP signal to make it reload its config.
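
Sending the HUP signal mentioned above could look roughly like the following Perl sketch. It assumes mon's process ID is available in a pidfile; the path /var/run/mon.pid and the helper name send_hup are hypothetical, so adjust them to your installation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a PID from a pidfile and send that process SIGHUP.
# Returns the PID on success, or undef if the file is unreadable,
# does not contain a PID, or the signal could not be delivered.
sub send_hup {
    my ($pidfile) = @_;
    open( my $fh, '<', $pidfile ) or return undef;
    my $line = <$fh>;
    close($fh);
    return undef unless defined $line && $line =~ /^\s*(\d+)\s*$/;
    my $pid = $1;
    return kill( 'HUP', $pid ) ? $pid : undef;
}

# Hypothetical default path; pass the real pidfile as the first argument.
my $pidfile = shift(@ARGV) // '/var/run/mon.pid';
my $pid     = send_hup($pidfile);
if ( defined $pid ) {
    print "sent HUP to process $pid\n";
}
else {
    print "could not signal the process named in $pidfile\n";
}
```

Run it as the user that owns the mon process (or as root), otherwise the signal will be refused.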
