RE: Starting

2006-09-08 Thread Tim Carr
 Any good reference on the web interface? (the
 one from the site, mon.lycos.com is dead).

I believe the most commonly used interface is mon.cgi, maintained by
Ryan Clark, available at http://moncgi.sourceforge.net/

Ryan also has a website at http://www.ryanclark.org.  He started working
on a newer version of the mon client a while ago as well.

Tim

___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon


RE: Unable to pass options in config file.

2006-09-05 Thread Tim Carr

Here's more information on the problem. We can tell that mon automatically
passes its standard options to the alert: a getopts call within the alert that
looks for -s and a data field returns the name of the service that failed. So
in the example below, the alert, instead of taking -s primary and assigning
primary to that flag, will use the service name of DHCP. That sounds like
expected behavior to me (and is confirmed by mon's man page entry). If,
however, we change the line below to read:

 alert DHCPMonitor.alert -q primary

then nothing is registered for the -q option by getopts. We've tried
-x primary as well. If, however, we run the alert manually with that
option, we do get something in the -q flag.
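
To illustrate the behavior described above, here is a minimal Python sketch of option parsing inside an alert script. It assumes mon invokes the alert with its own options (taken here to be -s service, -g group, -h hosts, -t time) alongside any arguments from the config line; the exact set and order are assumptions, the argument values are made up, and parse_alert_args is a hypothetical helper, not part of mon. The point is only that a site-specific flag such as -q is picked up when the alert's own option string declares it.

```python
import getopt

# Option string: options assumed to be supplied by mon (-s, -g, -h, -t)
# plus the site-specific -q option. A trailing colon means "takes a value".
OPTSPEC = "s:g:h:t:q:"

def parse_alert_args(argv):
    """Return a dict mapping option letters to values, e.g. {"s": "DHCP"}."""
    opts, _rest = getopt.getopt(argv, OPTSPEC)
    return {flag.lstrip("-"): value for flag, value in opts}

# Hypothetical argument list as the alert might receive it.
example = ["-s", "DHCP", "-g", "BRANCH_SERVER", "-h", "server1",
           "-t", "1157000000", "-q", "primary"]
parsed = parse_alert_args(example)
print(parsed["s"], parsed["q"])
```

If -q were missing from the option string, getopt would raise an error instead of silently dropping the flag, which makes this kind of problem easier to spot than a parser that ignores unknown options.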



BTW, the _STANDARD_TIME_ reference below is a macro we're using with M4. That
part works OK. =)

Any thoughts?

Thanks,
Tim

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chris Stringer
Sent: Tuesday, September 05, 2006 8:07 AM
To: mon@linux.kernel.org
Subject: FW: Unable to pass options in config file.

Good morning, all.

I was curious if anyone had any thoughts on the matter below? Again, to
summarize, we are unable to pass options in the mon configuration file to
alerts. I do appreciate your time on the matter.

Chris Stringer

From: Chris Stringer
Sent: Tue 8/29/2006 1:18 PM
To: mon@linux.kernel.org
Subject: Unable to pass options in config file.

Hello, all.

We have multiple locations with server pairs that check on themselves as well
as each other, and also a virtual primary system. The scripts that we have
written request that the server to be checked be identified via an option,
such as ./DHCPMonitor.alert -q primary. Is it possible to pass options to a
called alert within the mon.m4 configuration?

An example of what we're seeing is as follows:

(From the mon.m4 file)

service DHCP
 interval 1m
 monitor DHCPMonitor.monitor -s primary
 description Is the DHCP service running on the primary server?
 period _STANDARD_TIME_
  startupalert trap.alert mainmonitor
  alert trap.alert mainmonitor
  alert DHCPMonitor.alert -q primary
  upalert trap.alert mainmonitor

When options are passed to the monitor, everything works as intended. This
behavior is limited to passing options to the alert.

I'd appreciate your time and thoughts on the matter.

Chris Stringer



Question on Redistribute

2006-08-24 Thread Tim Carr

Hi, folks.

We're going to be running mon on over 1,000 servers (each one is monitoring
things at a remote site). Each of these servers/sites reports in (via the
redistribute command) to a corporate/main monitoring server so we can be
aware of a failure out in the remote site. This corporate site will expect
alerts from each server monitor check (via the traptimeout command). All this
is currently working correctly.

The problem is that we're going to need to turn the monitoring period for
several of the remote site monitors in each location way up, like checking
every 10 seconds (i.e., interval 10s). That means we're going to see a huge
increase in the number of traps we're seeing at the corporate site.

Is there some way to only redistribute alerts from the remote servers every 60
seconds, or perhaps another approach to the problem, like not using
redistribute?

Remote site configuration example:

watch BRANCH_SERVER
 service DRBD
  interval 1m
  monitor DRBDCheck.monitor -s me
  description Is my DRBD working?
  redistribute trap.alert mainmonitor
  period wd {Mon-Sun}
   alert trap.alert mainmonitor
   upalert trap.alert mainmonitor
 service My_HB
  interval 1m
  monitor HACheck.monitor -s me
  description Is my heartbeat active?
  redistribute trap.alert mainmonitor
  period wd {Mon-Sun}
   alert trap.alert mainmonitor
   upalert trap.alert mainmonitor

Corporate site configuration example:

 service DRBD
  description Is my DRBD working?
  traptimeout 2m
 service My_HB
  description Is my heartbeat active?
  traptimeout 2m

Thanks,
Tim



RE: Question on Redistribute

2006-08-24 Thread Tim Carr
Thanks for all the interest and quick replies here.  A bit more about
what we're trying to do...

Remote site stats:
- There are going to be roughly 1200 remote locations (which could
grow).
- Each location has a pair of servers in it.
- We have 2 servers in the remote site as we're doing a
high-availability failover model.
- We're running mon on both servers in the remote site as each server is
set up to watch key services on itself and the other server.  If a
failure in a critical service/software component is detected, and the
server is the primary one, mon will send an alert to that server to kill
itself and the second one will kick in.
- Each server's mon instance is watching about 25 things, checking each
one every minute.
- Every time any check is run on either mon server, we're redistributing
it back to the corporate monitoring server.
- We need to tune the failover model to detect things quicker, so about
15 of the checks need to be run every 10 seconds.
- (15 checks * 6 traps/minute) + 10 other traps/minute = 100 traps per
server per minute; 100 * 2 servers per location = 200 traps per minute
per location sent to the corporate monitoring server

Corporate site stats:
- Failures are not expected very often - maybe 1-2 server problems per
day
- In the case of a large network outage, we might see up to a couple of
hundred servers go away.  That would actually reduce load on the server
as we won't be seeing traps from those locations.
- There shouldn't be a lot of load generated from the corporate server
needing to do something about alerts.  If a failure is detected for any
monitor for any site, we're going to send an email once every 4 hours.  
 
Load thoughts:
- From a network perspective, load would be somewhat significant: (200
bytes per trap * 200 traps/minute * 1200 sites * 8 bits/byte / 60
seconds) = 6.4 Mbps.
- The server would be responding to (200 traps/min * 1200 sites / 60
seconds) = 4000 traps/second.  That sounds like a whole lot to me.
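
The load estimate above can be reproduced step by step. The following Python snippet is only a worked version of the arithmetic in this message; every input is a figure quoted here, and the 200 bytes per trap is the poster's own estimate rather than a measured value.

```python
# All inputs are the figures quoted in the message.
FAST_CHECKS = 15             # checks that would run every 10 seconds
RUNS_PER_MINUTE = 6          # 60 s / 10 s
OTHER_TRAPS_PER_MINUTE = 10  # remaining checks, once per minute
SERVERS_PER_SITE = 2
SITES = 1200
BYTES_PER_TRAP = 200         # estimated trap size

traps_per_server = FAST_CHECKS * RUNS_PER_MINUTE + OTHER_TRAPS_PER_MINUTE
traps_per_site = traps_per_server * SERVERS_PER_SITE
traps_per_second = traps_per_site * SITES / 60
megabits_per_second = traps_per_second * BYTES_PER_TRAP * 8 / 1_000_000

print(traps_per_server, traps_per_site, traps_per_second, megabits_per_second)
```

Changing RUNS_PER_MINUTE or BYTES_PER_TRAP shows how strongly the corporate-side load depends on the check interval and the trap size.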

Corporate requirements:
 - Be aware when some service has failed out in the remote site
 - Be aware of a network outage (i.e., no traps received w/in x
minutes)
 - A centralized view of uptime, what's currently down, and outage history
 - Outage history has to be down to the individual service / server level



Thanks,
Tim

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Jim Trocki
Sent: Thursday, August 24, 2006 9:15 AM
To: David Nolan
Cc: mon@linux.kernel.org
Subject: Re: Question on Redistribute

On Thu, 24 Aug 2006, David Nolan wrote:

 --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr [EMAIL PROTECTED]

 The problem is that we're going to need to turn the monitoring period
 for several of the remote site monitors in each location way up - like
 checking every 10 seconds (i.e., interval 10s).  That means we're going
 to see a huge increase in the number of traps we're seeing at the
 corporate site.

 Or we could implement a redistributeevery option, similar to alertevery.
 That wouldn't be too hard, but would take a little work.

yeah the issue here is the processing and communication overhead of dealing
with the traps sent remotely. it would make sense to batch up the 10s traps
from the remote systems and send them out in a bundle say, once every minute,
and that would, you know, save you 6x the processing overhead on the remote
mon server, or at least give you a way to control the processing overhead to
suit your needs.

this use case might mean that it would make sense to move the remote trap
stuff into the mon server itself, rather than implement it with the trap
alert. the trap alert is a nice simple abstraction that works well for the
simpler cases, and an elegant way of extending the functionality of mon
without having to change the server code, but at the cost of efficiency. you
would really want the ability to batch up only the trap transmissions rather
than all alerts. for example, schedule a trap queue flush every minute
performed by the mon server rather than in the trap alert.
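
The batching idea described above could look roughly like the following Python sketch. It is not mon code (mon itself is written in Perl), and the TrapBatcher class, its parameters, and the trap dictionaries are all hypothetical names chosen for illustration; it only shows the queue-and-flush pattern, with an injectable clock so the behavior can be checked without waiting.

```python
import time

class TrapBatcher:
    """Collect trap payloads and release them in one bundle per interval."""

    def __init__(self, flush_interval=60.0, clock=time.monotonic):
        self.flush_interval = flush_interval
        self.clock = clock
        self.queue = []
        self.last_flush = clock()

    def add(self, trap):
        """Queue a trap; return a batch if the interval has elapsed, else None."""
        self.queue.append(trap)
        if self.clock() - self.last_flush >= self.flush_interval:
            return self.flush()
        return None

    def flush(self):
        """Return all queued traps and restart the interval timer."""
        batch, self.queue = self.queue, []
        self.last_flush = self.clock()
        return batch

# Simulated run: one check result every 10 seconds, flushed once a minute.
now = [0.0]
batcher = TrapBatcher(flush_interval=60.0, clock=lambda: now[0])
sent = []
for second in range(0, 70, 10):
    now[0] = float(second)
    batch = batcher.add({"service": "DRBD", "status": "ok", "time": second})
    if batch is not None:
        sent.append(batch)
print(len(sent), len(sent[0]))
```

In this simulation seven results are queued and one bundle goes out, so the receiving side handles a single transmission per minute instead of one per check.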

then this brings up the issue of trap processing overhead on the rx end. i
wonder if the behavior would be acceptable by just processing the trap
receptions serially, the way it is done now, or if it would require a change
in processing method to scale it up efficiently.

this probably requires much more thought and a better understanding of the
usage scenario.



RE: Question on Redistribute

2006-08-24 Thread Tim Carr
Very true - still way too much traffic.

Maybe do something with the dependencies options stuff:

- Would it be possible to just send one "everything is ok" trap for a
new overall check?  Maybe a new monitor script that queries itself to
see if there are any existing problems and will alert based off that?
- I'd also continue to send an alert per service if a new service
problem is detected.
- On the corporate server, I'd set up only one service per store
entry that would have the traptimeout monitor (to watch for the
network outages) but still have a service entry for each server to catch
any of the specific service outage traps that would be received.

That would drop us down to 2400 traps/minute (64 kbps) + any outage
traps.
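
The core of such an overall check might be sketched as follows in Python. It assumes the usual monitor convention of exit status 0 for success and non-zero for failure, with a one-line summary; overall_status is a hypothetical helper, and how the list of failing services is obtained from the local mon server is deliberately left out.

```python
def overall_status(failures):
    """Map a list of failing service names to (exit_code, summary_line)."""
    if not failures:
        return 0, "all services ok"
    return 1, " ".join(sorted(failures))

# Hypothetical input: in a real monitor this list would come from querying
# the local mon server, which is not shown here.
code, summary = overall_status(["My_HB", "DRBD"])
print(code, summary)
```

A wrapper script would print the summary and exit with the returned code, so the corporate server sees one heartbeat-style result per remote server instead of one per service.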

Make sense?

Thanks,
Tim
-Original Message-
From: David Nolan [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 24, 2006 2:40 PM
To: Tim Carr; Jim Trocki
Cc: mon@linux.kernel.org
Subject: RE: Question on Redistribute



--On Thursday, August 24, 2006 10:18:56 -0500 Tim Carr [EMAIL PROTECTED]
wrote:

 4000 traps/second.  That sounds like a whole lot to me.

Holy cr** that's a lot of traps.  Wow, the interesting ways that mon gets
deployed continue to amaze me...

Even if you were only sending one trap per minute per service you would have:
25 services * 1 trap/minute * 2 servers * 1200 sites = 60,000 traps/minute,
or 1000 traps per second.

That's still a *lot* of traps.  Doing your bandwidth math shows that it's
still 1.6 Mbps of trap traffic.

I think you might want to make your mon setup more structured, with
intermediate collection points that pass status changes only to your final
collection point.
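
As a rough illustration of a collection point that passes on status changes only, here is a small Python sketch. The ChangeOnlyForwarder class, the host and service names, and the "ok"/"fail" status strings are hypothetical; this is not how mon implements anything, just the filtering idea.

```python
class ChangeOnlyForwarder:
    """Remember the last status per (host, service) and pass on only changes."""

    def __init__(self):
        self.last_status = {}

    def handle(self, host, service, status):
        """Return True if this update should be forwarded upstream."""
        key = (host, service)
        changed = self.last_status.get(key) != status
        self.last_status[key] = status
        return changed

collector = ChangeOnlyForwarder()
updates = [("Store13-2", "DRBD", "ok"), ("Store13-2", "DRBD", "ok"),
           ("Store13-2", "DRBD", "fail"), ("Store13-2", "DRBD", "fail"),
           ("Store13-2", "DRBD", "ok")]
forwarded = [u for u in updates if collector.handle(*u)]
print(forwarded)
```

Note that the final collection point then no longer sees regular updates, so detecting a dead site would have to happen at the intermediate level or through a separate, infrequent heartbeat.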

-David






RE: Getting 20 instead of spaces

2006-07-27 Thread Tim Carr
Thanks, folks - that cleared up the problem.

Tim
-Original Message-
From: Jim Trocki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 27, 2006 8:05 AM
To: Tim Carr
Cc: Ed Ravin; mon@linux.kernel.org
Subject: RE: Getting 20 instead of spaces

On Wed, 26 Jul 2006, Tim Carr wrote:

 Is there a later version somewhere?  In looking through the mailing
 lists, there are several proposed patches after that date, but I don't
 see anything that relates to this problem.

use this one:

ftp://ftp.kernel.org/pub/software/admin/mon/devel/mon-client-1.0.0pre2.tar.gz



Getting 20 instead of spaces

2006-07-26 Thread Tim Carr

Has anyone run into a problem where mon, in all its outputs (monshow,
monfailures, mon.cgi), is returning anything generated by a config file
(e.g., the description) or a monitoring output (e.g., tcp.monitor) with a
20 instead of a space?

For example, if my config file has a description in it like:

 description This is my description

then it shows up via monshow and mon.cgi (dumped directly to a file, not
through a browser) as:

 This20is20my20description

But if I create the config file to be:

 description This\ is\ my\ description

then I get my normal description:

 This is my description

That trick doesn't work for the alerts file, though. Escaped spaces still
come out with the 20s in them.

Any thoughts? I'm running mon v 1.22 (7/13/2006) and monshow v 1.1
(6/29/2006).

Thanks,
Tim



Nagios and MON integration

2006-07-20 Thread Tim Carr

Has anyone got something like this going...

- We're going to have an environment where we've got lots of servers out in
the field (i.e., at over a thousand locations, two servers per location)
that are running mon to watch both themselves and their local standby
server. If something goes awry out in the field, the remote servers are
going to alert the centralized monitoring console back at the corporate
office.
- The corporate office server's purpose is to make sure that everyone out in
the field is working correctly, track uptime statistics, and alert the IT
staff if/when something goes wrong out in the field. I'm thinking about
setting up the corporate monitoring server to just expect mon traps every x
minutes from each of the servers, and if it doesn't get a timely response,
alert on that as well.

What I'm trying to get running back at corporate is:

- A good looking display. The mon.cgi web page is perfect for each of the
branch servers (we're going to be running it out on each of the remote
servers), but I don't think it's going to scale well back at corporate
trying to show a picture of all the checks it is expecting from 2000+
different servers.
- A way to perform checks upon demand, but not try to be a full-fledged
polling system. Basically, expect reports in from the servers and allow an
ad-hoc request.
- Good roll-up views (i.e., geography based or company division based).
- Statistics available for each branch, the environment as a whole, and
hopefully a division/geography based breakout as well.

Does anyone have anything like Nagios tied into mon servers for that kind of
centralized console? Or is there perhaps another way of doing this?

Suggestions are welcome.

Thanks,
Tim



RE: Problem getting traps to work correctly

2006-07-14 Thread Tim Carr
That did the trick.  Thanks for your help.

Thanks,
Tim


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Nolan
Sent: Thursday, July 13, 2006 5:03 PM
To: Tim Carr
Cc: mon@linux.kernel.org
Subject: Re: Problem getting traps to work correctly

On 7/13/06, Tim Carr [EMAIL PROTECTED] wrote:
 Here's a bit more information on it.  I've got the slave server
 configured for multiple services, each of them using the redistribute
 option:

    redistribute alert trap.alert mainmonitor

If that's an exact quote, you've got the option wrong.  It's just
"redistribute trap.alert mainmonitor".

 On the master server, once I've reset it, none of those servers will
 ever go green/good in mon.cgi - they stay in blue/unchecked status.

That sounds like you've still got the period-based trap configuration
in place.  (Which would match with the above typo.)

If that's not true, and the line above was a typo in the email, not the
configuration, then maybe the redistribute code in CVS is broken.
Before I go investigate that possibility, please confirm whether the
line above was an exact quote from your config file.

 In the slave server, the history file shows this for an outage event:

 alert Store13-2 DRBD_Status 1152819579 /opt/mon/alert.d/trap.alert
 (mainmonitor) DRBD_Not_Running
 upalert Store13-2 DRBD_Status 1152819594 /opt/mon/alert.d/trap.alert
 (mainmonitor) DRBD_Not_Running

This also indicates to me that your old alert/upalert configuration is
still in place: redistribute does not generate history entries, because
doing so would bloat the history file on the slave server.

-David



RE: Problem getting traps to work correctly

2006-07-13 Thread Tim Carr
Gotcha.  I threw that in, and it seems to work correctly, except I can't
tell if it is or not.  I'm watching the log file, and it shows alerts
being sent on an up/down event, but I'm not seeing alerts every 15s
showing up when things are working correctly.  Is that expected
behavior?

Thanks,
Tim

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of David Nolan
Sent: Thursday, July 13, 2006 7:06 AM
To: mon@linux.kernel.org
Subject: Re: Problem getting traps to work correctly



--On Wednesday, July 12, 2006 16:30:08 -0500 Tim Carr
[EMAIL PROTECTED] 
wrote:

 When mainmonitor gets one of the traps, I'll see this in
 /var/log/messages:

 Jul 11 16:20:04 monitor mon[2017]: trap received for undefined service
 type default/DRBD_Status

 ...but nothing will actually get kicked off and no mail is sent.  Also,
 the mon.cgi program (running on mainmonitor) will stay in the
 blue/unchecked status.

Looks like you've found a logic bug.  The code in handle_trap that sets the
group and service to default/default has an error which causes it to set the
group but never the service.  I just committed a fix for this to CVS.
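
The intended fallback can be pictured with the following Python sketch. It is an illustration of the logic as described in this reply, not mon's actual handle_trap code (which is Perl); resolve_trap_target and the watches mapping are hypothetical.

```python
def resolve_trap_target(group, service, watches):
    """Pick the (group, service) entry that should handle an incoming trap.

    `watches` maps a group name to the set of service names defined for it.
    """
    if service in watches.get(group, set()):
        return group, service
    # Intended fallback: replace BOTH parts. Setting only the group would
    # yield something like ("default", "DRBD_Status"), which is undefined.
    if "default" in watches.get("default", set()):
        return "default", "default"
    return None

watches = {"default": {"default"}}
print(resolve_trap_target("OtherServer", "DRBD_Status", watches))
```

With only the group replaced, the lookup would target default/DRBD_Status, which matches the "undefined service type default/DRBD_Status" message quoted above.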


 Jul 11 16:24:49 monitor mon[2017]: trap trap 0 from  grp=default
 svc=DRBD_Status, sta=0


In this case you're not getting an alert because the status bit of the trap
is set to 0, which is the OK status.  It looks like the remote.alert in CVS
was never updated when Mon started using that field.  trap.alert was
rewritten...  Since these two alerts serve the same purpose, I'm going to
remove remote.alert from CVS.



 Any thoughts as to what's going on here?  I'm trying to get this
 working:

 - An alert getting kicked off by the mainmonitor's system when it
 receives a trap; and

 - The mon.cgi program on mainmonitor showing an alert status once
 it's received that trap.



BTW, you might want to use the 'redistribute' config parameter for your
traps; that will cause all status updates to propagate to your main mon
server.  That way you can see when the last test occurred at all times.
From the current Mon manpage (in CVS):

   redistribute alert [arg...]

A service may have one redistribute option, which is a special form of an
alert definition.  This alert will be called on every service status
update, even sequential success status updates.  This can be used to
integrate Mon with another monitoring system, or to link together multiple
Mon servers via an alert script that generates Mon traps.
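
The difference between redistribute and the regular alert hooks can be pictured with a small Python simulation. This is a deliberately simplified model, not mon's scheduler: it ignores periods and options such as alertevery, and hooks_for and the "ok"/"fail" strings are hypothetical names used only for this sketch.

```python
def hooks_for(results):
    """Yield (result, hooks) pairs for a sequence of "ok"/"fail" check results."""
    previous = None
    for result in results:
        hooks = ["redistribute"]        # runs on every status update
        if result == "fail":
            hooks.append("alert")       # failure result
        elif previous == "fail":
            hooks.append("upalert")     # first success after a failure
        previous = result
        yield result, hooks

for result, hooks in hooks_for(["ok", "ok", "fail", "ok"]):
    print(result, hooks)
```

In this model consecutive successes trigger only redistribute, which is why it is the hook that keeps a central server's "last checked" information current.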


-David




RE: Problem getting traps to work correctly

2006-07-13 Thread Tim Carr
Here's a bit more information on it.  I've got the slave server
configured for multiple services, each of them using the redistribute
option:

   redistribute alert trap.alert mainmonitor

On the master server, once I've reset it, none of those servers will
ever go green/good in mon.cgi - they stay in blue/unchecked status.

If I force a failure on one of those services, the master server will
show that service going red.  Once I re-enable the service, it will then
show green.  But the other services for that slave server will never
change from unchecked state on the master server's mon.cgi.  Also, if
I then re-set the mon process on the master server, all items will go
back to blue and will not change again unless I force a failure.

Also, I'm logging all output to a file on both the master and slave
server via these commands:

logdir = /var/log/mon
historicfile = /var/log/mon/history

In the slave server, the history file shows this for an outage event:

alert Store13-2 DRBD_Status 1152819579 /opt/mon/alert.d/trap.alert
(mainmonitor) DRBD_Not_Running
upalert Store13-2 DRBD_Status 1152819594 /opt/mon/alert.d/trap.alert
(mainmonitor) DRBD_Not_Running

On the master server, it will only log this for that same event:

trapalert Store13-2 DRBD_Status 1152819578 /opt/mon/alert.d/mail.alert
([EMAIL PROTECTED]) DRBD_Not_Running

Thanks,
Tim


-Original Message-
From: David Nolan [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 13, 2006 2:32 PM
To: Tim Carr; mon@linux.kernel.org
Subject: RE: Problem getting traps to work correctly



--On Thursday, July 13, 2006 14:20:38 -0500 Tim Carr
[EMAIL PROTECTED] 
wrote:

 Gotcha.  I threw that in, and it seems to work correctly, except I can't
 tell if it is or not.  I'm watching the log file, and it shows alerts
 being sent on an up/down event, but I'm not seeing alerts every 15s
 showing up when things are working correctly.  Is that expected
 behavior?

 Thanks,
 Tim


I refer to the server that sends the traps as a slave server, and the
server collecting the traps as the master server.  Your master server
should receive a trap on every status update on the slave server, i.e. a
trap every 15s in your example.  The master should only alert based on its
alert behavior.  This makes receiving updates via traps almost functionally
equivalent to other monitor tests that you run on your master server.

If that's not the behavior you're seeing, please let me know.

-David




Problem getting traps to work correctly

2006-07-12 Thread Tim Carr

Greetings, all. I'm running on the latest files from CVS. I'm trying to set
up a test environment where one server (named branch-1) will alert a master
server (mainmonitor) in the event of a problem. I can get the branch-1 server
to recognize a problem and send an alert to the master server, and the master
server will show that trap being received (in /var/log/messages), but I can't
get the mainmonitor server to actually then kick off an alert action.

Here's my pertinent configuration for the lower server (from mon.cf):

watch OtherServer
 service DRBD_Status
  interval 15s
  monitor DRBDCheck.monitor -s you
  description Is\ DRBD\ working\ there?
  period wd {Mon-Sun}
   alert remote.alert -H mainmonitor

I can see that trap being sent via that server's /var/log/messages:

Jul 12 16:17:05 branch-1 mon[2649]: failure for OtherServer DRBD_Status 1152739025 DRBD_Not_Running
Jul 12 16:17:05 branch-1 mon[2649]: calling alert remote.alert for OtherServer/DRBD_Status (/opt/mon/alert.d/remote.alert,-H mainmonitor) DRBD_Not_Running

On the mainmonitor server, my mon.cf config is:

watch default
 service default
  description Default trap service
  period wd {Mon-Sun}
   alert mail.alert [EMAIL PROTECTED]

I've got my auth.cf set to receive traps from anyone (* * *).

When mainmonitor gets one of the traps, I'll see this in /var/log/messages:

Jul 11 16:20:04 monitor mon[2017]: trap received for undefined service type default/DRBD_Status

...but nothing will actually get kicked off and no mail is sent. Also, the
mon.cgi program (running on mainmonitor) will stay in the blue/unchecked
status.

Interestingly enough, if I change my mainmonitor server's mon.cf to be:

watch default
 service DRBD_Status
  description Default trap service
  period wd {Mon-Sun}
   alert mail.alert [EMAIL PROTECTED]

I'll get this in the /var/log/messages:

Jul 11 16:24:49 monitor mon[2017]: trap trap 0 from grp=default svc=DRBD_Status, sta=0

...but there is still nothing kicked off and no mail gets sent. But...the
mon.cgi program then shows the default service group to be in the green/good
status.

Any thoughts as to what's going on here? I'm trying to get this working:

- An alert getting kicked off by the mainmonitor's system when it receives a
trap; and
- The mon.cgi program on mainmonitor showing an alert status once it's
received that trap.

Thanks,
Tim
