RE: Question on Redistribute

2006-08-25 Thread David Nolan


--On Thursday, August 24, 2006 14:54:05 -0500 Tim Carr [EMAIL PROTECTED] 
wrote:

 - Would it be possible to just send one "everything is OK" trap for a
 new overall check?  Maybe a new monitor script that queries itself to
 see if there are any existing problems and will alert based off that?
 - I'd also continue to send an alert per service if a new service
 problem is detected.
 - On the corporate server, I'd set up only one service per store
 entry that would have the traptimeout monitor (to watch for the
 network outages) but still have a service entry for each server to catch
 any of the specific service outage traps that would be received.


One scenario I can envision that would work, which may be what you're trying 
to describe here, is:
- Services at remote sites are monitored at the desired frequency (10s), 
with traps sent to corporate via alert/upalert, i.e. only when a service 
fails or recovers.
- On the real services, do not configure an alertevery option, so traps are 
resent every 10 seconds in case a UDP packet is dropped.
- You probably would also want a startupalert configured here to set the 
initial status to OK on the corporate server.
- Add one fake service that always returns an OK result, run it once per 
minute and redistribute the status to corporate.  For this service only you 
would want traptimeout configured at corporate.
- Possibly add monitoring of the remote sites from the corporate server, 
including monitoring of Mon itself.
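
The corporate side of the scenario above can be modelled in a few lines of Python. This is only a sketch of the logic, not mon code: the class and method names are hypothetical, the service name "heartbeat" stands in for the fake always-OK service, and the 180-second timeout is an assumed value (the thread does not give one).

```python
class SiteStatus:
    """Toy model of the corporate server: per-service failure state plus
    a heartbeat timeout per site. All names here are hypothetical."""

    def __init__(self, trap_timeout=180.0):
        self.trap_timeout = trap_timeout   # assumed value, in seconds
        self.last_heartbeat = {}           # site -> time of last heartbeat trap
        self.failed = set()                # (site, service) pairs currently failing

    def receive_trap(self, site, service, ok, now):
        if service == "heartbeat":
            # The fake always-OK service: only its arrival time matters.
            self.last_heartbeat[site] = now
        elif ok:
            # upalert: the service has recovered.
            self.failed.discard((site, service))
        else:
            # alert: adding to a set is idempotent, so a failure trap
            # resent every 10 seconds does no harm.
            self.failed.add((site, service))

    def unreachable_sites(self, now):
        # Sites whose heartbeat is older than the timeout are treated
        # as a network outage.
        return sorted(site for site, seen in self.last_heartbeat.items()
                      if now - seen > self.trap_timeout)
```

Timestamps are passed in explicitly so the behaviour is easy to check by hand; a site only becomes known to this model after its first heartbeat, which is roughly the role the startupalert plays above.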


-David

___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon


RE: Question on Redistribute

2006-08-24 Thread Tim Carr
Thanks for all the interest and quick replies here.  A bit more about
what we're trying to do...

Remote site stats:
- There are going to be roughly 1200 remote locations (which could
grow).
- Each location has a pair of servers in it.
- We have 2 servers in the remote site as we're doing a
high-availability failover model.
- We're running mon on both servers in the remote site as each server is
set up to watch key services on itself and the other server.  If a
failure in a critical service/software component is detected, and the
server is the primary one, mon will send an alert to that server to kill
itself and the second one will kick in.
- Each server's Mon instance is watching about 25 things and checking each
one every minute.
- Every time any check is run on either mon server, we're redistributing
it back to the corporate monitoring server.
- We need to tune the failover model to detect things quicker, so about
15 of the checks need to be run every 10 seconds.
- (15 checks * 6 runs/minute) + 10 other traps/minute = 100 traps per
server per minute; * 2 servers per location = 200 traps per minute per
location sent to the corporate monitoring server

Corporate site stats:
- Failures are not expected very often - maybe 1-2 server problems per
day
- In the case of a large network outage, we might see up to a couple of
hundred servers go away.  That would actually reduce load on the server
as we won't be seeing traps from those locations.
- There shouldn't be a lot of load generated from the corporate server
needing to do something about alerts.  If a failure is detected for any
monitor for any site, we're going to send an email once every 4 hours.  
 
Load thoughts:
- From a network perspective, load would be somewhat significant (200
bytes per trap * 200 traps/minute * 1200 sites * 8 bits/byte / 60
seconds) = 6.4 Mbps.
- The server would be responding to (200 traps/min * 1200 sites / 60
seconds) = 4000 traps/second.  That sounds like a whole lot to me.
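
The arithmetic above can be checked with a short script. The figures are the ones given in this message; the 200 bytes per trap is the estimate used above, not a measured value, and the function names are just for illustration.

```python
# Figures from the message above.
FAST_CHECKS = 15          # checks run every 10 seconds
FAST_RUNS_PER_MIN = 6     # 60 s / 10 s
SLOW_TRAPS_PER_MIN = 10   # remaining traps, one per minute each
SERVERS_PER_SITE = 2
SITES = 1200
BYTES_PER_TRAP = 200      # estimate, not measured


def traps_per_server_per_min():
    return FAST_CHECKS * FAST_RUNS_PER_MIN + SLOW_TRAPS_PER_MIN


def traps_per_site_per_min():
    return traps_per_server_per_min() * SERVERS_PER_SITE


def traps_per_second():
    # Aggregate rate arriving at the corporate server.
    return traps_per_site_per_min() * SITES / 60


def bandwidth_mbps():
    return traps_per_second() * BYTES_PER_TRAP * 8 / 1_000_000


print(traps_per_server_per_min())  # 100
print(traps_per_site_per_min())    # 200
print(traps_per_second())          # 4000.0
print(bandwidth_mbps())            # 6.4
```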

Corporate requirements:
 - Be aware when some service has failed at a remote site
 - Be aware of a network outage (i.e., no traps received w/in x
minutes)
 - A centralized view of uptime, what's currently down, and outage history
 - Outage history has to be down to the individual service / server level 



Thanks,
Tim



Re: Question on Redistribute

2006-08-24 Thread David Nolan


--On Thursday, August 24, 2006 10:14:48 -0400 Jim Trocki 
[EMAIL PROTECTED] wrote:

 On Thu, 24 Aug 2006, David Nolan wrote:

 --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr
 [EMAIL PROTECTED]

 The problem is that we're going to need to turn the monitoring period
 for several of the remote site monitors in each location way up - like
 checking every 10 seconds (i.e., interval 10s).  That means we're going
 to see a huge increase in the number of traps we're seeing at the
 corporate site.

 Or we could implement a redistributeevery option, similar to alertevery.
 That wouldn't be too hard, but would take a little work.

 yeah the issue here is the processing and communication overhead of
 dealing with the traps sent remotely. it would make sense to batch up the
 10s traps from the remote systems and send them out in a bundle say, once
 every minute, and that would, you know, save you 6x the processing
 overhead on the remote mon server, or at least give you a way to control
 the processing overhead to suit your needs.

 this use case might mean that it would make sense to move the remote trap
 stuff into the mon server itself, rather than implement it with the trap
 alert. the trap alert is a nice simple abstraction that works well for
 the simpler cases, and an elegant way of extending the functionality of
 mon without having to change the server code, but at the cost of
 efficiency. you would really want the ability to batch up only the trap
 transmissions rather than all alerts. for example, schedule a trap
 queue flush every minute performed by the mon server rather than in the
 trap alert.


I could see benefits to that capability, in addition to the current 
redistribute support.

My original idea for redistribute was that it could be used to integrate 
mon with other systems as well, because it's just an arbitrary script that 
you can provide, i.e. it could send status updates to OpenView, or log 
status updates to a database, or anything else you might want.  The ability 
to use it for integration with remote mon servers is just a bonus...
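
The "trap queue flush every minute" idea quoted above could look roughly like the following. This is a sketch under stated assumptions, not how mon works today: TrapQueue, enqueue, flush and the send callback are hypothetical names, and keeping only the latest status per service is one possible batching policy among several.

```python
class TrapQueue:
    """Collects status updates and sends them as one bundle per interval.
    Hypothetical sketch; none of these names exist in mon."""

    def __init__(self, send, flush_interval=60.0):
        self.send = send                  # callable that transmits one bundle
        self.flush_interval = flush_interval
        self.pending = {}                 # (group, service) -> latest status
        self.last_flush = None

    def enqueue(self, group, service, status, now):
        # Later results for the same service overwrite earlier ones.
        self.pending[(group, service)] = status
        if self.last_flush is None:
            self.last_flush = now
        if now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now):
        if self.pending:
            self.send(dict(self.pending))  # one transmission for the whole bundle
            self.pending.clear()
        self.last_flush = now
```

With a 10-second check and a 60-second flush interval, six results collapse into one transmission. The cost of keeping only the latest status is that a failure which recovers within the interval is never reported, so a real implementation would probably flush immediately on a status change.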

 then this brings up the issue of trap processing overhead on the rx end.
 i wonder if the behavior would be acceptable by just processing the trap
 receptions serially, the way it is done now, or if it would require a
 change in processing method to scale it up efficiently.

For the record, my master server is a 2.8 GHz P4, and basically runs at zero 
load while processing the trap load I described earlier, and running a few 
tests of its own.  I'm sure there is a limit to reasonable trap load, but 
we haven't hit it yet.


 this probably requires much more thought and a better understanding of the
 usage scenario.


I agree.  I suspect Tim's usage scenario involves large numbers of servers, 
each monitoring a relatively small environment, so I doubt he'll have any 
processing load problem.  But we're not quite sure of the scale of Tim's 
setup.

-David





RE: Question on Redistribute

2006-08-24 Thread David Nolan


--On Thursday, August 24, 2006 10:18:56 -0500 Tim Carr [EMAIL PROTECTED] 
wrote:

 4000 traps/second.  That sounds like a whole lot to me.


Holy cr** that's a lot of traps.  Wow, the interesting ways that mon gets 
deployed continue to amaze me...

Even if you were only sending one trap per minute per service you would 
have:
25 services * 1 trap/minute * 2 servers * 1200 sites = 60,000 traps/minute, 
or 1000 traps per second.

That's still a *lot* of traps.  Doing your bandwidth math shows that it's 
still 1.6 Mbps of trap traffic.

I think you might want to make your mon setup more structured, with 
intermediate collection points that pass status changes only to your final 
collection point.
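
An intermediate collection point of the kind suggested above mostly needs to remember the last status it saw and pass on only the differences. A minimal sketch follows; the class name and the forward callback are hypothetical, and this is not an existing mon feature.

```python
class ChangeOnlyForwarder:
    """Sits between remote sites and the final collection point and
    forwards a status only when it differs from the last one seen."""

    def __init__(self, forward):
        self.forward = forward      # callable(site, service, status)
        self.last_status = {}       # (site, service) -> last status seen

    def receive(self, site, service, status):
        key = (site, service)
        if self.last_status.get(key) == status:
            return False            # unchanged: absorb the trap here
        self.last_status[key] = status
        self.forward(site, service, status)
        return True
```

The first trap for each service is always forwarded, since nothing is known about it yet. Note that a collector like this also hides the steady stream of OK traps, so the final collection point would need some other way to notice that a site has gone silent.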

-David





RE: Question on Redistribute

2006-08-24 Thread Tim Carr
Very true - still way too much traffic.

Maybe do something with the dependencies options stuff:

- Would it be possible to just send one "everything is OK" trap for a
new overall check?  Maybe a new monitor script that queries itself to
see if there are any existing problems and will alert based off that?
- I'd also continue to send an alert per service if a new service
problem is detected.
- On the corporate server, I'd set up only one service per store
entry that would have the traptimeout monitor (to watch for the
network outages) but still have a service entry for each server to catch
any of the specific service outage traps that would be received.

That would drop us down to 2400 traps/minute (64 kbps) + any outage
traps.
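
For comparison, here are the three steady-state rates discussed in this thread side by side. The per-server rates (100, 25 and 1 traps per minute) and the 200-byte trap size are the figures quoted in the thread; outage traps are not included, and the load function is just an illustrative helper.

```python
SITES = 1200
SERVERS_PER_SITE = 2
BYTES_PER_TRAP = 200  # estimate used throughout the thread


def load(traps_per_server_per_min):
    """Return (traps/minute, traps/second, kbps) at the corporate server."""
    per_min = traps_per_server_per_min * SERVERS_PER_SITE * SITES
    per_sec = per_min / 60
    kbps = per_sec * BYTES_PER_TRAP * 8 / 1000
    return per_min, per_sec, kbps


scenarios = [
    ("every check redistributed", 100),
    ("one trap per service per minute", 25),
    ("one heartbeat per server per minute", 1),
]
for label, rate in scenarios:
    per_min, per_sec, kbps = load(rate)
    print(f"{label}: {per_min} traps/min, {per_sec:g} traps/s, {kbps:g} kbps")
# every check redistributed: 240000 traps/min, 4000 traps/s, 6400 kbps
# one trap per service per minute: 60000 traps/min, 1000 traps/s, 1600 kbps
# one heartbeat per server per minute: 2400 traps/min, 40 traps/s, 64 kbps
```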

Make sense?

Thanks,
Tim