RE: Question on Redistribute
--On Thursday, August 24, 2006 14:54:05 -0500 Tim Carr [EMAIL PROTECTED] wrote:

> - Would it be possible to just send one "everything is OK" trap for a new
>   overall check? Maybe a new monitor script that queries itself to see if
>   there are any existing problems and will alert based off that?
> - I'd also continue to send an alert per service if a new service problem
>   is detected.
> - On the corporate server, I'd set up only one service per store entry
>   that would have the traptimeout monitor (to watch for the network
>   outages) but still have a service entry for each server to catch any of
>   the specific service outage traps that would be received.

One scenario I can envision that would work, which may be what you're trying to describe here, is:

- Services at remote sites monitored at the desired frequency (10s), with traps sent to corporate via alert/upalert, i.e. only during failures.
- On the real services, do not configure an alertevery option, so traps are resent every 10 seconds in case a UDP packet is dropped.
- You probably would also want a startupalert configured here to set the initial status to OK on the corporate server.
- Add one fake service that always returns an OK result, run it once per minute, and redistribute the status to corporate. For this service only, you would want traptimeout configured at corporate.
- Possibly add monitoring of the remote sites from the corporate server, including monitoring of Mon itself.

-David

___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
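The heartbeat idea in the scenario above (one always-OK service per site, with a timeout at corporate) can be illustrated with a small sketch. This is not mon's traptimeout implementation; the class name `HeartbeatTracker`, the site names, and the 180-second timeout are all hypothetical, chosen only to show the logic of "no trap received within x seconds means the site is unreachable".

```python
import time
from typing import Dict, List, Optional


class HeartbeatTracker:
    """Remembers the last heartbeat per site and reports sites that went silent.

    Hypothetical illustration only; mon's own traptimeout handling is not shown here.
    """

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record(self, site: str, now: Optional[float] = None) -> None:
        # Store the arrival time of the latest heartbeat trap for this site.
        self.last_seen[site] = time.time() if now is None else now

    def silent_sites(self, now: Optional[float] = None) -> List[str]:
        # A site is considered silent once its last heartbeat is older than the timeout.
        current = time.time() if now is None else now
        return sorted(
            site
            for site, seen in self.last_seen.items()
            if current - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    tracker = HeartbeatTracker(timeout_seconds=180)
    tracker.record("store-0001", now=0)
    tracker.record("store-0002", now=100)
    print(tracker.silent_sites(now=200))
```

With one heartbeat per site per minute, a timeout of a few minutes tolerates an occasional dropped UDP packet without raising a false network-outage alarm.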
RE: Question on Redistribute
Thanks for all the interest and quick replies here. A bit more about what we're trying to do...

Remote site stats:

- There are going to be roughly 1200 remote locations (which could grow).
- Each location has a pair of servers in it.
- We have 2 servers in the remote site as we're doing a high-availability failover model.
- We're running mon on both servers in the remote site, as each server is set up to watch key services on itself and the other server. If a failure in a critical service/software component is detected, and the server is the primary one, mon will send an alert to that server to kill itself and the second one will kick in.
- Each server's Mon instance is watching about 25 things, checking each one every minute.
- Every time any check is run on either mon server, we're redistributing it back to the corporate monitoring server.
- We need to tune the failover model to detect things quicker, so about 15 of the traps need to be run every 10 seconds.
- (15 traps * 6 runs/minute) + 10 other traps/minute = 100 traps per server per minute; * 2 servers per location = 200 traps per minute per location sent to the corporate monitoring server

Corporate site stats:

- Failures are not expected very often - maybe 1-2 server problems per day
- In the case of a large network outage, we might see up to a couple of hundred servers go away. That would actually reduce load on the server as we won't be seeing traps from those locations.
- There shouldn't be a lot of load generated from the corporate server needing to do something about alerts. If a failure is detected for any monitor for any site, we're going to send an email once every 4 hours.

Load thoughts:

- From a network perspective, load would be somewhat significant: (200 bytes per trap * 200 traps/minute * 1200 sites * 8 bits/byte / 60 seconds) = 6.4 Mbps.
- The server would be responding to (200 traps/min * 1200 sites / 60 seconds) = 4000 traps/second. That sounds like a whole lot to me.
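The load arithmetic above can be checked with a short script. This is only a back-of-the-envelope sketch using the figures quoted in the message (1200 sites, 2 servers per site, 15 fast checks, 10 slow checks, 200 bytes per trap); the constant and function names are made up for illustration.

```python
# Figures taken from the message above; names are illustrative only.
SITES = 1200
SERVERS_PER_SITE = 2
FAST_CHECKS = 15        # checks run every 10 seconds
FAST_RUNS_PER_MIN = 6   # 60 s / 10 s
SLOW_CHECKS = 10        # remaining checks, run once per minute
TRAP_BYTES = 200        # assumed average trap size


def traps_per_server_per_minute() -> int:
    """Traps one remote server sends per minute."""
    return FAST_CHECKS * FAST_RUNS_PER_MIN + SLOW_CHECKS


def traps_per_second() -> float:
    """Traps arriving at the corporate server per second, across all sites."""
    per_site_per_minute = traps_per_server_per_minute() * SERVERS_PER_SITE
    return per_site_per_minute * SITES / 60


def bandwidth_mbps() -> float:
    """Trap traffic in megabits per second (payload only, no protocol overhead)."""
    return traps_per_second() * TRAP_BYTES * 8 / 1_000_000


if __name__ == "__main__":
    print(traps_per_server_per_minute())  # traps per server per minute
    print(traps_per_second())             # traps per second at corporate
    print(bandwidth_mbps())               # Mbps of trap traffic
```

The bandwidth figure counts trap payload only; UDP and IP headers would add to it.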
Corporate requirements:

- Be aware when some service has failed out in the remote site
- Be aware of a network outage (i.e., no traps received within x minutes)
- A centralized view of uptime, what's currently down, and outage history
- Outage history has to be down to the individual service / server level

Thanks,
Tim

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Trocki
Sent: Thursday, August 24, 2006 9:15 AM
To: David Nolan
Cc: mon@linux.kernel.org
Subject: Re: Question on Redistribute

On Thu, 24 Aug 2006, David Nolan wrote:

> --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr [EMAIL PROTECTED] wrote:
>
> > The problem is that we're going to need to turn the monitoring frequency
> > for several of the remote site monitors in each location way up - like
> > checking every 10 seconds (i.e., interval 10s). That means we're going to
> > see a huge increase in the number of traps we're seeing at the corporate
> > site.
>
> Or we could implement a redistributeevery option, similar to alertevery.
> That wouldn't be too hard, but would take a little work.

yeah the issue here is the processing and communication overhead of dealing with the traps sent remotely. it would make sense to batch up the 10s traps from the remote systems and send them out in a bundle say, once every minute, and that would, you know, save you 6x the processing overhead on the remote mon server, or at least give you a way to control the processing overhead to suit your needs.

this use case might mean that it would make sense to move the remote trap stuff into the mon server itself, rather than implement it with the trap alert. the trap alert is a nice simple abstraction that works well for the simpler cases, and an elegant way of extending the functionality of mon without having to change the server code, but at the cost of efficiency. you would really want the ability to batch up only the trap transmissions rather than all alerts.
for example, schedule a trap queue flush every minute performed by the mon server rather than in the trap alert.

then this brings up the issue of trap processing overhead on the rx end. i wonder if the behavior would be acceptable by just processing the trap receptions serially, the way it is done now, or if it would require a change in processing method to scale it up efficiently.

this probably requires much more thought and a better understanding of the usage scenario.
Re: Question on Redistribute
--On Thursday, August 24, 2006 10:14:48 -0400 Jim Trocki [EMAIL PROTECTED] wrote:

> On Thu, 24 Aug 2006, David Nolan wrote:
>
> > --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr [EMAIL PROTECTED] wrote:
> >
> > > The problem is that we're going to need to turn the monitoring
> > > frequency for several of the remote site monitors in each location way
> > > up - like checking every 10 seconds (i.e., interval 10s). That means
> > > we're going to see a huge increase in the number of traps we're seeing
> > > at the corporate site.
> >
> > Or we could implement a redistributeevery option, similar to alertevery.
> > That wouldn't be too hard, but would take a little work.
>
> yeah the issue here is the processing and communication overhead of
> dealing with the traps sent remotely. it would make sense to batch up the
> 10s traps from the remote systems and send them out in a bundle say, once
> every minute, and that would, you know, save you 6x the processing
> overhead on the remote mon server, or at least give you a way to control
> the processing overhead to suit your needs.
>
> this use case might mean that it would make sense to move the remote trap
> stuff into the mon server itself, rather than implement it with the trap
> alert. the trap alert is a nice simple abstraction that works well for the
> simpler cases, and an elegant way of extending the functionality of mon
> without having to change the server code, but at the cost of efficiency.
> you would really want the ability to batch up only the trap transmissions
> rather than all alerts. for example, schedule a trap queue flush every
> minute performed by the mon server rather than in the trap alert.

I could see benefits to that capability, in addition to the current redistribute support. My original idea for redistribute was that it could be used to integrate mon with other systems as well, because it's just an arbitrary script that you can provide. i.e. it could send status updates to Open View, or log status updates to a database, or anything else you might want.
The ability to use it for integration with remote mon servers is just a bonus...

> then this brings up the issue of trap processing overhead on the rx end.
> i wonder if the behavior would be acceptable by just processing the trap
> receptions serially, the way it is done now, or if it would require a
> change in processing method to scale it up efficiently.

For the record, my master server is a 2.8GHz P4, and basically runs at zero load while processing the trap load I described earlier, and running a few tests of its own. I'm sure there is a limit to reasonable trap load, but we haven't hit it yet.

> this probably requires much more thought and a better understanding of the
> usage scenario.

I agree. I suspect Tim's usage scenario involves large numbers of servers monitoring relatively small environments, so I doubt he'll have any processing load problem. But we're not quite sure of the scale of Tim's setup.

-David
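The batching idea discussed in this message (queue the 10-second results and flush them about once a minute) might look roughly like the sketch below. This is not code from mon; `TrapBatcher`, the `(group, service, status)` tuple shape, and the `send` callback are all hypothetical. The sketch also assumes that only the latest status per service matters within a flush interval.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical trap shape: (group, service, status).
Trap = Tuple[str, str, str]


class TrapBatcher:
    """Queues status updates and sends them as one bundle per flush interval."""

    def __init__(self, send: Callable[[List[Trap]], None],
                 flush_interval: float = 60.0) -> None:
        self.send = send
        self.flush_interval = flush_interval
        self.pending: Dict[Tuple[str, str], str] = {}
        self.last_flush: Optional[float] = None

    def add(self, group: str, service: str, status: str, now: float) -> None:
        # Start the flush clock at the first update seen.
        if self.last_flush is None:
            self.last_flush = now
        # Keep only the most recent status for each (group, service) pair.
        self.pending[(group, service)] = status
        if now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now: float) -> None:
        # Send everything queued so far as a single bundle, then reset.
        if self.pending:
            batch = [(g, s, st) for (g, s), st in sorted(self.pending.items())]
            self.send(batch)
            self.pending.clear()
        self.last_flush = now


if __name__ == "__main__":
    sent: List[List[Trap]] = []
    batcher = TrapBatcher(sent.append, flush_interval=60.0)
    for t in range(0, 70, 10):
        batcher.add("store1", "http", "ok", now=float(t))
    print(sent)
```

One trade-off of coalescing this way: a failure that recovers within a single flush interval never reaches the receiving end, so it may not suit services where every state change has to appear in the outage history.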
RE: Question on Redistribute
--On Thursday, August 24, 2006 10:18:56 -0500 Tim Carr [EMAIL PROTECTED] wrote:

> 4000 traps/second. That sounds like a whole lot to me.

Holy cr** that's a lot of traps. Wow, the interesting ways that mon gets deployed continue to amaze me...

Even if you were only sending one trap per minute per service you would have:

25 services * 1 trap/minute * 2 servers * 1200 sites = 60,000 traps/minute, or 1000 traps per second.

That's still a *lot* of traps. Doing your bandwidth math shows that it's still 1.6 Mbps of trap traffic.

I think you might want to make your mon setup more structured, with intermediate collection points that pass status changes only to your final collection point.

-David
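The suggestion of intermediate collection points that pass along status changes only can be sketched as a small filter. This is a hypothetical illustration, not an existing mon feature; `ChangeOnlyForwarder` and its `receive` method are invented names, and the sketch assumes a status is a plain string such as "ok" or "fail".

```python
from typing import Dict, Optional, Tuple


class ChangeOnlyForwarder:
    """Remembers the last status per (host, service) and forwards only changes."""

    def __init__(self) -> None:
        self.state: Dict[Tuple[str, str], str] = {}

    def receive(self, host: str, service: str,
                status: str) -> Optional[Tuple[str, str, str]]:
        """Return the update to forward upstream, or None if nothing changed."""
        key = (host, service)
        if self.state.get(key) == status:
            return None  # same status as last time: suppress
        self.state[key] = status
        return (host, service, status)


if __name__ == "__main__":
    forwarder = ChangeOnlyForwarder()
    print(forwarder.receive("store1-a", "http", "ok"))    # first report is forwarded
    print(forwarder.receive("store1-a", "http", "ok"))    # repeat is suppressed
    print(forwarder.receive("store1-a", "http", "fail"))  # change is forwarded
```

Because repeats are suppressed, the final collection point can no longer tell a healthy site from an unreachable one by trap arrival alone, so a design like this would still need a periodic heartbeat from each collector.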
RE: Question on Redistribute
Very true - still way too much traffic. Maybe do something with the dependencies options stuff:

- Would it be possible to just send one "everything is OK" trap for a new overall check? Maybe a new monitor script that queries itself to see if there are any existing problems and will alert based off that?
- I'd also continue to send an alert per service if a new service problem is detected.
- On the corporate server, I'd set up only one service per store entry that would have the traptimeout monitor (to watch for the network outages) but still have a service entry for each server to catch any of the specific service outage traps that would be received.

That would drop us down to 2400 traps/minute (64 kbps) + any outage traps.

Make sense?

Thanks,
Tim

-----Original Message-----
From: David Nolan [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 24, 2006 2:40 PM
To: Tim Carr; Jim Trocki
Cc: mon@linux.kernel.org
Subject: RE: Question on Redistribute

--On Thursday, August 24, 2006 10:18:56 -0500 Tim Carr [EMAIL PROTECTED] wrote:

> 4000 traps/second. That sounds like a whole lot to me.

Holy cr** that's a lot of traps. Wow, the interesting ways that mon gets deployed continue to amaze me...

Even if you were only sending one trap per minute per service you would have:

25 services * 1 trap/minute * 2 servers * 1200 sites = 60,000 traps/minute, or 1000 traps per second.

That's still a *lot* of traps. Doing your bandwidth math shows that it's still 1.6 Mbps of trap traffic.

I think you might want to make your mon setup more structured, with intermediate collection points that pass status changes only to your final collection point.

-David
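The "overall check" proposed in this message, one summary result per server instead of one trap per service, could be rolled up along these lines. This is only a sketch under stated assumptions: `overall_status` is a hypothetical helper, and how a monitor script would actually obtain the per-service statuses from the local mon server is not shown.

```python
from typing import Dict, List, Tuple


def overall_status(services: Dict[str, str]) -> Tuple[str, List[str]]:
    """Collapse per-service statuses into one summary plus the failing names.

    `services` maps a service name to "ok" or any other status string;
    anything other than "ok" is treated as a failure.
    """
    failing = sorted(name for name, status in services.items() if status != "ok")
    return ("ok" if not failing else "fail", failing)


if __name__ == "__main__":
    print(overall_status({"http": "ok", "db": "ok"}))
    print(overall_status({"http": "fail", "db": "ok"}))
```

With one summary per server per minute, the volume matches the figure in the message: 1 trap * 2 servers * 1200 sites = 2400 traps/minute, and at 200 bytes per trap that is 2400 * 200 * 8 / 60 = 64,000 bits per second.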