In message <4c377baa.4060...@umn.edu>, Tim Peiffer writes:
>How does one set up flow-control throttling for events? I am using SEC
>to de-duplicate/debounce, and mostly use it to create tickets that
>document network availability. I would rather throttle or delay a
>ticket rather than ignoring it.
>
>Periodically, I may get power events on campus. It would be normal for
>SEC to write 3-5 tickets an hour, but not out of the question to see
>100-200 tickets for power events. Unfortunately, the resulting 'flood'
>tends to crash the SMTP ticket gateway. Has anyone created a queueing
>mechanism within the rules for SEC, or will this action need to happen
>outside of SEC? I suspect that it would need to be the latter, but I
>would not mind examples.

The SingleWithThreshold or SingleWith2Thresholds rules can be used to do
flood suppression. They are most easily used when you have multiple
instances of a single indicator (i.e. the flooding event), but you can
create fingerprints of other failure modes (e.g. I once suppressed all
devices on a switch config when STP was melting regularly).

If you can identify the fingerprint of what a power failure looks like,
you can use a SingleWithThreshold rule and a context to do flood control.

E.g. let's assume that you have three events that come in when you get a
power fail:

   power fail 1
   power fail 2
   power fail 3

and you want to suppress all events for all devices for a period of, say,
30 minutes (long enough to get over there and flip a breaker).

Something like (totally untested):

   type = singlewiththreshold
   continue = takenext
   desc = detect power failure pattern
   ptype = regexp
   pattern = power fail (1|2|3)
   action = create suppress_power_fail_alerts 1800 ; pipe \
            'Suppressing future alerts for 30 minutes' \
            mailx -s "power fail detected on nodes 1 2 and 3" trouble
   rem = all three need to come in in 5 minutes
   window = 300
   thresh = 3

(note this will also trip if you get three power fail 1 alerts but...)

   type = suppress
   desc = suppress power failures
   rem = suppress all events from getting past this point
   ptype = tvalue
   pattern = TRUE
   context = suppress_power_fail_alerts

followed by the other rules that would report on a power fail event if it
wasn't a building/site-wide outage.

This allows the first three indicator events to come through and
generate tickets. When the SingleWithThreshold rule detects that the
power is down in a larger scope, it sets a context (30 minute duration in
this case) that activates the suppress rule, so future events are ignored
and don't start generating a ton of tickets (the suppress rule can use a
regexp pattern to select just particular events but...). At the same
time it sends an email to the trouble system to escalate/update the
folks there.

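As a rough, untested sketch of that narrower variant: the suppress rule
can take a regexp instead of TRUE, so that only selected events are
dropped while the context exists. The "host down" event text here is
hypothetical; substitute whatever your own events look like.

```
type = suppress
desc = suppress only power fail and host down events during an outage
rem = everything else still reaches the rules below
ptype = regexp
pattern = (power fail \d+|host down)
context = suppress_power_fail_alerts
```

Events that don't match the pattern continue on to the later rules even
while suppress_power_fail_alerts is set.
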
After 30 minutes the suppress_power_fail_alerts context expires, which
deactivates the suppress rule, and future power fail events come through
and trigger the following rules as usual.

You can get more sophisticated using contexts to relate facts together.
E.g. suppose you have power bus 1 supplying the power to devices 1, 2
and 3. To suppress just the devices on the same power bus you could use:

   type = suppress
   desc = suppress power failures
   rem = suppress all events from getting past this point
   ptype = regexp
   pattern = power fail (\d+)
   context = suppress_power_fail_alerts_bus1 && device_$1_on_bus_1

where suppress_power_fail_alerts_bus1 would be set by the
SingleWithThreshold rule in response to the failure of devices 1, 2
and 3. The device_$1_on_bus_1 contexts for the rest of the devices on
the same bus (say 20, 26 and 34) can be loaded into SEC when it starts
(by a spawn action, for example) or dynamically created/deleted on the
fly using a control channel to change the internal state of SEC
manually.

So you can get pretty tight flood control if you want to put the effort
into it. For infrequent events (as I would hope power failures would
be), I usually just stop processing host down and a few other events for
all hosts regardless of where they are. Due to bad timing I may have a
host down on the other side of the building (due to a bad power supply
or something) that was suppressed accidentally by the power failure
flood protection. But the host down condition will generate a new event
for me after we have power back up, and I can handle that host then. If
you don't have active monitoring that will let you do this, then you
need to be more specific in what you select for flood suppression.

Hopefully this gives you some ideas.

--
				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.