In message <4c377baa.4060...@umn.edu>,
Tim Peiffer writes:

>How does one set up flow-control throttling for events?  I am using SEC 
>to de-duplicate/debounce, and mostly use it to create tickets that 
>document network availability.  I would rather throttle or delay a 
>ticket rather than ignoring it.
>
>Periodically, I may get power events on campus.  It would be normal for 
>SEC to write 3-5 tickets an hour, but not out of the question to see 
>100-200 tickets for power events.  Unfortunately, the resulting 'flood' 
>tends to crash the SMTP ticket gateway.  Has anyone created a queueing 
>mechanism within the rules for SEC, or will this action need to happen 
>outside of SEC?  I suspect that it would need to be the latter, but I 
>would not mind examples.

The SingleWithThreshold and SingleWith2Thresholds rules can be
used to do flood suppression. They are most easily used when you have
multiple instances of a single indicator (i.e. the flooding event), but
you can also create fingerprints of other failure modes (e.g. I once
suppressed all devices on a switch config when STP was melting
down regularly).
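
As a rough, untested sketch of the two-threshold variant: the rule
below sets a context when three matching events arrive within 5
minutes, and removes it again once 10 minutes pass with no further
matches, so suppression ends when the flood does rather than after a
fixed time. The event text, the context name and the numbers are
assumptions for illustration, and the context only has an effect if a
later rule (such as a suppress rule) tests for it.

  type = SingleWith2Thresholds
  continue = takenext
  ptype = regexp
  pattern = power fail \d+
  desc = power failure flood
  action = create suppress_power_fail_alerts
  window = 300
  thresh = 3
  desc2 = power failure flood has ended
  action2 = delete suppress_power_fail_alerts
  window2 = 600
  thresh2 = 0

This rule has to sit ahead of whatever rule consumes the events, so
that it keeps seeing them while the context is set.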

If you can identify the fingerprint of what a power failure looks
like, you can use a SingleWithThreshold rule and a context to do
flood control. E.g. let's assume that you have three events that come
in when you get a power fail:

  power fail 1
  power fail 2
  power fail 3

and you want to suppress all events for all devices for a period of
say 30 minutes (long enough to get over there and flip a
breaker). Something like (totally untested):

  type = singlewiththreshold
  continue = takenext
  desc = detect power failure pattern
  ptype = regexp
  pattern = power fail (1|2|3)
  action = create suppress_power_fail_alerts 1800 ; pipe \
     'Suppressing future alerts for 30 minutes' \
     mailx -s "power fail detected on nodes 1 2 and 3" trouble
  rem = all three need to come in in 5 minutes
  window = 300
  thresh = 3
    (note this will also trip if you get three power fail 1 alerts but...)

  type = suppress
  desc = suppress power failures
  rem = suppress all events from getting past this point
  ptype = tvalue
  pattern = TRUE
  context = suppress_power_fail_alerts

  other rules that would report on a power fail event if it wasn't
  a building/site wide outage.

This allows the first three indicator events to come through and generate
tickets. 

When the singlewiththreshold rule detects that the power is down in a
larger scope, it sets a context (with a 30 minute lifetime in this
case) that activates the suppress rule, which ignores further events
so they don't start generating a ton of tickets (the suppress rule
can use a regexp pattern to select just particular events but...). At
the same time it sends an email to the trouble system to
escalate/update the folks there. After 30 minutes the
suppress_power_fail_alerts context expires, which deactivates the
suppress rule, and future power fail events will come through and
trigger the following rules as usual.
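
If you only want to drop particular events rather than everything, a
suppress rule with a regexp pattern might look like the following
(untested; the "host down" event text is an assumption, substitute
whatever your events actually look like):

  type = suppress
  desc = suppress only power fail and host down events
  ptype = regexp
  pattern = (power fail \d+|host down)
  context = suppress_power_fail_alerts

Events that don't match the pattern fall through to the later rules
even while the context exists.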

You can get more sophisticated by using contexts to relate facts
together. E.g. suppose you have power bus 1 supplying the power to
devices 1, 2 and 3. To suppress just the devices on the same power bus
you could use:

  type = suppress
  desc = suppress power failures on bus 1
  rem = suppress events only for devices known to be on bus 1
  ptype = regexp
  pattern = power fail (\d+)
  context = suppress_power_fail_alerts_bus1 && device_$1_on_bus_1

where suppress_power_fail_alerts_bus1 would be set by the
singlewiththreshold rule in response to the failure of devices 1, 2
and 3. 
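
A sketch of what that bus-specific threshold rule could look like
(untested). It assumes the device number is the last thing on the
event line, hence the trailing $, so that "power fail 20" is not
mistaken for device 2:

  type = singlewiththreshold
  continue = takenext
  ptype = regexp
  pattern = power fail (1|2|3)$
  desc = detect power failure on bus 1
  action = create suppress_power_fail_alerts_bus1 1800
  window = 300
  thresh = 3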

The device_$1_on_bus_1 contexts for the rest of the devices on the
same bus (say 20, 26 and 34) can be loaded into SEC when it starts (by
a spawn action, for example) or created/deleted on the fly using a
control channel to change the internal state of SEC manually.
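
One way to do this, as an untested sketch: if SEC is started with the
--intevents option, it generates SEC_STARTUP and SEC_RESTART events
that a rule can use to create the contexts directly. The "ctl ..."
command lines in the second and third rules are a made-up format; they
assume you write such lines to a file or named pipe that SEC reads as
one of its inputs.

  type = single
  ptype = regexp
  pattern = ^(SEC_STARTUP|SEC_RESTART)$
  context = SEC_INTERNAL_EVENT
  desc = load bus 1 membership at startup
  action = create device_20_on_bus_1 ; create device_26_on_bus_1 ; \
           create device_34_on_bus_1

  type = single
  ptype = regexp
  pattern = ^ctl add device (\d+) bus (\d+)$
  desc = add device $1 to bus $2
  action = create device_$1_on_bus_$2

  type = single
  ptype = regexp
  pattern = ^ctl del device (\d+) bus (\d+)$
  desc = remove device $1 from bus $2
  action = delete device_$1_on_bus_$2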

So you can get pretty tight flood control if you want to put the
effort into it. For infrequent events (as I would hope power failures
would be), I usually just stop processing host down and a few other
events for all hosts regardless of where they are. Due to bad timing I
may have a host down on the other side of the building (due to a bad
power supply or something) that gets suppressed accidentally by the
power failure flood protection. But the host down condition will
generate a new event for me after we have power back up, and I can
handle that host then. If you don't have active monitoring that will
let you do this, then you need to be more specific in what you select
for flood suppression.

Hopefully this gives you some ideas.

--
                                -- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.

_______________________________________________
Simple-evcorr-users mailing list
Simple-evcorr-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users
