Re: [Simple-evcorr-users] What is the proper use of eval and perl function calls? (long feature proposal too)

Tim Peiffer Sun, 27 Apr 2008 16:50:27 -0700

John,

The sketch you offered should accomplish most of what I want. The one 
issue remaining is that the overflow is a counter rather than a gauge.  
I am capturing socket overflows, dropped full socket buffers, 
udpInOverflows and packet receive errors from netstat -s.  I wrote the 
monitor a number of years ago. I suppose I could just keep a local file 
to cache the last counter, but I was hoping I wouldn't need to do that.  
Is there a way to cache values from the previous run in a context?
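
For illustration, the counter-to-gauge conversion asked about above could be sketched as follows, in Python rather than Perl. The class name is made up, and treating a counter that goes backwards as a restart from zero is an assumption, not something SEC or netstat defines.

```python
class CounterDelta:
    """Turn a monotonically increasing counter into per-run deltas."""

    def __init__(self):
        self.previous = None  # last raw counter value seen, if any

    def update(self, reading):
        """Return the increase since the previous reading.

        Returns None for the first reading (nothing to compare against).
        Assumption: a reading lower than the previous one means the
        counter restarted from zero, so the reading itself is the delta.
        """
        if self.previous is None:
            delta = None
        elif reading < self.previous:
            delta = reading
        else:
            delta = reading - self.previous
        self.previous = reading
        return delta


gauge = CounterDelta()
deltas = [gauge.update(v) for v in (244477, 244480, 244480, 12)]
print(deltas)
```

The delta, not the raw counter, is what would then be compared against Y.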


The log traces are run every 5 minutes off of cron, and are a self 
measure of performance (or lack thereof).


Thanks a lot for getting me started.   I am fairly new to SEC and I 
couldn't seem to get my head around the problem.

Regards,
Tim Peiffer

John P. Rouillard wrote:
> In message <[EMAIL PROTECTED]>,
> Tim Peiffer writes:
>   
>> What is the proper use of eval and perl function 
>> calls for comparisons?
>>
>> Consider the following log trace:
>> 61 msec sz 212456 rss 210232 sock ovfl 244477
>>
>> Given that each log line has timings, process sizes and a count of UDP 
>> socket overflows, I wish to compare, say, the last 5 traces, and if all 
>> have timings that exceed X msec, I wish to restart the service.
>>     
>
> Case 1, no problem. 
>
>   
>> Similarly, if I receive more than Y socket overflows in the past 5 
>> traces, I wish to restart.
>>     
>
> This is Case 2. I will assume you mean that each of the prior 5
> consecutive log entries had a sock ovfl > Y, and not that the sum of
> the socket overflows for the prior 5 entries is > Y. The latter is
> left as an exercise for the reader (although a hint is given below).
>
>   
>> And again, if the process size is greater than Z Mbyte, I wish to restart.
>>     
>
> Case 3. Not a problem.
>
> One question: are the log lines emitted on a regularly timed basis, or
> do they arrive randomly, so you can't say that 5 log lines arrive in 5
> minutes? For the example below I assume that the lines arrive
> randomly and we are counting consecutive occurrences.
>
> SEC's threshold windows are time based, not event based, so counting M
> matching events in a window of N events usually requires that the
> events arrive on a regular schedule so the N event window can be
> expressed as a period of time.
>
> So something like:
>
> ==============
>   # case 3 first
>   type=single
>   desc= rule 1: size too large restart
>     rem= use takenext so every event is analyzed 3 times:
>     rem= size, time, overflows
>   continue=takenext
>   ptype=pattern
>     rem= $1 is the time to run, $2 is the size,
>     rem=  and $4 is overflows
>     rem= same pattern is used for every rule below because I am lazy
>   pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
>     rem= if size greater than 512000 bytes restart 
>   context = =($2 > 512000)
>   action = shellcmd /etc/init.d/program restart
>
>   # case 1
>   type=singlewiththreshold
>   desc= rule 2: each of the last 5 consecutive runs took too long
>   continue=takenext
>     rem= I don't remember if 0 is allowed here. If not we don't want the
>     rem= window to expire and slide, so set to a really large number.
>   window=0
>   thresh=5
>   ptype=pattern
>   pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
>   context = =($1 > X)
>     rem= restart the program and reset the rule, otherwise the rule won't
>     rem= fire again until the window runs out (which is never).
>   action = shellcmd /etc/init.d/program restart; reset 0 %s
>
>   type=single
>   desc=  rule 3: timing ok for this run, reset threshold test
>   continue=takenext
>   ptype=pattern
>   pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
>   context = =($1 <= X)
>     rem= reset the prior single with threshold as we have received
>     rem= a log entry in which the timing is ok, and the 5 count
>     rem= above needs to start from 0 again.
>   action = reset -1 rule 2: each of the last 5 consecutive runs took too long
>
>   # case 2
>   # see comments for prior rule pair and apply below. Same idea
>   # just different parameter
>   type=singlewiththreshold
>   desc= rule 4: each of last 5 consecutive runs had more than Y overflows
>   continue=takenext
>   window=0
>   thresh=5
>   ptype=pattern
>   pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
>   context = =($4 > Y)
>   action = shellcmd /etc/init.d/program restart; reset 0 %s
>
>   # note no cont=takenext, this consumes the event.
>   type=single
>   desc= overflows ok for this run, reset threshold test
>   ptype=pattern
>   pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
>   context = =($4 <= Y)
>   action = reset -1 rule 4: each of last 5 consecutive runs had more than Y overflows
> ============
>
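
Each threshold rule above, together with the reset rule that follows it, amounts to counting consecutive bad runs and starting over on any good one. A rough Python model of that logic, outside SEC; the class and method names are made up for illustration:

```python
class ConsecutiveCounter:
    """Fire once `thresh` bad samples have arrived in a row."""

    def __init__(self, thresh=5):
        self.thresh = thresh
        self.count = 0

    def add(self, bad):
        """Record one sample; return True when the restart should happen."""
        if not bad:
            # A good sample plays the role of the 'reset -1 ...' rule.
            self.count = 0
            return False
        self.count += 1
        if self.count == self.thresh:
            # Mirrors 'reset 0 %s' in the action: start counting afresh.
            self.count = 0
            return True
        return False


counter = ConsecutiveCounter(thresh=5)
samples = [True] * 4 + [False] + [True] * 5
results = [counter.add(bad) for bad in samples]
print(results)
```

Four bad runs followed by a good one do not trigger; only the fifth bad run in an unbroken streak does.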
> If you need to count the total number of socket overflows over the prior
> 5 events, and restart only if the sum is > Y, a context like:
>
>   =(unshift @overflow, $4; $#overflow=4; $sum=0; map {$sum+=$_} @overflow; return $sum > Y)
>
> may work. It adds the new overflow value ($4) at the front of the
> @overflow array, then fixes the array at 5 elements (i.e. drops
> anything with index > 4). Finally it uses map to sum all the elements
> in the array and compares the sum against Y.
>
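
The same "sum of the last 5 values" idea, sketched in Python for readers who want to try it outside SEC. The function name and the limit of 100 are placeholders for illustration; a bounded deque stands in for the unshift-and-truncate trick.

```python
from collections import deque


def make_overflow_check(limit, size=5):
    """Return a function that reports whether the last `size` values sum above `limit`."""
    recent = deque(maxlen=size)  # the oldest value falls off automatically

    def check(overflow):
        recent.appendleft(overflow)  # newest value goes to the front
        return sum(recent) > limit

    return check


check = make_overflow_check(limit=100)
flags = [check(v) for v in (30, 30, 30, 30, 30, 0, 0)]
print(flags)
```

Until five values have been seen, the sum simply covers whatever has arrived so far.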
> Tim can probably stop reading now as the following is a discussion on
> how to make this easier and more flexible in the case where the events
> you want to count don't arrive on a regular schedule. I have been
> fighting this issue for a while (since just after the 2.0 release of
> SEC) and come up with some very hairy multi-rule correlations that
> have caused me to lose even more of my hair (so no, I won't share
> them). While I think it can be reduced to a couple of rules and some
> perl functions, it's still messy and requires keeping multiple rules
> in sync, and it seems like something that the threshold rules should
> support natively.
>
> From this example there are a couple of places SEC could be made better:
>
>    1) support a window value of 0 for threshold operations (if not
>       already supported)
>
> so that the window is infinite and never slides.
>
> What would also be nice is some way to have a window that is not time
> based but event based. One idea I had was to allow the threshold rule
> to count two different categories of events. To do this we add an
> eventwindowcontext parameter and an eventwindow parameter:
>
>   type=singlewiththreshold
>   desc= rule 4: each of last 5 consecutive runs had more than Y overflows
>   continue=takenext
>   window=0
>   thresh=5
>   ptype=pattern
>   pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
>   context = =($4 > Y)
>   eventwindow=5
>   eventwindowcontext = =($4 <= Y)
>
> where eventwindow is the number of events that should be in the window
> where thresh is counted.  (Note @@, it's late and the code and
> examples below actually implement the number of events in the window
> as eventwindow-1. Sorry, but I am not going to go back and fix that.)
> Eventwindowcontext is a context that selects events that should be in
> the window, but not counted towards the thresh. Note that thresh must
> be less than or equal to eventwindow; if it is larger it will never be
> met.
>
> So our two categories for events are:
>
>      1) ones where context (if present) is true
>      2) ones where another context (eventwindowcontext) (if present) is true
>
> Both event categories are counted as eventwindow events, but only
> events in 1 are counted as thresh events.  So we have two counters:
> thresh counter (tc) and eventwindow counter (ec) and things work as
> follows:
>
> When an event comes in:
>
>   if event not matched by pattern
>      skip the event
>
>   if context is enclosed in []'s and is false
>      skip the event
>
>   # event matches pattern past here
>
>     if neither context nor eventwindowcontext defined
>        increment tc (treat as though context is true,
>                      same as current operation)
>
>     if context is defined and eventwindowcontext is not defined  *
>       if context is true: increment tc
>       if context is false: skip event
>       
>     if context is not defined and eventwindowcontext is defined  **
>       increment tc and ec
>
>     if both context and eventwindowcontext are defined
>       if context is true and eventwindowcontext is any:
>             increment ec and increment tc
>       if context is false and eventwindowcontext is true:
>             increment ec
>       if context is false and eventwindowcontext is false:
>             skip event
>       if context is true and eventwindowcontext is true:
>             well this is most likely an error, but context takes
>             precedence so it increments ec and tc.
>
>   
>   if window != 0 and the time from the oldest event to the
>            newest event exceeds 'window'                        *** 
>      for each event older than current time-window:
>                (note same as rules when ec = eventwindow)
>        if the event incremented tc when it was added
>           decrement both tc and ec by 1
>           (unless ec is 0 which is its min value) 
>        else (the oldest event must only have incremented ec)
>           decrement only ec by 1
>
>   if the event causes tc to equal thresh: trigger the action 
>       and remain idle until current time is (oldest event time)+window
>       (same as current)
>
>   if the event causes ec to equal eventwindow: slide the window 
>          by removing the oldest event (see note @@ above)
>      if the oldest event incremented tc when it was added
>         decrement both tc and ec by 1
>         (unless ec is 0 which is its min value) 
>      else (the oldest event must only have incremented ec)
>          decrement only ec by 1
>
>
>  * If eventwindow is defined in this case, I claim it's an error.
>    If it is not defined, then this is the current threshold.
>
>  ** Not defining both context and eventwindowcontext is probably an
>     error as well.
>
>  *** if window = 0, there is no time based window and this path
>      is never run.
>
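
A toy Python model of the tc/ec bookkeeping described above may make the order of operations easier to follow. It is only a sketch of the proposal, not SEC code: the class and method names are invented, the caller is assumed to have already decided whether an event belongs in the window, and the "remain idle after the action" behaviour is not modelled. As in note @@, the window effectively holds eventwindow-1 events after a slide.

```python
from collections import deque


class EventWindowThreshold:
    """Toy model of the proposed thresh/eventwindow counters."""

    def __init__(self, thresh, eventwindow, window=0):
        self.thresh = thresh
        self.eventwindow = eventwindow
        self.window = window   # seconds; 0 means no time-based window
        self.events = deque()  # (timestamp, counted_toward_thresh)
        self.tc = 0            # thresh counter
        self.ec = 0            # eventwindow counter

    def _drop_oldest(self):
        _, counted = self.events.popleft()
        self.ec -= 1
        if counted:
            self.tc -= 1

    def add(self, timestamp, counted):
        """Record one event that belongs in the window.

        counted=True:  'context' was true (increments tc and ec).
        counted=False: only 'eventwindowcontext' was true (increments ec).
        Returns True when this event brings tc up to thresh.
        """
        self.events.append((timestamp, counted))
        self.ec += 1
        if counted:
            self.tc += 1

        # Time-based slide; skipped entirely when window == 0.
        if self.window:
            while timestamp - self.events[0][0] > self.window:
                self._drop_oldest()

        fired = counted and self.tc == self.thresh

        # Event-based slide: drop the oldest event once the window is full.
        if self.ec == self.eventwindow:
            self._drop_oldest()
        return fired


# One event that only matches eventwindowcontext, in second position.
seq = [True, False, True, True, True, True, True]
rule = EventWindowThreshold(thresh=5, eventwindow=5)
fired = [rule.add(0, counted) for counted in seq]
print(fired)
```

With window=0 the timestamps are ignored, so the demo passes 0 for all of them.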
> Ok, now for a couple of examples.
>
> Start with an example where we have 5 consecutive true values for
> context, and window=0, eventwindow=5, thresh=5:
>
>   event 1 context true (thresh count tc=1 eventwindow count ec=1)
>   event 2 context true  (tc=2 ec=2)
>   event 3 context true  (tc=3 ec=3)
>   event 4 context true  (tc=4 ec=4)
>   event 5 context true  (tc=5 ec=5)
>
> at event 5, tc=thresh and the threshold rule executes the action.
> Since window=0, the rule must be reset in the action otherwise it will
> never fire again.
>
> Now let's see what happens if we have a non context matching event at
> 2 with window=0, eventwindow=5, thresh=5:
>
>   event 1 context true (tc=1 ec=1)
>   event 2 does not match context, but matches eventwindowcontext (tc=1 ec=2)
>   event 3 context true (tc=2 ec=3)
>   event 4 context true (tc=3 ec=4)
>   event 5 context true (tc=4 ec=5) window slide (tc=3, ec=4)
>   event 6 context true (tc=4 ec=5) window slide (tc=4 ec=4)
>   event 7 context true (tc=5 ec=5)
>
> Now we see an event 2 that doesn't increment the thresh count, but
> does increment the eventwindow count. When we reach event 5, tc is <
> thresh, so an action isn't executed. But ec = eventwindow and just as
> though we had exceeded a time window, the comparison window shifts
> event 1 out and the counts are changed to tc=3, ec=4. Then event 6
> comes in, tc is still < thresh and ec is once again equal to 5. So
> shift event 2 out of the window.  Event 2 didn't match 'context' when
> it was accepted, so we only end up decrementing ec and not tc so tc=4
> and ec=4. Now event 7 comes in and tc=5 and ec=5 and the action
> triggers.
>
> What happens if window=0, eventwindow=5, thresh=3 (so we need 60% of
> the events to trigger) with the 7 event sequence above:
>
>   event 1 context true (tc=1 ec=1)
>   event 2 context false, eventwindowcontext true (tc=1 ec=2)
>   event 3 context true (tc=2 ec=3)
>   event 4 context true (tc=3 ec=4)
>
> At event 4 we fire the threshold action. Note that ec <
> eventwindow. This is fine since regardless of what happens when we
> reach ec=eventwindow we have met the threshold.
>
> OK, let's take the last two cases and run them with window=60.
>
> Now let's see what happens if we have a non context matching event at
> 2 with window=60 eventwindow=5, thresh=5:
>
>   t=0  event 1 context true (tc=1 ec=1)
>   t=10 event 2 context false, eventwindowcontext true (tc=1 ec=2)
>   t=15 event 3 context true (tc=2 ec=3)
>   t=20 event 4 context true (tc=3 ec=4)
>   t=30 event 5 context true (tc=4 ec=5) window slide (tc=3, ec=4) *
>   t=71 event 6 context true (tc=4 ec=5) window slide (tc=4 ec=4) **
>   t=72 event 7 context true (tc=5 ec=5)
>
> When we reach event 5, ec = eventwindow and we shift event 1 away as
> before.  Then event 6 comes in, and event 2 is the oldest. Because
> time(event 6) - time(event 2) = 61 > window (60) we shift event 2
> away. So the window slide at * was due to exceeding the eventwindow
> size. But the window shift at ** was due to exceeding the timing
> constraint.  As before event 2 didn't match 'context' when it was
> accepted, so we only end up decrementing ec and not tc so tc=4 and
> ec=4. Now event 7 comes in and the time constraint is ok, tc=5 and
> ec=5 and the action triggers.
>
> What happens if thresh=3 and eventwindow=5 (so we need 60% of the
> events to trigger) with the 7 event sequence above, window=15 and
> slightly different timing:
>
>   t=0  event 1 context true (tc=1 ec=1)
>   t=10 event 2 context false, eventwindowcontext true (tc=1 ec=2)
>   t=31 event 3 context true (tc=2 ec=3) window slide (tc=1 ec=1)
>   t=40 event 4 context true (tc=2 ec=2)
>   t=47 event 5 context true (tc=3 ec=3) window slide (tc=1, ec=1)
>   t=50 event 6 context true (tc=2 ec=2) 
>   t=53 event 7 context true (tc=3 ec=3)
>
> Events 1 and 2 proceed normally. At event 3, events 1 and 2 are
> outside the window and are discarded, leaving only event 3. Events 4
> and 5 arrive, but the arrival of event 5 causes the window to be
> exceeded and events 3 and 4 are discarded. Then events 6 and 7 arrive
> so that 5, 6, 7 are all within the window and the action is triggered.
>
> The pseudocode for the SingleWith2Thresholds rule is the same for the
> first threshold, and the second threshold looks like the first
> threshold algorithm up to the point where the values of tc and ec are
> checked. The window is adjusted first and tc must be less than thresh.
>
>   if the event causes ec to equal eventwindow, slide the window 
>          by removing the oldest event
>      if the oldest event incremented tc when it was added
>         decrement both tc and ec by 1
>         (unless ec is 0 which is its min value) 
>      else (the oldest event must only have incremented ec)
>          decrement only ec by 1
>
>   if the event causes tc to drop below thresh: trigger the action
>       and end correlation
>
> so we change the window size before the threshold check.
>
> Quips, comments, evasions, questions or answers?
>
>
> Also if this is implemented, then it should be possible to handle one
> more correlation detection operation over a random amount of data:
>
>   I want to detect a failure rate of more than 70% using a 5 minute window
>
> or its equivalent:
>
>   I want to detect a passing rate of less than 30% using a 5 minute window
>
> Using a syntax like:
>
>   type=singlewiththreshold
>   window=300
>   thresh=70%                        <-- note the % sign
>   context = =($4 eq 'fail')         <-- increments tc and ec
>   eventwindowcontext = =($4 eq 'pass')  <-- increments ec only
>
> The failure rate is simply tc/ec (tc is never greater than ec if
> eventwindowcontext is defined). However, there are some gotchas lurking here.
>
>    What happens when your first event is a failure? 100% failure > 70%
>
> Setting a minimum number of data points before the evaluation occurs
> helps here. This also prevents trying to evaluate 0/0 8-).
>
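
The percentage idea, including the minimum number of data points, can be sketched in Python as below. This is an illustration of the proposal only: the class name and the min_events parameter (and its default of 5) are invented here, and nothing like them exists in SEC.

```python
from collections import deque


class FailureRate:
    """Report when failures exceed `limit` of the events in a time window."""

    def __init__(self, limit=0.70, window=300, min_events=5):
        self.limit = limit
        self.window = window          # seconds
        self.min_events = min_events  # hypothetical guard against early firing
        self.events = deque()         # (timestamp, failed)

    def add(self, timestamp, failed):
        """Record a pass/fail event; return True if the rate is exceeded."""
        self.events.append((timestamp, failed))
        # Discard events that have aged out of the time window.
        while timestamp - self.events[0][0] > self.window:
            self.events.popleft()
        ec = len(self.events)
        if ec < self.min_events:
            return False  # too few data points to judge (also avoids 0/0)
        tc = sum(1 for _, was_failure in self.events if was_failure)
        return tc / ec > self.limit


rate = FailureRate()
first = rate.add(0, True)  # a lone failure is 100%, but below min_events
print(first)
```

A single initial failure no longer trips the rule, because the rate is not evaluated until min_events events are in the window.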
> --
>                               -- rouilj
> John Rouillard
> ===========================================================================
> My employers don't acknowledge my existence much less my opinions.
>   


