Re: [Simple-evcorr-users] What is the proper use of eval and perl function calls? (long feature proposal too)

John P. Rouillard Sun, 27 Apr 2008 00:20:30 -0700

In message <[EMAIL PROTECTED]>,
Tim Peiffer writes:
> What is the proper use of eval and perl function 
> calls for comparisons?
>
> Consider the following log trace:
> 61 msec sz 212456 rss 210232 sock ovfl 244477
>
>Given that the log lines has timings, process sizes and count of UDP 
>socket overflows, I wish to compare say the last 5 traces, and if all 
>have timings that exceed X msec, I wish to restart the service.


Case 1, no problem. 

>Similarly If I receive more than Y socket overflows in the past 5 
>traces, I wish to restart

This is Case 2. I will assume you mean that each of the prior 5 consecutive
log entries had a sock ovfl > Y, and not that the sum of the socket overflows
for the prior 5 entries is > Y. The latter is left as an exercise for the
reader (although a hint is given below).

>And again, if the process size is greater than Z Mbyte, I wish to restart.

Case 3. Not a problem.

One question: are the log lines emitted on a regularly timed basis, or
do they arrive randomly, so that you can't say that 5 log lines arrive in 5
minutes? For the example below I assume that the lines arrive
randomly and we are counting consecutive occurrences.

SEC's threshold windows are time based not event based so counting M
matching events in a window of N events usually requires that the
events arrive on a regular schedule so the N event window can be
expressed as a period of time.

So something like:

==============
  # case 3 first
  type=single
  desc= rule 1: size too large restart
    rem= use takenext so every event is analyzed 3 times:
    rem= size, time, overflows
  continue=takenext
  ptype=pattern
    rem= $1 is the time to run, $2 is the size,
    rem=  and $4 is overflows
    rem= same pattern is used for every rule below because I am lazy
  pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
    rem= if size greater than 512000 bytes restart 
  context = =($2 > 512000)
  action = shellcmd /etc/init.d/program restart

  # case 1
  type=singlewiththreshold
  desc= rule 2: each of the last 5 consecutive runs took too long
  continue=takenext
    rem= I don't remember if 0 is allowed here. If not we don't want the
    rem= window to expire and slide, so set to a really large number.
  window=0
  thresh=5
  ptype=pattern
  pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
  context = =($1 > X)
    rem= restart the program and reset the rule, otherwise the rule won't
    rem= fire again until the window runs out (which is never).
  action = shellcmd /etc/init.d/program restart; reset 0 %s

  type=single
  desc=  rule 3: timing ok for this run, reset threshold test
  continue=takenext
  ptype=pattern
  pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
  context = =($1 <= X)
    rem= reset the prior single with threshold as we have received
    rem= a log entry in which the timing is ok, and the 5 count
    rem= above needs to start from 0 again.
  action = reset -1 rule 2: each of the last 5 consecutive runs took too long

  # case 2
  # see comments for prior rule pair and apply below. Same idea
  # just different parameter
  type=singlewiththreshold
  desc= rule 4: each of last 5 consecutive runs had more than Y overflows
  continue=takenext
  window=0
  thresh=5
  ptype=pattern
  pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
  context = =($4 > Y)
  action = shellcmd /etc/init.d/program restart; reset 0 %s

  # note no cont=takenext, this consumes the event.
  type=single
  desc= overflows ok for this run, reset threshold test
  ptype=pattern
  pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
  context = =($4 <= Y)
  action = reset -1 rule 4: each of last 5 consecutive runs had more than Y overflows
============

If you need to sum the socket overflows over the prior 5 events and
restart only if the sum is > Y, a context like:

  =(unshift @overflow, $4; $#overflow=4; $sum=0; map {$sum+=$_} @overflow; return $sum > Y)

may work. What this does is add the new overflow value ($4) at the
front of the @overflow array. Then it truncates the array to 5 elements
(i.e. it drops anything with index > 4). Then it uses map to
sum all the elements in the array and compares the sum against Y.
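
For anyone who wants to try the sliding-sum idea outside of SEC, here is a minimal sketch in Python rather than Perl. The names make_overflow_checker, limit and size are made up for illustration and are not part of SEC; limit plays the role of Y.

```python
from collections import deque


def make_overflow_checker(limit, size=5):
    """Return a function that tracks the last `size` overflow counts."""
    recent = deque(maxlen=size)  # the oldest value falls off automatically

    def check(overflow):
        # Record the newest count, then compare the running sum to the limit.
        recent.append(overflow)
        return sum(recent) > limit

    return check


check = make_overflow_checker(limit=10)
results = [check(v) for v in [3, 3, 3, 3, 3, 0, 0, 0]]
print(results)
```

The deque with maxlen does the same job as the unshift plus $#overflow=4 truncation: only the 5 most recent values take part in the sum.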

Tim can probably stop reading now as the following is a discussion on
how to make this easier and more flexible in the case where the events
you want to count don't arrive on a regular schedule. I have been
fighting this issue for a while (since just after the 2.0 release of
SEC) and have come up with some very hairy multi-rule correlations that
have caused me to lose even more of my hair (so no, I won't share
them). While I think it can be reduced to a couple of rules and some
perl functions, it's still messy and requires keeping multiple rules
in sync, and it seems like something that the threshold rules should
support natively.

From this example there are a couple of places SEC could be made better:

   1) support a window value of 0 for threshold operations (if not
      already supported)

so that the window is infinite and never slides.

What would also be nice is some way to have a window that is not time
based but event based. One idea I had was to allow the threshold rule
to count two different categories of events. To do this we add an
eventwindowcontext parameter and an eventwindow parameter:

  type=singlewiththreshold
  desc= rule 4: each of last 5 consecutive runs had more than Y overflows
  continue=takenext
  window=0
  thresh=5
  ptype=pattern
  pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
  context = =($4 > Y)
  eventwindow=5
  eventwindowcontext = =($4 <= Y)

where eventwindow is the number of events that should be in the window
where thresh is counted.  (Note @@, it's late and the code and
examples below actually implement the number of events in the window
as eventwindow-1. Sorry, but I am not going to go back and fix that.)
Eventwindowcontext is a context that selects events that should be in
the window, but not counted towards the thresh. Note that thresh must
be less than or equal to eventwindow, if it is larger it will never be
met.

So our two categories for events are:

     1) ones where context (if present) is true
     2) ones where another context (eventwindowcontext) (if present) is true

Both event categories are counted as eventwindow events, but only
events in 1 are counted as thresh events.  So we have two counters:
thresh counter (tc) and eventwindow counter (ec) and things work as
follows:

When an event comes in:

  if event not matched by pattern
     skip the event

  if context is enclosed in []'s and is false
     skip the event

  # event matches pattern past here

    if neither context nor eventwindowcontext defined
       increment tc (treat as though context is true,
                     same as current operation)

    if context is defined and eventwindowcontext is not defined  *
        if context is true: increment tc
        if context is false: skip event
        
    if context is not defined and eventwindowcontext is defined  **
        increment tc and ec

    if both context and eventwindowcontext are defined
        if context is true and eventwindowcontext is any:
            increment ec and increment tc
        if context is false and eventwindowcontext is true:
            increment ec
        if context is false and eventwindowcontext is false:
            skip event
        if context is true and eventwindowcontext is true:
            well this is most likely an error, but context takes
            precedence so it increments ec and tc.

  
  if window != 0 and the time from the oldest event to the
           newest event exceeds 'window'                        *** 
     for each event older than current time-window:
               (note same as rules when ec = eventwindow)
       if the event incremented tc when it was added
          decrement both tc and ec by 1
          (unless ec is 0 which is its min value) 
       else (the oldest event must only have incremented ec)
           decrement only ec by 1

  if the event causes tc to equal thresh: trigger the action 
      and remain idle until current time is (oldest event time)+window
      (same as current)

  if the event causes ec to equal eventwindow: slide the window 
         by removing the oldest event (see note @@ above)
     if the oldest event incremented tc when it was added
        decrement both tc and ec by 1
        (unless ec is 0 which is its min value) 
     else (the oldest event must only have incremented ec)
         decrement only ec by 1


 * If eventwindow is defined in this case, I claim it's an error.
   If it is not defined, then this is the current threshold.

 ** Not defining both context and eventwindowcontext is probably an
    error as well.

 *** if window = 0, there is no time based window and this path
     is never run.
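
To make the bookkeeping above easier to poke at, here is a rough sketch of the tc/ec counters in Python. It is not SEC code: the class name EventWindowThreshold and its methods are invented for illustration, it assumes the pattern has already matched, and it assumes the action resets the rule when it fires (which the window=0 case needs anyway).

```python
from collections import deque


class EventWindowThreshold:
    """Sketch of the proposed tc/ec bookkeeping (hypothetical, not SEC)."""

    def __init__(self, thresh, eventwindow, window=0):
        self.thresh = thresh            # tc value that triggers the action
        self.eventwindow = eventwindow  # ec value that forces a slide
        self.window = window            # seconds; 0 means no time window
        self.events = deque()           # (timestamp, counted_toward_tc)
        self.tc = 0
        self.ec = 0

    def _drop_oldest(self):
        # Removing an event always lowers ec, and tc only if it counted.
        _, counted = self.events.popleft()
        self.ec -= 1
        if counted:
            self.tc -= 1

    def reset(self):
        self.events.clear()
        self.tc = self.ec = 0

    def add(self, timestamp, context, eventwindowcontext=False):
        """Feed one pattern-matching event; return True if the action fires."""
        if not context and not eventwindowcontext:
            return False                # in neither category: skip the event
        self.events.append((timestamp, bool(context)))
        self.ec += 1
        if context:
            self.tc += 1
        # Time based slide, only when a time window is configured.
        if self.window:
            while timestamp - self.events[0][0] > self.window:
                self._drop_oldest()
        if self.tc >= self.thresh:
            self.reset()                # assumption: the action resets the rule
            return True
        # Event based slide: drop the oldest event once ec hits eventwindow.
        if self.ec >= self.eventwindow:
            self._drop_oldest()
        return False


# Seven events, the second matching only eventwindowcontext.
rule = EventWindowThreshold(thresh=5, eventwindow=5)
hits = [True, False, True, True, True, True, True]
fired = [rule.add(n, hit, not hit) for n, hit in enumerate(hits)]
print(fired)
```

With thresh=5 and eventwindow=5 this sequence fires only on the seventh event. Like the pseudocode, the sketch slides as soon as ec reaches eventwindow, so the window effectively holds eventwindow-1 events between arrivals.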

Ok, now for a couple of examples.

So start with an example where we have 5 consecutive true values for
context, and window=0, eventwindow=5, thresh=5:

  event 1 context true (thresh count tc=1 eventwindow count ec=1)
  event 2 context true  (tc=2 ec=2)
  event 3 context true  (tc=3 ec=3)
  event 4 context true  (tc=4 ec=4)
  event 5 context true  (tc=5 ec=5)

at event 5, tc=thresh and the threshold rule executes the action.
Since window=0, the rule must be reset in the action otherwise it will
never fire again.

Now let's see what happens if we have a non context matching event at
2 with window=0, eventwindow=5, thresh=5:

  event 1 context true (tc=1 ec=1)
  event 2 does not match context, but matches eventwindowcontext (tc=1 ec=2)
  event 3 context true (tc=2 ec=3)
  event 4 context true (tc=3 ec=4)
  event 5 context true (tc=4 ec=5) window slide (tc=3, ec=4)
  event 6 context true (tc=4 ec=5) window slide (tc=4 ec=4)
  event 7 context true (tc=5 ec=5)

Now we see an event 2 that doesn't increment the thresh count, but
does increment the eventwindow count. When we reach event 5, tc is <
thresh, so an action isn't executed. But ec = eventwindow and just as
though we had exceeded a time window, the comparison window shifts
event 1 out and the counts are changed to tc=3, ec=4. Then event 6
comes in, tc is still < thresh and ec is once again equal to 5. So
shift event 2 out of the window.  Event 2 didn't match 'context' when
it was accepted, so we only end up decrementing ec and not tc so tc=4
and ec=4. Now event 7 comes in and tc=5 and ec=5 and the action
triggers.

What happens if window=0, eventwindow=5, thresh=3 (so we need 60% of
the events to trigger) with the 7 event sequence above:

  event 1 context true (tc=1 ec=1)
  event 2 context false, eventwindowcontext true (tc=1 ec=2)
  event 3 context true (tc=2 ec=3)
  event 4 context true (tc=3 ec=4)

At event 4 we fire the threshold action. Note that ec <
eventwindow. This is fine since regardless of what happens when we
reach ec=eventwindow we have met the threshold.

Ok, let's take the last two cases and run them with window=60.

Now let's see what happens if we have a non context matching event at
2 with window=60 eventwindow=5, thresh=5:

  t=0  event 1 context true (tc=1 ec=1)
  t=10 event 2 context false, eventwindowcontext true (tc=1 ec=2)
  t=15 event 3 context true (tc=2 ec=3)
  t=20 event 4 context true (tc=3 ec=4)
  t=30 event 5 context true (tc=4 ec=5) window slide (tc=3, ec=4) *
  t=71 event 6 context true (tc=4 ec=5) window slide (tc=4 ec=4) **
  t=72 event 7 context true (tc=5 ec=5)

When we reach event 5, ec = eventwindow and we shift event 1 away as
before.  Then event 6 comes in, and event 2 is the oldest. Because
time(event 6) - time(event 2) = 61 > window (60) we shift event 2
away. So the window slide at * was due to exceeding the eventwindow
size. But the window shift at ** was due to exceeding the timing
constraint.  As before event 2 didn't match 'context' when it was
accepted, so we only end up decrementing ec and not tc so tc=4 and
ec=4. Now event 7 comes in and the time constraint is ok, tc=5 and
ec=5 and the action triggers.

What happens if thresh=3 and eventwindow=5 (so we need 60% of the
events to trigger) with the 7 event sequence above, window=15, and
slightly different timing:

  t=0  event 1 context true (tc=1 ec=1)
  t=10 event 2 context false, eventwindowcontext true (tc=1 ec=2)
  t=31 event 3 context true (tc=2 ec=3) window slide (tc=1 ec=1)
  t=40 event 4 context true (tc=2 ec=2)
  t=47 event 5 context true (tc=3 ec=3) window slide (tc=1, ec=1)
  t=50 event 6 context true (tc=2 ec=2) 
  t=53 event 7 context true (tc=3 ec=3)

Events 1 and 2 proceed normally. At event 3, events 1 and 2 are
outside the window and are discarded leaving only event 3. Events 4
and 5 arrive, but the arrival of event 5 causes the window to be
exceeded and events 3 and 4 are discarded. Then events 6 and 7 arrive
so that 5, 6, 7 are all within the window and the action is triggered.

The pseudocode for the SingleWith2Thresholds rule is the same for the
first threshold, and the second threshold looks like the first
threshold algorithm up to the point where the values of tc and ec are
checked. The window is adjusted first and tc must be less than thresh.

  if the event causes ec to equal eventwindow, slide the window 
         by removing the oldest event
     if the oldest event incremented tc when it was added
        decrement both tc and ec by 1
        (unless ec is 0 which is its min value) 
     else (the oldest event must only have incremented ec)
         decrement only ec by 1

  if the event causes tc to drop below thresh: trigger the action
      and end correlation

so we change the window size before the threshold check.
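
A small sketch of that second-threshold step in Python, for the window=0 case only. The function name second_threshold_step and the thresh2 parameter are made up for illustration, and it assumes the event list is carried over from the first threshold phase rather than started empty.

```python
from collections import deque


def second_threshold_step(events, counted, eventwindow, thresh2):
    """Add one event, slide first, then test whether tc fell below thresh2.

    events  : deque of booleans, True if the event incremented tc
    counted : True if the new event matched 'context'
    """
    events.append(counted)
    # Slide the window before looking at the counts.
    if len(events) >= eventwindow:
        events.popleft()
    tc = sum(events)
    return tc < thresh2


# Assumed state left over from the first phase: four counted events.
carried = deque([True, True, True, True])
steps = [second_threshold_step(carried, False, eventwindow=5, thresh2=2)
         for _ in range(3)]
print(steps)
```

Here three events in a row that match only eventwindowcontext push the counted events out of the window, and the second action would trigger on the third.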

Quips, comments, evasions, questions or answers?


Also if this is implemented, then it should be possible to handle one
more correlation detection operation over a random amount of data:

  I want to detect a failure rate of more than 70% using a 5 minute window

or its equivalent:

  I want to detect a passing rate of less than 30% using a 5 minute window

Using a syntax like:

  type=singlewiththreshold
  window=300
  thresh=70%                        <-- note the % sign
  context = =($4 eq 'fail')         <-- increments tc and ec
  eventwindowcontext = =($4 eq 'pass')  <-- increments ec only

The failure rate is simply tc/ec (tc is never greater than ec if
eventwindowcontext is defined). However, there are some gotchas lurking here.

   What happens when your first event is a failure? 100% failure > 70%

Setting a minimum number of data points before the evaluation occurs
helps here. This also prevents trying to evaluate 0/0 8-).
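
As a rough illustration of the percentage idea with a minimum number of data points, here is a Python sketch. FailureRateMonitor and the min_events parameter are hypothetical names, not anything SEC provides; rate=0.70 stands in for thresh=70%.

```python
from collections import deque


class FailureRateMonitor:
    """Sketch: failure rate tc/ec over a time window (hypothetical)."""

    def __init__(self, window=300, rate=0.70, min_events=5):
        self.window = window          # seconds, like window=300
        self.rate = rate              # fraction of failures that triggers
        self.min_events = min_events  # data points needed before evaluating
        self.events = deque()         # (timestamp, failed)

    def add(self, timestamp, failed):
        self.events.append((timestamp, bool(failed)))
        # Discard events that have aged out of the time window.
        while timestamp - self.events[0][0] > self.window:
            self.events.popleft()
        ec = len(self.events)
        if ec < self.min_events:
            return False              # too few samples; also avoids 0/0
        tc = sum(1 for _, f in self.events if f)
        return tc / ec > self.rate


mon = FailureRateMonitor(window=300, rate=0.70, min_events=4)
alerts = [mon.add(t, f) for t, f in
          [(0, True), (10, True), (20, False), (30, True), (400, False)]]
print(alerts)
```

The first failure alone is 100% but does not alert, because fewer than min_events data points have been seen. The fourth event gives 3 failures out of 4, which is over 70%.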

--
                                -- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
