John,
The sketch you offered should accomplish most of what I want. The one
issue remaining is that the overflow is a counter rather than a gauge.
I am capturing socket overflows, dropped full socket buffers,
udpInOverflows and packet receive errors from netstat -s. I wrote the
monitor a number of years ago. I suppose I could just keep a local file
to cache the last counter, but I was hoping I wouldn't need to do that.
Is there a way to cache values from the previous run in a context?
The log traces are run every 5 minutes off of cron, and are a self
measure of performance (or lack thereof).
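
To make the counter-versus-gauge point concrete, here is a minimal sketch of turning a running counter into per-run deltas by remembering the previous reading. It is plain Python rather than SEC rule syntax. The class name CounterDelta is hypothetical, and treating a decrease as a counter reset is an assumption.

```python
class CounterDelta:
    """Turn a monotonically increasing counter into per-interval deltas."""

    def __init__(self):
        self.last = None  # previous raw reading; None until the first sample

    def update(self, raw):
        """Return the increase since the previous reading (None on first sample)."""
        previous, self.last = self.last, raw
        if previous is None:
            return None
        if raw < previous:
            # Assumption: a smaller value means the counter was reset
            # (for example after a reboot), so the new reading is the increase.
            return raw
        return raw - previous


gauge = CounterDelta()
readings = [244477, 244480, 244480, 12]
deltas = [gauge.update(r) for r in readings]
```

The threshold comparison would then be made against each delta instead of the raw counter value.
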
Thanks a lot for getting me started. I am fairly new to SEC and I
couldn't seem to get my head around the problem.
Regards,
Tim Peiffer
John P. Rouillard wrote:
> In message <[EMAIL PROTECTED]>,
> Tim Peiffer writes:
>
>> What is the proper use of eval and perl function
>> calls for comparisons?
>>
>> Consider the following log trace:
>> 61 msec sz 212456 rss 210232 sock ovfl 244477
>>
>> Given that the log lines have timings, process sizes and count of UDP
>> socket overflows, I wish to compare say the last 5 traces, and if all
>> have timings that exceed X msec, I wish to restart the service.
>>
>
> Case 1, no problem.
>
>
>> Similarly, if I receive more than Y socket overflows in the past 5
>> traces, I wish to restart.
>>
>
> This is Case 2. I will assume you mean that each of the prior 5
> consecutive log entries had a sock ovfl > Y, and not that the sum of
> the socket overflows for the prior 5 entries is > Y. The latter is
> left as an exercise for the reader (although a hint is given below).
>
>
>> And again, if the process size is greater than Z Mbyte, I wish to restart.
>>
>
> Case 3. Not a problem.
>
> One question: are the log lines emitted on a regularly timed basis, or
> do they arrive randomly so you can't say that 5 log lines arrive in 5
> minutes? For the example below I assume that the lines arrive
> randomly and we are counting consecutive occurrences.
>
> SEC's threshold windows are time based not event based so counting M
> matching events in a window of N events usually requires that the
> events arrive on a regular schedule so the N event window can be
> expressed as a period of time.
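
For the regular-schedule case, the conversion from an event count to a time window can be sketched as follows. This is plain Python, not SEC syntax. The helper name window_for_events and the 60 second slack for scheduling jitter are assumptions made for illustration.

```python
def window_for_events(n_events, interval_s, slack_s=60):
    """Seconds that span n_events regularly spaced events but not n_events + 1."""
    if not 0 <= slack_s < interval_s:
        raise ValueError("slack must be non-negative and smaller than the interval")
    # n events spaced interval_s apart cover (n - 1) intervals from first to last.
    return (n_events - 1) * interval_s + slack_s


# Five traces emitted every 5 minutes (300 seconds).
window = window_for_events(5, 300)
```

With traces every 300 seconds, five consecutive traces span 1200 seconds from first to last, so a window a little larger than that, but smaller than 1500, holds exactly five of them.
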
>
> So something like:
>
> ==============
> # case 3 first
> type=single
> desc= rule 1: size too large restart
> rem= use takenext so every event is analyzed 3 times:
> rem= size, time, overflows
> continue=takenext
> ptype=pattern
> rem= $1 is the time to run, $2 is the size,
> rem= and $4 is overflows
> rem= same pattern is used for every rule below because I am lazy
> pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
> rem= if size greater than 512000 bytes restart
> context = =($2 > 512000)
> action = shellcmd /etc/init.d/program restart
>
> # case 1
> type=singlewiththreshold
> desc= rule 2: each of the last 5 consecutive runs took too long
> continue=takenext
> rem= I don't remember if 0 is allowed here. If not we don't want the
> rem= window to expire and slide, so set to a really large number.
> window=0
> thresh=5
> ptype=pattern
> pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
> context = =($1 > X)
> rem= restart the program and reset the rule, otherwise the rule won't
> rem= fire again until the window runs out (which is never).
> action = shellcmd /etc/init.d/program restart; reset 0 %s
>
> type=single
> desc= rule 3: timing ok for this run, reset threshold test
> continue=takenext
> ptype=pattern
> pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
> context = =($1 <= X)
> rem= reset the prior single with threshold as we have received
> rem= a log entry in which the timing is ok, and the 5 count
> rem= above needs to start from 0 again.
> action = reset -1 rule 2: each of the last 5 consecutive runs took too long
>
> # case 2
> # see comments for prior rule pair and apply below. Same idea
> # just different parameter
> type=singlewiththreshold
> desc= rule 4: each of last 5 consecutive runs had more than Y overflows
> continue=takenext
> window=0
> thresh=5
> ptype=pattern
> pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
> context = =($4 > Y)
> action = shellcmd /etc/init.d/program restart; reset 0 %s
>
> # note no continue=takenext, this consumes the event.
> type=single
> desc= overflows ok for this run, reset threshold test
> ptype=pattern
> pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
> context = =($4 <= Y)
> action = reset -1 rule 4: each of last 5 consecutive runs had more than Y overflows
> ============
>
> If you need to sum the socket overflows over the prior 5 events and
> restart only if the sum is > Y, a context like:
>
> =(unshift @overflow, $4; $#overflow=4; $sum=0; map {$sum+=$_} @overflow; return $sum > Y)
>
> may work. What this does is add the new overflow value ($4) at the
> front of the @overflow array. Then it removes the 6th or larger
> element in the array (i.e. anything with index > 4). Then use map to
> sum all the elements in the array and compare the sum against Y.
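
The same keep-the-last-five-and-sum idea, sketched in Python purely as an illustration of the logic (the context above is Perl). The function name make_sum_check and the sample numbers are hypothetical.

```python
from collections import deque


def make_sum_check(limit, size=5):
    """Return a function reporting whether the last `size` values sum above `limit`."""
    recent = deque(maxlen=size)  # the oldest value falls off once `size` is exceeded

    def check(value):
        recent.append(value)
        return sum(recent) > limit

    return check


check = make_sum_check(limit=10)
results = [check(v) for v in [1, 2, 3, 4, 5, 0, 0]]
```

Unlike the `$#overflow=4` trick, the deque never pads the window with empty slots, so the first few sums cover only the values seen so far.
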
>
> Tim can probably stop reading now as the following is a discussion on
> how to make this easier and more flexible in the case where the events
> you want to count don't arrive on a regular schedule. I have been
> fighting this issue for a while (since just after the 2.0 release of
> SEC) and come up with some very hairy multi-rule correlations that
> have caused me to lose even more of my hair (so no, I won't share
> them). While I think it can be reduced to a couple of rules and some
> perl functions, it's still messy and requires keeping multiple rules
> in sync, and it seems like something that the threshold rules should
> support natively.
>
> From this example there are a couple of places SEC could be made better:
>
> 1) support a window value of 0 for threshold operations (if not
> already supported)
>
> so that the window is infinite and never slides.
>
> What would also be nice is some way to have a window that is not time
> based but event based. One idea I had was to allow the threshold rule
> to count two different categories of events. To do this we add an
> eventwindowcontext parameter and an eventwindow parameter:
>
> type=singlewiththreshold
> desc= rule 4: each of last 5 consecutive runs had more than Y overflows
> continue=takenext
> window=0
> thresh=5
> ptype=pattern
> pattern= ([0-9]+) msec sz ([0-9]+) rss ([0-9]+) sock ovfl ([0-9]+)
> context = =($4 > Y)
> eventwindow=5
> eventwindowcontext = =($4 <= Y)
>
> where eventwindow is the number of events that should be in the window
> where thresh is counted. (Note @@, it's late and the code and
> examples below actually implement the number of events in the window
> as eventwindow-1. Sorry, but I am not going to go back and fix that.)
> Eventwindowcontext is a context that selects events that should be in
> the window, but not counted towards the thresh. Note that thresh must
> be less than or equal to eventwindow, if it is larger it will never be
> met.
>
> So our two categories for events are:
>
> 1) ones where context (if present) is true
> 2) ones where another context (eventwindowcontext) (if present) is true
>
> Both event categories are counted as eventwindow events, but only
> events in 1 are counted as thresh events. So we have two counters:
> thresh counter (tc) and eventwindow counter (ec) and things work as
> follows:
>
> When an event comes in:
>
>   if event not matched by pattern
>      skip the event
>
>   if context is enclosed in []'s and is false
>      skip the event
>
>   # event matches pattern past here
>
>   if neither context nor eventwindowcontext defined
>      increment tc (treat as though context is true,
>                    same as current operation)
>
>   if context is defined and eventwindowcontext is not defined *
>      if context is true: increment tc
>      if context is false: skip event
>
>   if context is not defined and eventwindowcontext is defined **
>      increment tc and ec
>
>   if both context and eventwindowcontext are defined
>      if context is true and eventwindowcontext is any:
>         increment ec and increment tc
>      if context is false and eventwindowcontext is true:
>         increment ec
>      if context is false and eventwindowcontext is false:
>         skip event
>      if context is true and eventwindowcontext is true:
>         well this is most likely an error, but context takes
>         precedence so it increments ec and tc.
>
>
>   if window != 0 and the time from the oldest event to the
>      newest event exceeds 'window' ***
>      for each event older than current time-window:
>         (note same as rules when ec = eventwindow)
>         if the event incremented tc when it was added
>            decrement both tc and ec by 1
>            (unless ec is 0 which is its min value)
>         else (the oldest event must only have incremented ec)
>            decrement only ec by 1
>
>   if the event causes tc to equal thresh: trigger the action
>      and remain idle until current time is (oldest event time)+window
>      (same as current)
>
>   if the event causes ec to equal eventwindow: slide the window
>      by removing the oldest event (see note @@ above)
>      if the oldest event incremented tc when it was added
>         decrement both tc and ec by 1
>         (unless ec is 0 which is its min value)
>      else (the oldest event must only have incremented ec)
>         decrement only ec by 1
>
>
> * If eventwindow is defined in this case, I claim it's an error.
> If it is not defined, then this is the current threshold.
>
> ** Not defining both context and eventwindowcontext is probably an
> error as well.
>
> *** if window = 0, there is no time based window and this path
> is never run.
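
The bookkeeping described above can be sketched in Python for the case where both contexts are defined. This is a hypothetical model of the proposal, not SEC code: the class name EventWindowThreshold is made up, the two contexts are passed in as already-evaluated booleans, and clearing the stored events when the action fires is an assumption standing in for a reset in the action.

```python
from collections import deque


class EventWindowThreshold:
    """Sketch of the proposed tc/ec bookkeeping (hypothetical, not SEC code)."""

    def __init__(self, thresh, eventwindow, window=0):
        self.thresh = thresh
        self.eventwindow = eventwindow
        self.window = window      # 0 means there is no time-based window
        self.events = deque()     # (timestamp, counted_toward_thresh)

    @property
    def tc(self):
        return sum(1 for _, counted in self.events if counted)

    @property
    def ec(self):
        return len(self.events)

    def add(self, context, eventwindowcontext=False, now=0):
        """Record one pattern-matching event; return True if the action fires."""
        if not context and not eventwindowcontext:
            return False          # neither context is true: skip the event
        if self.window:
            # Time-based slide: drop events older than now - window.
            while self.events and now - self.events[0][0] > self.window:
                self.events.popleft()
        self.events.append((now, bool(context)))
        if self.tc >= self.thresh:
            self.events.clear()   # assumption: the action resets the rule
            return True
        if self.ec >= self.eventwindow:
            self.events.popleft() # event-based slide: drop the oldest event
        return False


# Event 2 matches only eventwindowcontext; every other event matches context.
rule = EventWindowThreshold(thresh=5, eventwindow=5)
sequence = [(True, False), (False, True)] + [(True, False)] * 5
fired = [rule.add(ctx, ewc) for ctx, ewc in sequence]
```

Keeping one flag per stored event is what lets a slide decrement tc only when the departing event had counted toward the threshold.
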
>
> Ok, now for a couple of examples.
>
> So start with an example where we have 5 consecutive true values for
> context, and window=0, eventwindow=5, thresh=5:
>
> event 1 context true (thresh count tc=1 eventwindow count ec=1)
> event 2 context true (tc=2 ec=2)
> event 3 context true (tc=3 ec=3)
> event 4 context true (tc=4 ec=4)
> event 5 context true (tc=5 ec=5)
>
> at event 5, tc=thresh and the threshold rule executes the action.
> Since window=0, the rule must be reset in the action otherwise it will
> never fire again.
>
> Now let's see what happens if we have a non context matching event at
> 2 with window=0, eventwindow=5, thresh=5:
>
> event 1 context true (tc=1 ec=1)
> event 2 does not match context, but matches eventwindowcontext (tc=1 ec=2)
> event 3 context true (tc=2 ec=3)
> event 4 context true (tc=3 ec=4)
> event 5 context true (tc=4 ec=5) window slide (tc=3, ec=4)
> event 6 context true (tc=4 ec=5) window slide (tc=4 ec=4)
> event 7 context true (tc=5 ec=5)
>
> Now we see an event 2 that doesn't increment the thresh count, but
> does increment the eventwindow count. When we reach event 5, tc is <
> thresh, so an action isn't executed. But ec = eventwindow and just as
> though we had exceeded a time window, the comparison window shifts
> event 1 out and the counts are changed to tc=3, ec=4. Then event 6
> comes in, tc is still < thresh and ec is once again equal to 5. So
> shift event 2 out of the window. Event 2 didn't match 'context' when
> it was accepted, so we only end up decrementing ec and not tc so tc=4
> and ec=4. Now event 7 comes in and tc=5 and ec=5 and the action
> triggers.
>
> What happens if window=0, eventwindow=5, thresh=3 (so we need 60% of
> the events to trigger) with the 7 event sequence above:
>
> event 1 context true (tc=1 ec=1)
> event 2 context false, eventwindowcontext true (tc=1 ec=2)
> event 3 context true (tc=2 ec=3)
> event 4 context true (tc=3 ec=4)
>
> At event 4 we fire the threshold action. Note that ec <
> eventwindow. This is fine since regardless of what happens when we
> reach ec=eventwindow we have met the threshold.
>
> Ok, let's take the last two cases and run them with window=60.
>
> Now let's see what happens if we have a non context matching event at
> 2 with window=60 eventwindow=5, thresh=5:
>
> t=0 event 1 context true (tc=1 ec=1)
> t=10 event 2 context false, eventwindowcontext true (tc=1 ec=2)
> t=15 event 3 context true (tc=2 ec=3)
> t=20 event 4 context true (tc=3 ec=4)
> t=30 event 5 context true (tc=4 ec=5) window slide (tc=3, ec=4) *
> t=71 event 6 context true (tc=4 ec=5) window slide (tc=4 ec=4) **
> t=72 event 7 context true (tc=5 ec=5)
>
> When we reach event 5, ec = eventwindow and we shift event 1 away as
> before. Then event 6 comes in, and event 2 is the oldest. Because
> time(event 6) - time(event 2) = 61 > window (60) we shift event 2
> away. So the window slide at * was due to exceeding the eventwindow
> size. But the window shift at ** was due to exceeding the timing
> constraint. As before event 2 didn't match 'context' when it was
> accepted, so we only end up decrementing ec and not tc so tc=4 and
> ec=4. Now event 7 comes in and the time constraint is ok, tc=5 and
> ec=5 and the action triggers.
>
> What happens if thresh=3 and eventwindow=5 (so we need 60% of the
> events to trigger) with the 7 event sequence above and window=15 and a bit
> different timing:
>
> t=0 event 1 context true (tc=1 ec=1)
> t=10 event 2 context false, eventwindowcontext true (tc=1 ec=2)
> t=31 event 3 context true (tc=2 ec=3) window slide (tc=1 ec=1)
> t=40 event 4 context true (tc=2 ec=2)
> t=47 event 5 context true (tc=3 ec=3) window slide (tc=1, ec=1)
> t=50 event 6 context true (tc=2 ec=2)
> t=53 event 7 context true (tc=3 ec=3)
>
> Events 1 and 2 proceed normally. At event 3, events 1 and 2 are
> outside the window and are discarded leaving only event 3. Events 4
> and 5 arrived, but the arrival of event 5 causes the window to be
> exceeded and events 3 and 4 are discarded. Then events 6 and 7 arrive
> so that 5, 6, 7 are all within the window and the action is triggered.
>
> The pseudocode for the SingleWith2Thresholds rule is the same for the
> first threshold, and the second threshold looks like the first
> threshold algorithm up to the point where the values of tc and ec are
> checked. The window is adjusted first and tc must be less than thresh.
>
>   if the event causes ec to equal eventwindow, slide the window
>      by removing the oldest event
>      if the oldest event incremented tc when it was added
>         decrement both tc and ec by 1
>         (unless ec is 0 which is its min value)
>      else (the oldest event must only have incremented ec)
>         decrement only ec by 1
>
>   if the event causes tc to drop below thresh: trigger the action
>      and end correlation
>
> so we change the window size before the threshold check.
>
> Quips, comments, evasions, questions or answers?
>
>
> Also if this is implemented, then it should be possible to handle one
> more correlation detection operation over a random amount of data:
>
> I want to detect a failure rate of more than 70% using a 5 minute window
>
> or its equivalent:
>
> I want to detect a passing rate of less than 30% using a 5 minute window
>
> Using a syntax like:
>
> type=singlewiththreshold
> window=300
> thresh=70% <-- note the % sign
> context = =($4 eq 'fail') <-- increments tc and ec
> eventwindowcontext = =($4 eq 'pass') <-- increments ec only
>
> The failure rate is simply tc/ec (tc is always less than or equal to ec if
> eventwindowcontext is defined). However there are some gotchas lurking here.
>
> What happens when your first event is a failure? 100% failure > 70%
>
> Setting a minimum number of data points before the evaluation occurs
> helps here. This also prevents trying to evaluate 0/0 8-).
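
A minimal sketch of the percentage check with a minimum number of data points, again in Python and not in SEC syntax. The class name FailureRate and the min_events parameter with its default of 5 are assumptions. The 300 second window and the 70% limit are taken from the example above.

```python
from collections import deque


class FailureRate:
    """Track pass/fail events in a time window and test the failure rate."""

    def __init__(self, window=300, limit=0.70, min_events=5):
        self.window = window
        self.limit = limit
        self.min_events = min_events  # hypothetical guard against tiny samples
        self.events = deque()         # (timestamp, failed)

    def add(self, now, failed):
        """Record one event; return True if the failure rate exceeds the limit."""
        self.events.append((now, failed))
        # Drop events that have aged out of the time window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        ec = len(self.events)
        if ec < self.min_events:
            return False              # too few data points to judge
        tc = sum(1 for _, f in self.events if f)
        return tc / ec > self.limit


monitor = FailureRate()
samples = [(0, True), (60, True), (120, False), (180, True), (240, True), (700, True)]
alerts = [monitor.add(t, failed) for t, failed in samples]
```

The first failure alone no longer triggers anything, and the last sample arrives after the earlier ones have aged out, so it is judged as a single data point.
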
>
> --
> -- rouilj
> John Rouillard
> ===========================================================================
> My employers don't acknowledge my existence much less my opinions.
>
_______________________________________________
Simple-evcorr-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users