Re: [Simple-evcorr-users] Correlation rules based on time and paired events

Smolecki, Art (OET) Sun, 29 Nov 2009 11:57:51 -0800

"By "subsequent node down events" do you mean node down from
the same node (duplicate event issue), or do you mean node
down from nodes that are unreachable because one node went down
(more of a topology problem)?"

Neither. The monitoring application only generates a duplicate if the previous 
node down was cancelled by a node up (i.e., they always occur in down/up 
pairs). Within the interval of time, the goal is to capture any and all 
uncorrelated down events. They may or may not be topologically related.

"What would a typical event sequence look like? I can see the
following with a 1 minute period to allow them to clear:

  node a down
  node b down
  node c down
  node a down (This duplicate would not occur without a prior "node a up")
  no activity for 30 seconds
  node a up
  no activity for 25 seconds
  node b up
  ARS notified that node a and node c are down
  node c up
  ARS notified that node C is back (ARS would only be updated once when all
  nodes return to normal status)"

This would be a correct sequence except for the corrections noted above.

Your thoughts about maintaining a state context and implementing a counting 
mechanism may do the trick. I will experiment with this approach to see where 
it gets me.

Thanks for the valuable input...

Art

-----Original Message-----
From: John P. Rouillard [mailto:rou...@cs.umb.edu] 
Sent: Sunday, November 29, 2009 1:00 PM
To: Smolecki, Art (OET)
Cc: simple-evcorr-users@lists.sourceforge.net
Subject: Re: [Simple-evcorr-users] Correlation rules based on time and paired 
events 

In message <04b8e93534ea994cad8c5ef5cad7ec1f38f1acb...@mnmail03.ead.state.mn.us>,
"Smolecki, Art (OET)" writes:

>I would like to process paired events (node down/node up)
>in the following manner:
> * Beginning with the first occurrence of a "node down"
>   event, create a context used to collect this and all
>   subsequent node down events within a predetermined time
>   interval.

By "subsequent node down events" do you mean node down from
the same node (duplicate event issue), or do you mean node
down from nodes that are unreachable because one node went down
(more of a topology problem)?

If the former, using a pair rule will ignore (and consume)
the original "node down" event, so if you want to accumulate
all of them you will need to use three linked single
rules:

  pattern= node (...) down
  context = ! node_down_$1
  action = create node_down_$1 60 \
             ( report node_down_$1 pair_events_and_report .... ); \
           add node_down_$1 $0

  pattern= node (...) down
  context = node_down_$1
  action = add node_down_$1 $0

  pattern= node (...) up
  context = node_down_$1
  action = add node_down_$1 $0

this will create/capture the node up/down events into the
node_down_$1 context for a given $1 node. I am not sure how
to manipulate the context to eliminate a node down event for
each node up event, so I put that magic into the reporting
script 8-) which can pair up the down/up events and only
report on unmatched events.
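
As a rough illustration of the pairing the reporting script would have
to do, here is a minimal Python sketch. It assumes the context's event
store arrives one event per line in the form "node <name> down" or
"node <name> up" (SEC's report action pipes the event store to the
command's standard input). The function name and the line format are
made up for this example; they are not part of SEC.

```python
import re
from collections import Counter

# Assumed line format: "node <name> down" or "node <name> up".
EVENT_RE = re.compile(r"node (\S+) (down|up)\b")


def unmatched_down_events(lines):
    """Return {node: count} of down events not cancelled by a later up."""
    pending = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue  # ignore anything that is not a node event
        node, state = m.groups()
        if state == "down":
            pending[node] += 1
        elif pending[node] > 0:
            pending[node] -= 1  # an up cancels one earlier down
    return {node: n for node, n in pending.items() if n > 0}
```

A real script would call unmatched_down_events(sys.stdin) and open the
ticket only for the nodes it returns.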

If you are looking to solve the topology problem, in general
I don't have a good solution. The ones I have come up with
don't scale well. However, using a SingleWithScript rule and
running a script to query hpov's topo database may work
for your application.

> * Correlate the "node up" events to eliminate the
>   corresponding down events within the context (Pair
>   rule?).

Well editing the contents of a context isn't well defined.
You can do it using context -> variable (copy) and variable
-> context (fill) assignments along with perl functions
(call or eval) but....

> * At some point in time, the context expires and reports
>   the contents to a script which opens a trouble ticket in
>   our ARS system (The idea is to group similar events
>   occurring in a relatively short time interval into a
>   single notification instead of reporting each event in
>   its own notification or trouble ticket).
>
> * For what remains in the context at the time of
>   reporting, continue correlating "node up" events until
>   all are determined to be up. When all are determined to
>   be up, a script would execute to close the trouble
>   ticket or send an "all clear" notification.

What would a typical event sequence look like? I can see the
following with a 1 minute period to allow them to clear:

  node a down
  node b down
  node c down
  node a down
  no activity for 30 seconds
  node a up
  no activity for 25 seconds
  node b up
  ARS notified that node a and node c are down
  node c up
  ARS notified that node C is back

At this point you still have an open ticket in ARS that node
A is down. Node b's down/up transition isn't in ARS at all
and node c's down/up transition is ticketed in ARS but is
closed.

If that's what you are looking for then I think that can be
done, the only wrinkle is getting the two node a down events
to correlate against their respective node up events (if
indeed there is a way to do that).

> The issue I am running into is keeping state maintained as
> the initial "node down" context expires to allow the
> continuation of the "node up" correlation for any
> remaining down events.

Well you could use two contexts, one indicating state and
the other containing the events.

  action = create node_down_$1 60; \
           create waiting_for_all_node_up_events_$1

where waiting_for_all_node_up_events_$1 is deleted only when
all the node down events are matched by a node up event.

Hmm, what you may want to do is explicitly count the node
events: increment the count for each node down and decrement
it for each node up.

Maybe using something like:

  # initial case no pending up/down events
  pattern= node (...) down
  context = ! ( node_down_$1 && waiting_for_all_node_up_events_$1)
  action = create node_down_$1 60 \
                ( report node_down_$1 ... ); \
           add node_down_$1 $0; \
           eval %a ( ++$b{'$1'} ); \
           create waiting_for_all_node_up_events_$1

  # node down while node_down_$1 still exists
  pattern= node (...) down
  context = node_down_$1
  action = add node_down_$1 $0; eval %a ( ++$b{'$1'} )

  # node up while node_down_$1 still exists
  pattern = node (...) up
  context = node_down_$1 && =($b{'$1'} > 0)
  action = eval %a ( --$b{'$1'} )

  # node up while waiting for nodes but node_down_$1 has expired
  # and the count is greater than 1
  pattern = node (...) up
  context = waiting_for_all_node_up_events_$1 && =($b{'$1'} > 1)
  action = eval %a ( --$b{'$1'} )

  # the count has reached 0
  pattern = node (...) up
  context = ! =($b{'$1'} > 1)
  action = delete waiting_for_all_node_up_events_$1

to explicitly count the node up/down events? I am sure the
rules above are kind of screwy (I think there are a couple
of states I am not handling in the rules), but hopefully
that will give somebody else or you some ideas.
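
To make the counting idea easier to reason about outside SEC, here is
a small Python model of it combined with the one-minute collection
window. This is not SEC syntax and says nothing about how SEC keeps
state; the function name, the (seconds, node, state) tuple format and
the message strings are all assumptions for illustration. It handles
no more states than the rules above do: a node down that arrives while
a ticket is already open is only counted, not reported.

```python
def simulate(events, window=60):
    """Model the collect-then-report idea.

    events: list of (seconds, node, state) tuples sorted by time,
            where state is "down" or "up".
    Returns a list of (seconds, message) notifications.
    """
    notes = []
    window_end = None    # expiry time of the collecting context
    pending = {}         # node -> count of unmatched down events
    ticket_open = False  # True after a report, until all nodes are up

    def expire(now):
        nonlocal window_end, ticket_open
        down = sorted(n for n, c in pending.items() if c > 0)
        if down:  # an empty context produces no ticket
            notes.append((now, "down: " + ", ".join(down)))
            ticket_open = True
        window_end = None

    for t, node, state in events:
        if window_end is not None and t >= window_end:
            expire(window_end)
        if state == "down":
            if window_end is None and not ticket_open:
                window_end = t + window  # first down starts the window
            pending[node] = pending.get(node, 0) + 1
        elif pending.get(node, 0) > 0:
            pending[node] -= 1  # an up cancels one earlier down
            if ticket_open and not any(pending.values()):
                notes.append((t, "all clear"))
                ticket_open = False
    if window_end is not None:
        expire(window_end)
    return notes


# The example sequence without the duplicate "node a down".
sequence = [(0, "a", "down"), (0, "b", "down"), (0, "c", "down"),
            (30, "a", "up"), (55, "b", "up"), (70, "c", "up")]
print(simulate(sequence))  # [(60, 'down: c'), (70, 'all clear')]
```

With the duplicate "node a down" added back in, the model reports
"down: a, c" at 60 seconds and never sends the all clear, which is
the open-ticket situation described earlier in this message.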

> Also, it is possible that upon
> expiration, the context will be empty which would require
> some type of sanity checking at the point in the rule
> where expiration occurs.

Or the report script can just ignore the request if it
doesn't receive anything on stdin.
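
A minimal sketch of that check, again in Python and again with made-up
names: build_report returns nothing when the context was empty, so the
caller can exit without opening a ticket. The report text is only a
placeholder for whatever the ARS interface expects.

```python
import sys


def build_report(lines):
    """Return the report text, or None when the context was empty."""
    events = [line.strip() for line in lines if line.strip()]
    if not events:
        return None  # nothing was collected, so skip the ticket
    return "Nodes down:\n" + "\n".join(events)


def main(stream=sys.stdin):
    report = build_report(stream)
    if report is None:
        return 0  # empty context: exit quietly
    print(report)  # placeholder for the call that opens the ticket
    return 0
```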

> Has anyone implemented anything similar to this? Maybe
> using different rule logic? I would very much appreciate
> any feedback.

Well here is the feedback, and it's worth what you paid for
it 8-).

--
                                -- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.

