"By "subsequent node down events" do you mean node down from the same node (duplicate event issue), or do you mean node down from nodes that are unreachable because one node went down (more of a topology problem)?"
Neither. The monitoring application only generates a duplicate if the
previous node down was cancelled by a node up (i.e., they always occur
in down/up pairs). Within the interval of time, the goal is to capture
any and all uncorrelated down events. They may or may not be
topologically related.

"What would a typical event sequence look like? I can see the following
with a 1 minute period to allow them to clear:

  node a down
  node b down
  node c down
  node a down (This duplicate would not occur without a prior "node a up")
  no activity for 30 seconds
  node a up
  no activity for 25 seconds
  node b up
  ARS notified that node a and node c are down
  node c up
  ARS notified that node C is back (ARS would only be updated once when
  all nodes return to normal status)"

This would be a correct sequence except for the corrections noted above.
Your thoughts about maintaining a state context and implementing a
counting mechanism may do the trick. I will experiment with this
approach to see where it gets me.

Thanks for the valuable input...

Art

-----Original Message-----
From: John P. Rouillard [mailto:rou...@cs.umb.edu]
Sent: Sunday, November 29, 2009 1:00 PM
To: Smolecki, Art (OET)
Cc: simple-evcorr-users@lists.sourceforge.net
Subject: Re: [Simple-evcorr-users] Correlation rules based on time and paired events

In message <04b8e93534ea994cad8c5ef5cad7ec1f38f1acb...@mnmail03.ead.state.mn.us>,
"Smolecki, Art (OET)" writes:

>I would like to process paired events (node down/node up)
>in the following manner:
> * Beginning with the first occurrence of a "node down"
>   event, create a context used to collect this and all
>   subsequent node down events within a predetermined time
>   interval.

By "subsequent node down events" do you mean node down from the same
node (duplicate event issue), or do you mean node down from nodes that
are unreachable because one node went down (more of a topology problem)?

If the former, using a pair rule will ignore (and consume) the original
"node down" event, so if you want to accumulate all of them you will
need to use three linked single rules:

  # first node down: start the 60 second window, report when it expires
  pattern = node (...) down
  context = ! node_down_$1
  action  = create node_down_$1 60 \
              ( report node_down_$1 pair_events_and_report .... ); \
            add node_down_$1 $0

  # further node down events while the window is open
  pattern = node (...) down
  context = node_down_$1
  action  = add node_down_$1 $0

  # node up events while the window is open
  pattern = node (...) up
  context = node_down_$1
  action  = add node_down_$1 $0

This will create/capture the node up/down events into the node_down_$1
context for a given $1 node. I am not sure how to manipulate the
context to eliminate a node down event for each node up event, so I put
that magic into the reporting script 8-) which can pair up the down/up
events and only report on unmatched events.

If you are looking to solve the topology problem, in general I don't
have a good solution. The ones I have come up with don't scale well.
However, using a SingleWithScript rule and running a script to query
hpov's topo database may work for your application.

> * Correlate the "node up" events to eliminate the
>   corresponding down events within the context (Pair
>   rule?).

Well, editing the contents of a context isn't well defined. You can do
it using context -> variable (copy) and variable -> context (fill)
assignments along with Perl functions (call or eval) but....

> * At some point in time, the context expires and reports
>   the contents to a script which opens a trouble ticket in
>   our ARS system (The idea is to group similar events
>   occurring in a relatively short time interval into a
>   single notification instead of reporting each event in
>   its own notification or trouble ticket).
>
> * For what remains in the context at the time of
>   reporting, continue correlating "node up" events until
>   all are determined to be up. When all are determined to
>   be up, a script would execute to close the trouble
>   ticket or send an "all clear" notification.
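
As a rough illustration of the pairing that the reporting script (pair_events_and_report above) would have to do, here is a minimal Python sketch. It is not an existing script: the function names are hypothetical, the event format "node <name> down|up" is assumed from the examples in this thread, and the output line is only a placeholder for whatever would actually open the ARS ticket. It relies on the `report` action feeding the context's events to the script on standard input, one per line.

```python
import re
from collections import Counter

# Assumed event format, taken from the example sequences in this thread.
EVENT_RE = re.compile(r"node (\S+) (down|up)\b", re.IGNORECASE)


def unmatched_down_nodes(lines):
    """Return the nodes whose "down" was not cancelled by a later "up".

    Each "up" cancels one earlier "down" for the same node. Nodes are
    returned in the order their first "down" was seen.
    """
    pending = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue  # ignore anything that is not a node event
        node = match.group(1).lower()
        state = match.group(2).lower()
        if state == "down":
            pending[node] += 1
        elif pending[node] > 0:
            pending[node] -= 1
    return [node for node, count in pending.items() if count > 0]


def format_report(lines):
    """Build the notification text, or None when there is nothing to report."""
    still_down = unmatched_down_nodes(lines)
    if not still_down:
        return None  # empty input or every down was paired with an up
    return "Nodes still down: " + ", ".join(still_down)
```

To use it as the report command, the script would call `format_report(sys.stdin)` and hand the result to the ticketing step only when it is not None.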

What would a typical event sequence look like? I can see the following
with a 1 minute period to allow them to clear:

  node a down
  node b down
  node c down
  node a down
  no activity for 30 seconds
  node a up
  no activity for 25 seconds
  node b up
  ARS notified that node a and node c are down
  node c up
  ARS notified that node C is back

At this point you still have an open ticket in ARS that node A is
down. Node b's down/up transition isn't in ARS at all and node c's
down/up transition is ticketed in ARS but is closed.

If that's what you are looking for then I think that can be done. The
only wrinkle is getting the two node a down events to correlate
against their respective node up events (if indeed there is a way to
do that).

> The issue I am running into is keeping state maintained as
> the initial "node down" context expires to allow the
> continuation of the "node up" correlation for any
> remaining down events.

Well, you could use two contexts, one indicating state and the other
containing the events:

  action = create node_down_$1 60; create waiting_for_all_node_up_events_$1

where waiting_for_all_node_up_events_$1 is deleted only when all the
node down events are matched by a node up event.

Hmm, what you may want to do is explicitly count the node events:
increment the count for node down and decrement it for node up. Maybe
using something like:

  # initial case: no pending up/down events
  pattern = node (...) down
  context = ! ( node_down_$1 && waiting_for_all_node_up_events_$1 )
  action  = create node_down_$1 60 \
              ( report node_down_$1 ... ); \
            add node_down_$1 $0; \
            eval %a ( ++$b{'$1'} ); \
            create waiting_for_all_node_up_events_$1

  # node down while node_down_$1 still exists
  pattern = node (...) down
  context = node_down_$1
  action  = add node_down_$1 $0; eval %a ( ++$b{'$1'} )

  # node up while node_down_$1 still exists
  pattern = node (...) up
  context = node_down_$1 && =($b{'$1'} > 0)
  action  = eval %a ( --$b{'$1'} )

  # node up while waiting for nodes but node_down_$1 has expired
  # and the count is greater than 1
  pattern = node (...) up
  context = waiting_for_all_node_up_events_$1 && =($b{'$1'} > 1)
  action  = eval %a ( --$b{'$1'} )

  # the count has reached 0
  pattern = node (...) up
  context = ! =($b{'$1'} > 1)
  action  = delete waiting_for_all_node_up_events_$1

to explicitly count the node up/down events? I am sure the rules above
are kind of screwy (I think there are a couple of states I am not
handling in the rules), but hopefully that will give somebody else or
you some ideas.

> Also, it is possible that upon
> expiration, the context will be empty which would require
> some type of sanity checking at the point in the rule
> where expiration occurs.

Or the report script can just ignore the request if it doesn't receive
anything on stdin.

> Has anyone implemented anything similar to this? Maybe
> using different rule logic? I would very much appreciate
> any feedback.

Well here is the feedback, and it's worth what you paid for it 8-).

--
				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
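
The counting approach in the rules above can be checked outside SEC with a small model. The Python sketch below is only that, a toy model for reasoning about the states: the class and method names are hypothetical, `window_expired` stands in for the expiry of node_down_$1, and it tracks one shared ticket across all nodes (the "updated once when all nodes return to normal" behaviour described in the reply) rather than the per-node contexts used in the rules.

```python
class OutageTracker:
    """Toy model of the down/up counting described above (hypothetical helper)."""

    def __init__(self):
        self.pending = {}         # node -> number of unmatched "down" events
        self.ticket_open = False  # True between window expiry and all clear

    def down(self, node):
        """Record a node down event."""
        self.pending[node] = self.pending.get(node, 0) + 1

    def up(self, node):
        """Cancel one unmatched down; return True if this closes the ticket."""
        if self.pending.get(node, 0) > 0:
            self.pending[node] -= 1
        if self.ticket_open and not self.still_down():
            self.ticket_open = False
            return True
        return False

    def window_expired(self):
        """End of the collection window: return the nodes to put in the ticket."""
        nodes = self.still_down()
        self.ticket_open = bool(nodes)  # no ticket if everything already cleared
        return nodes

    def still_down(self):
        """Nodes that currently have at least one unmatched down."""
        return sorted(n for n, count in self.pending.items() if count > 0)
```

Feeding it "a down, b down, c down, b up", then the window expiry, then "a up, c up" gives a ticket for a and c and a single all clear on the last event. An empty window yields an empty list from `window_expired`, which is the case where nothing should be sent to ARS.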