In message <aanlktimeb0zy73gsrorp4ivs_ws-cxpkw7ybpn3h1...@mail.gmail.com>,
Clayton Dukes writes:
>I'm trying to come up with a way to match Cisco IP SLA syslog events.
>
>Messages come in from 10 different IP SLA shadow routers that look similar
>to this:
>Jun  3 03:13:53 10.48.36.33 394491: 379185: Jun  3 03:13:52.982 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(3419): Threshold Occurred for connectionLoss
>Jun  3 03:13:54 10.48.36.37 330506: 331273: Jun  3 03:13:53.592 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(3242): Threshold Occurred for timeout
>Jun  3 03:13:54 10.48.36.37 330507: 331274: Jun  3 03:13:54.688 BST:
>%RTT-4-OPER_TIMEOUT: condition cleared, entry number = 3364
>Jun  3 03:13:54 10.48.36.39 331498: 332112: Jun  3 03:13:53.916 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(3263): Threshold Occurred for timeout
>Jun  3 03:13:55 10.48.36.37 330508: 331275: Jun  3 03:13:54.704 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(3364): Threshold Cleared for timeout
>Jun  3 03:13:56 10.48.36.39 331499: 332113: Jun  3 03:13:56.816 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(1398): Threshold exceeded for packetLossDS
>Jun  3 03:13:56 10.48.36.39 331500: 332114: Jun  3 03:13:56.916 BST:
>%RTT-4-OPER_TIMEOUT: condition occurred, entry number = 3321
>Jun  3 03:13:57 10.48.36.39 331501: 332115: Jun  3 03:13:56.932 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(3321): Threshold Occurred for timeout
>Jun  3 03:13:57 10.48.36.39 331502: 332116: Jun  3 03:13:57.812 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(1402): Threshold exceeded for packetLossDS
>Jun  3 03:14:01 10.48.36.42 184927: 185372: Jun  3 03:14:01.167 BST:
>%RTT-4-OPER_TIMEOUT: condition cleared, entry number = 4030
>Jun  3 03:14:02 10.48.36.42 184928: 185373: Jun  3 03:14:01.179 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(4030): Threshold Cleared for timeout
>Jun  3 03:14:07 10.48.36.37 330510: 331277: Jun  3 03:14:06.005 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(1442): Threshold below for packetLossDS
>Jun  3 03:14:07 10.48.36.42 184929: 185374: Jun  3 03:14:07.936 BST:
>%RTT-4-OPER_TIMEOUT: condition occurred, entry number = 4096
>Jun  3 03:14:08 10.48.36.33 394492: 379186: Jun  3 03:14:07.942 BST:
>%RTT-4-OPER_TIMEOUT: condition cleared, entry number = 3418
>Jun  3 03:14:08 10.48.36.33 394493: 379187: Jun  3 03:14:07.954 BST:
>%RTT-3-IPSLATHRESHOLD: IP SLAs(3418): Threshold Cleared for timeout
>
>For example:
>I need to find a "condition occurred, entry number = 4030" and match it to
>"condition cleared, entry number = 4030"
>The time frame doesn't matter - what does matter is that if I receive
>another "condition occurred, entry number = 4030" before I receive a clear
>for that probe number (4030 in this case), then that means I lost a syslog
>message somewhere (since it's impossible to get a new condition without a
>clear). In this case, I need to trigger an email.
>
>I need to do this for every device and unique probe # (there are thousands).

Where are the probe number and device identifier in the above syslog
messages? Once you have that, a pair rule and a single rule will work
fine.

Let's assume the alert and clear are these two messages:

alert:
>Jun  3 03:14:08 10.48.36.33 394492: 379186: Jun  3 03:14:07.942 BST:
   %RTT-4-OPER_TIMEOUT: condition cleared, entry number = 3418

clear:
>Jun  3 03:14:08 10.48.36.33 394493: 379187: Jun  3 03:14:07.954 BST:
   %RTT-3-IPSLATHRESHOLD: IP SLAs(3418): Threshold Cleared for timeout

They match because of the occurrence of 10.48.36.33 as the device
identifier and the entry/probe number 3418 occurring in both messages.
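
To make the matching key concrete, here is a small Python sketch (not SEC syntax) that pulls the device address and entry number out of the two sample messages. The first expression is the one used in the rules that follow; CLEAR_RE and event_key are names I made up for illustration, and the sample lines are the two messages above joined onto one line each.

```python
import re

IP = r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})"
ALERT_RE = re.compile(IP + r".*entry number = (\d+)")
# Hypothetical stand-alone version of the clear pattern, for illustration only.
CLEAR_RE = re.compile(IP + r".*SLAs\((\d+)\)")

ALERT = ("Jun  3 03:14:08 10.48.36.33 394492: 379186: Jun  3 03:14:07.942 BST: "
         "%RTT-4-OPER_TIMEOUT: condition cleared, entry number = 3418")
CLEAR = ("Jun  3 03:14:08 10.48.36.33 394493: 379187: Jun  3 03:14:07.954 BST: "
         "%RTT-3-IPSLATHRESHOLD: IP SLAs(3418): Threshold Cleared for timeout")


def event_key(line):
    """Return (device, entry) for an alert or clear line, else None."""
    for rx in (ALERT_RE, CLEAR_RE):
        m = rx.search(line)
        if m:
            return m.group(1), m.group(2)
    return None


print(event_key(ALERT))
print(event_key(CLEAR))
```

Both calls yield the same (device, entry) pair, which is what lets the two messages be tied together.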

First create a rule that emails if you see an alert while you are
still waiting for a clear for that alert.

  type = single
  desc = email when an alert is seen while one is asserted
  continue = takenext
  context = alert_$1_$2
  ptype = regexp
  rem = $1 is ip address, $2 is entry number
  pattern = ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*entry number = (\d+)
  action = pipe '$0' /bin/mail -s "Dropped syslog entry. Found alert while it was pending" root; delete alert_$1_$2; reset +1 match the alert for host $1 and event $2

This parses a line that matches the alert line above. It captures the
ip address and the entry number. It then uses that info to see if a
context called "alert_10.48.36.33_3418" (using the alert example
above) exists. If the context doesn't exist the rule is skipped and no
email is sent. If the context does exist:

   send email
   delete the context
   cancel the correlation started by the next rule (+1 is the next rule
       in the file) that has a key (desc field) that matches "match the
       alert for host 10.48.36.33 and event 3418". This allows the pair
       rule following to start a new correlation operation.
   pass the current event to the next rule (because of continue=takenext)

The next rule looks for the alert/clear pattern.

  type = pair
  desc = match the alert for host $1 and event $2
  ptype = regexp
  rem = same pattern as in single rule
  pattern = ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*entry number = (\d+)
  action = create alert_$1_$2
  desc2 = match the clear for host $1 and event $2
  ptype2 = regexp
  pattern2 = $1.*SLAs\($2\)
  rem = %1 and %2 are used to reference $1 and $2 from the first pattern
  rem = since this is run after pattern2 is matched, $1 and $2 are
  rem = reassigned by the pattern2 match
  action2 = delete alert_%1_%2; shellcmd do something else useful
  window = 0
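
To see what pattern2 looks like once the values captured by the first pattern are filled in, here is a Python sketch. It assumes the substitution is plain text replacement; build_clear_pattern is a hypothetical helper, not anything SEC provides, and the two test lines are clears taken from the original post.

```python
import re


def build_clear_pattern(host, entry):
    # Plain text substitution of $1 and $2 into:  $1.*SLAs\($2\)
    return re.compile(host + r".*SLAs\(" + entry + r"\)")


rx = build_clear_pattern("10.48.36.33", "3418")

same_probe = ("Jun  3 03:14:08 10.48.36.33 394493: 379187: Jun  3 03:14:07.954 BST: "
              "%RTT-3-IPSLATHRESHOLD: IP SLAs(3418): Threshold Cleared for timeout")
other_probe = ("Jun  3 03:14:02 10.48.36.42 184928: 185373: Jun  3 03:14:01.179 BST: "
               "%RTT-3-IPSLATHRESHOLD: IP SLAs(4030): Threshold Cleared for timeout")

print(bool(rx.search(same_probe)))   # clear for the same host and entry
print(bool(rx.search(other_probe)))  # clear for a different host and entry
```

In this Python version the dots in the address are not escaped, so they act as single-character wildcards, which is looser than a literal match.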

Ok, let's see how these work. Let's assume you have just started so
the context alert_10.48.36.33_3418 is not defined.

Scenario 1: the alert arrives. It matches the pair rule and starts a
correlation operation looking for the clear. Then the clear comes in,
it matches the correlation started by the pair rule and something
useful happens.

Scenario 2: the alert arrives. At this point the alert_10.48.36.33_3418
context is defined.  Its clear is dropped. Another alert arrives. It
now matches the first rule (type = single) and sends email. The single
rule resets the pair rule and the alert is now passed to the pair rule
where it creates the new alert_10.48.36.33_3418 context and waits for
a clear to arrive. If a clear arrives, the pair rule captures it (the
single rule ignores the clear since it doesn't match the pattern). If
another alert arrives, email is sent again, the old pair correlation
is cancelled, and a new pair correlation is opened.

If the single rule didn't reset the pair rule, the alert would be
consumed and ignored by the pair rule and would not start a new
correlation. Depending on what you want to do when the pair is
matched, resetting the pair correlation may or may not matter.
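
The two scenarios can be replayed with a small Python model of the rule pair. This is only a sketch of the logic described above, not how SEC works internally: the pending set stands in for the alert_<host>_<entry> contexts and the open pair operations, correlate is a made-up function, and the input lines are shortened, hypothetical versions of the sample messages.

```python
import re

IP = r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})"
ALERT_RE = re.compile(IP + r".*entry number = (\d+)")
CLEAR_RE = re.compile(IP + r".*SLAs\((\d+)\)")


def correlate(lines):
    """Return the actions that fire, in order, as (action, host, entry)."""
    pending = set()  # (host, entry) pairs still waiting for a clear
    fired = []
    for line in lines:
        m = ALERT_RE.search(line)
        if m:
            key = (m.group(1), m.group(2))
            if key in pending:
                # single rule: an alert arrived while one was still pending
                fired.append(("email",) + key)
                pending.discard(key)  # delete the context, reset the pair
            # pair rule: start waiting for the matching clear
            pending.add(key)
            continue
        m = CLEAR_RE.search(line)
        if m:
            key = (m.group(1), m.group(2))
            if key in pending:
                # pair rule, second half: the expected clear arrived
                pending.discard(key)
                fired.append(("cleared",) + key)
    return fired


alert = "10.48.36.33 %RTT-4-OPER_TIMEOUT: condition occurred, entry number = 3418"
clear = "10.48.36.33 %RTT-3-IPSLATHRESHOLD: IP SLAs(3418): Threshold Cleared for timeout"

print(correlate([alert, clear]))         # scenario 1
print(correlate([alert, alert, clear]))  # scenario 2: the first clear was lost
```

Scenario 1 produces only a "cleared" action; scenario 2 produces an "email" action for the repeated alert, followed by "cleared" when the clear finally shows up.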

>1.1.1.1, probe 1 should have matching pairs
>1.1.1.1, probe 2 should have a matching pair
>2.2.2.2, probe 3 should match
>etc.

Not sure how your "probes" relate to the message sequences you show
above, but hopefully this gives you some idea of how to handle it.

(Note I didn't test the rulesets above and they may be missing
required keywords, may not work on alternate Fridays etc. No bits were
harmed writing this message, although some brain cells were quite annoyed
8-).)

--
                                -- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
