Re: Search expression spanning multiple lines

Andrew Long Thu, 30 Oct 2008 10:30:55 -0700

On 27 Oct 2008, at 14:48, SysAdm wrote:

>
> Hi Andrew,
> Sure!  Here is a clipping (with identifiable info changed) that
> contains a valid, delivered email, some spam and a timesheet record.
>
I'm afraid that I'm going to have to admit defeat on this one. I've
been fighting with it for the last few nights, and I can't find a way to
do it. At a couple of points I thought I had a complicated solution, but
they all fell over under different test cases.


My initial suggestion about non-greedy falls over because of a little
gotcha documented in ':he non-greedy' (extracted below)

*non-greedy*
If a "-" appears immediately after the "{", then a shortest match
first algorithm is used (see example below).  In particular, "\{-}" is
the same as "*" but uses the shortest match first algorithm.  BUT: A
match that starts earlier is preferred over a shorter match: "a\{-}b"
matches "aaab" in "xaaab".

This means that the match will always start at the earliest start point,
and not stop until it finds the first end point. What we need for this
solution to work is a 'bulimic' match operator that prefers the latest,
rather than the earliest, start point before each stop point.
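
The same rule can be checked outside Vim. The sketch below uses Python's
re module, on the assumption that its lazy quantifier "*?" behaves like
Vim's "\{-}" in this respect; the START/END text is made up for
illustration and is not taken from the log.

```python
import re

# A lazy quantifier picks the shortest match from a given start position,
# but the engine still tries the leftmost start position first.
leftmost = re.search(r"a*?b", "xaaab").group(0)

# Made-up multi-line text: two candidate start points before one end point.
text = "START one\nSTART two\nEND\n"

# The match begins at the FIRST "START", not the one nearest to "END".
span = re.search(r"START.*?END", text, re.S).group(0)

print(repr(leftmost))
print(repr(span))
```

The second search is the situation described above: the match runs from
the earliest start point to the first end point, swallowing the later
start point on the way.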

My next thought was to use the zero-width match operators (':he
zero-width'). My idea was to use a start pattern identifying the 'mail
from' header, matching non-greedily up to the SMTP 354 message, then use
the zero-width non-matching operator to locate transactions that don't
output an SMTP 250 message. This falls down because you can usually find
a point after the 354 where the 250 doesn't match, even if there's a
match a line or so later.

The complicated solution here would be to join the 354 match to the !250
match with a repeated group of all the possible lines between the two
messages. In the 'simple' case this involves lines detailing the message
file created, and the number of bytes transferred. In the 'complicated'
case you have to cater for the anti-virus and anti-spam scanning that might
be going on as well.

But even using * or \+ on the repeating group didn't work - they're not
quite greedy enough, and the zero-width operator stops on the line
before the 250 message, leaving us with yet more false positives.
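
The give-back can be reproduced in miniature. This is only a sketch in
Python's re module, not the Vim pattern itself: the log lines are
hypothetical and heavily simplified (no timestamps), and "(?!...)" stands
in for the zero-width non-matching operator.

```python
import re

# Hypothetical, simplified transactions (not taken from the real log).
delivered = ("--> 354 go ahead\n"
             "message file created\n"
             "message 1024 bytes\n"
             "--> 250 ok\n")
rejected = ("--> 354 go ahead\n"
            "message file created\n"
            "message 1024 bytes\n"
            "socket error\n")

# 354 line, any number of 'message' lines, then "not a 250 line".
loose = re.compile(r"^--> 354.*\n(?:message.*\n)*(?!--> 250)", re.M)

# When the lookahead fails at the 250 line, the repeated group gives back
# one 'message' line and the lookahead then succeeds one line earlier.
false_positive = loose.search(delivered)

# Assumption for this toy case only: also refusing to stop in front of a
# 'message' line removes the earlier stopping points.
strict = re.compile(r"^--> 354.*\n(?:message.*\n)*(?!message|--> 250)", re.M)

print(false_positive.group(0))
print(strict.search(delivered), strict.search(rejected) is not None)
```

In this toy case the stricter lookahead rejects the delivered transaction
and still finds the failed one. Whether the same trick survives the real
log, with its scanner output between the 354 and the 250, is untested.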

Here's my attempt at a simple pattern (this IS going to wrap, I'm
afraid)

/^.\{23}:\s\+<--\s\+mail\s\+from:\s*<timesheet\>\_.\{-}\n.\{23}:\s\+-->\s\+354.*\n\%(.\{23}:\s\+message\>.*\)*\n\%(.\{23}:\s\+-->\s\+250\)[EMAIL PROTECTED]

See what I mean about 'simple'? Not exactly a pattern that trips off the
tongue (or even fingers!) and this is without making sure that those
first 23 characters on each line are in fact a time stamp.

My only other thought was to write a syntax file for the log, which
would let you highlight things like the socket errors as Error, and then
just look for the timesheet addresses which are followed by an Error.

regards, Andy

-- 
Andrew Long
andrew dot long at mac dot com


--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
