Re: Attn: sed(1) regular expression gurus

2003-07-15 Thread Steve Coile
On Mon, 14 Jul 2003, D J Hawkey Jr wrote:
> I'm getting really frustrated by a seemingly simple problem. I'm doing
> this under FreeBSD 4.5.
>
> Given these portions of an e-mail's multi-line Received header as tests:
>
>   by some.host.at.a.com (Postfix) with ESMTP id 3A4E07B03
>   by some.host.at.a.com (8.11.6) ESMTP;
>   by some.host.at.a.different.com (8.11.6p2/8.11.6) ESMTP;
>   by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03
>   by some.host.at.yet.another.com (123.4.56.789) id 3A4E07B03

# tested with sed-4.0.5-1 for RHL 9.0

# remove junk we don't care about
s/^.*by \([^ ]*\) (\([^)]*\)).*$/\1 \2/
 
# identify valid hostname
s/^[[:alnum:]][-[:alnum:]]*\(\.[[:alnum:]][-[:alnum:]]*\)*/host:&/

# identify valid IP address (w/o brackets)
s/[[:digit:]]\{1,3\}\(\.[[:digit:]]\{1,3\}\)\{3\}$/ipaddr:&/

# identify valid IP address (w/brackets)
s/\[\([[:digit:]]\{1,3\}\(\.[[:digit:]]\{1,3\}\)\{3\}\)\]$/ipaddr:\1/
 
# discard if no valid hostname or IP address
/\(^host:\| ipaddr:\)/!d
 
# if valid IP address, discard anything else
s/^.* ipaddr://
 
# if valid hostname, discard anything else
s/^host:\([^ ]*\).*$/\1/

-- 
Steve Coile
Systems Administrator
Nando Media
ph: 919-861-1200
fax: 919-861-1300
e-mail: [EMAIL PROTECTED]
http://www.nandomedia.com
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: sed(1) regular expression gurus - SOLUTION

2003-07-15 Thread D J Hawkey Jr
First off, thanks to all of you who scratched your heads over this
puzzle. All had the right idea to some extent or another.

Based in part on the replies, and my own work, here's the final result:

FOLDER=$HOME/Mail/spam

NAME_RE='[[:alnum:]_.-]+'
ADDY_RE='([0-9]{1,3}\.){3}[0-9]{1,3}'

cat $FOLDER \
  |grep -A 5 '^Received:' \
  |egrep '^(Received:|	)' \
  |sed -E \
    -e 's/(^Received:|by|from)[[:space:]]+//g' \
    -e "s/\([HELO]{4}[[:space:]]+($NAME_RE)\)/\1/" \
    -e "s/\(($NAME_RE)[[:space:]]+\[($ADDY_RE)\]\)/\1 \2/g" \
    -e "s/(\(\[?|\[)($ADDY_RE)(\]|\]?\))/\2/g" \
    -e 's/[[:space:]]*(\(|id|via|with|E?SMTP|;).*//' \
    -e 's/(\(envelope-|for|Sun|Mon|Tue|Wed|Thu|Fri|Sat).*//' \
    -e 's/[][(){}]//g'

Note that the whitespace in the second pipe is one tab character.

The first two pipes isolate the multi-line headers. The first sed
command strips keywords and any following whitespace. The second
sed command returns the name in a parenthetical HELO or EHLO. The
third sed command returns the name and address in a (... [...]).
The fourth sed command - the one I inquired about - returns the
address in any of (...), ([...]), or [...]. The fifth sed
command strips possible whitespace, keywords or an opening
parenthesis (now that it's of no consequence), and anything after
them. The sixth sed command strips more keywords and anything
after them (it might be merged into the fifth; what it strips is
often on another line). Finally, the last sed command strips any
errant delimiters; strictly speaking, it's redundant, but when I
ran a spam file (~12.3 MB) through this, some delimiters did leak
through.
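
For anyone who wants to poke at the fourth expression on its own, here
is a minimal sketch. The helper name strip_addr_delims and the 192.0.2.x
sample addresses are made up for illustration; only ADDY_RE and the
substitution itself come from the script above. It assumes a sed that
accepts -E.

```shell
#!/bin/sh
# Illustration only: strip_addr_delims is a hypothetical helper name.
ADDY_RE='([0-9]{1,3}\.){3}[0-9]{1,3}'

strip_addr_delims() {
  # Unwrap a dotted quad from (...), ([...]) or [...].
  # \2 is the quad itself; the delimiters around it are dropped.
  sed -E -e "s/(\(\[?|\[)($ADDY_RE)(\]|\]?\))/\2/g"
}

echo 'a (192.0.2.1) b ([192.0.2.2]) c [192.0.2.3] d (8.11.6)' \
  | strip_addr_delims
```

With GNU sed this should print "a 192.0.2.1 b 192.0.2.2 c 192.0.2.3 d
(8.11.6)": the three address forms lose their delimiters, while the
three-part version string is left for the later commands to deal with.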

Just thought those that replied to my plea might like to see this,
and perhaps somebody else will find it useful. No, I'm not telling
what it's for.  ;-,

Dave

-- 
  __ __
  \__   \D. J. HAWKEY JR.   /   __/
 \/\ [EMAIL PROTECTED]/\/
  http://www.visi.com/~hawkeyd/



Attn: sed(1) regular expression gurus

2003-07-14 Thread D J Hawkey Jr
Hi all.

I'm getting really frustrated by a seemingly simple problem. I'm doing
this under FreeBSD 4.5.

Given these portions of an e-mail's multi-line Received header as tests:

  by some.host.at.a.com (Postfix) with ESMTP id 3A4E07B03
  by some.host.at.a.com (8.11.6) ESMTP;
  by some.host.at.a.different.com (8.11.6p2/8.11.6) ESMTP;
  by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03
  by some.host.at.yet.another.com (123.4.56.789) id 3A4E07B03

I want to isolate the addresses (one for the 1st through 3rd, two for
the 4th and 5th). Here's the sed(1) command I'm playing with:

  echo "by nospam.mc.mpls.visi.com (Postfix) with ESMTP id 3A4E07B03" \
  |sed -E \
    -e 's/by[[:space:]]+//' \
    -e 's/(\((\[?([0-9]{1,3}\.){3}[0-9]{1,3}\]?){0}\)|id|with|E?SMTP).*//'

In all cases, the parenthetical word is returned, when only the last
two should return the parenthetical word. The idea behind the first
branch of the second sed(1) command is to match anything that isn't a
digits.digits.digits.digits pattern. I've tried simpler expressions
like \(\[?[^0-9.]+\]?\), but it fails on the third example.

What the devil am I doing wrong?? Am I exercising known bugs in GNU's
sed(1)? Can anyone dream up a different solution - please, no Perl, but
awk(1) is fine.
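
A side note on the {0} in the first branch, offered as a minimal sketch
rather than a diagnosis of the whole command: in POSIX extended regular
expressions a bound of {0} means "exactly zero repetitions", so the
group it follows can only match the empty string. It does not mean
"anything except this". The branch therefore reduces to a literal (),
which none of the test lines contain, leaving the id|with|E?SMTP
alternatives to do the stripping. The function name first_branch below
is hypothetical, and a sed that accepts -E is assumed.

```shell
#!/bin/sh
# Illustration only: first_branch is a hypothetical helper that applies just
# the first alternative of the second -e expression and marks a match as <M>.
first_branch() {
  sed -E -e 's/\((\[?([0-9]{1,3}\.){3}[0-9]{1,3}\]?){0}\)/<M>/'
}

echo 'host (Postfix) with ESMTP' | first_branch  # parenthetical word: no match
echo 'host (123.4.56.789) id X'  | first_branch  # dotted quad: no match either
echo 'host () with ESMTP'        | first_branch  # only an empty () matches
```

Only the third line should come back changed, as "host <M> with ESMTP".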

Thanks,
Dave



Re: sed(1) regular expression gurus

2003-07-14 Thread Rob
OK, here's a solution using awk - may be possible in sed, but awk has
more control statements for this kind of thing:

awk --posix -F'[^0-9A-Za-z.]+' '
  $1 ~ /by/ { result = $2
    for (i=3; i<=NF; i++) {
      if ($i ~ /^([0-9]+\.){3}[0-9]+$/) {
        result = result " " $i
      }
    }
    print result
  }'

* Use the field separator to throw away anything that isn't a number,
letter or period - don't have to worry about brackets anymore
* Match lines starting with 'by' and save the second word (which should
be a hostname)
* Check the following words - if they match an IP address, they're saved
too
* Then print the result!

There may be 'neater' ways of doing it, but it's the most concise
example I could come up with.

You need to include the --posix option to get the '{3}' notation to work
(peculiar to GNU awk).



Re: sed(1) regular expression gurus

2003-07-14 Thread D J Hawkey Jr
On Jul 15, at 12:49 AM, Rob wrote:
>
> awk --posix -F'[^0-9A-Za-z.]+' '
>   $1 ~ /by/ { result = $2
>     for (i=3; i<=NF; i++) {
>       if ($i ~ /^([0-9]+\.){3}[0-9]+$/) {
>         result = result " " $i
>       }
>     }
>     print result
>   }'
>
> There may be 'neater' ways of doing it, but it's the most concise
> example I could come up with.

This is better than anything I've dreamed up with sed or awk, and is
really close, but it fails on this:

  by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03

The parenthetical is a [hacked] sendmail version. I don't see how the
script fails, though, as you do test for a full/complete dotted quad,
and even test for a BOL and EOL on either side of it. The 8.11.6
shouldn't match. I changed the '+'es to {1,3}s for even better precision
in the if (...), but it didn't make any difference (nor should it have).

BTW, why the 'one or more' flag in the FS assignment?

> You need to include the --posix option to get the '{3}' notation to work
> (peculiar to GNU awk).

Kinda throws portability out the window, but I'll settle for it.

Dave



Re: sed(1) regular expression gurus

2003-07-14 Thread D J Hawkey Jr
On Jul 14, at 11:04 AM, D J Hawkey Jr wrote:
>
> On Jul 15, at 12:49 AM, Rob wrote:
> >
> > awk --posix -F'[^0-9A-Za-z.]+' '
> >   $1 ~ /by/ { result = $2
> >     for (i=3; i<=NF; i++) {
> >       if ($i ~ /^([0-9]+\.){3}[0-9]+$/) {
> >         result = result " " $i
> >       }
> >     }
> >     print result
> >   }'
>
> This is better than anything I've dreamed up with sed or awk, and is
> really close, but it fails on this:
>
>   by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03

Another astute fellow offered this:

  sed -E \
    -e 's/by[[:space:]]+//' \
    -e 's/\(\[?(([0-9]{1,3}\.){3}[0-9]{1,3})\]?\)/\1/' \
    -e 's/[[:space:]]*(\(|id|via|with|E?SMTP|;).*//'

The idea being to pull anything that looks like an IP address out of
parentheses first (2nd command), then junk any other parenthetical
stuff with the other cruft on the line (3rd command).
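
As a rough check of that ordering, here is a sketch that wraps the three
commands in a function and feeds it a few of the test lines plus the
sendmail line that tripped up the awk script. The wrapper name by_fields
is hypothetical, the input lines are written without leading whitespace,
and a sed that accepts -E is assumed.

```shell
#!/bin/sh
# Illustration only: by_fields is a hypothetical wrapper around the three
# sed commands quoted above.
by_fields() {
  sed -E \
    -e 's/by[[:space:]]+//' \
    -e 's/\(\[?(([0-9]{1,3}\.){3}[0-9]{1,3})\]?\)/\1/' \
    -e 's/[[:space:]]*(\(|id|via|with|E?SMTP|;).*//'
}

by_fields <<'EOF'
by some.host.at.a.com (Postfix) with ESMTP id 3A4E07B03
by some.host.at.a.different.com (8.11.6p2/8.11.6) ESMTP;
by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03
by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03
EOF
```

The address survives only on the third line, because by the time the
last command looks for an opening parenthesis, the second one has
already unwrapped the dotted quad. The other three lines should come
back as the bare hostname.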

I did learn a new syntax from your script, nonetheless. Thanks,

Dave



Re: sed(1) regular expression gurus

2003-07-14 Thread Thomas McIntyre
Dave wrote:

> This is better than anything I've dreamed up with sed or awk, and
> is really close, but it fails on this:
>
>   by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03

I know you want to avoid perl, but this kind of problem is its
sweet spot.  The following might be incrementally better (though the
expression to recognize a dotted quad is technically incorrect):

perl -ne '
next unless /^by/;
@f=(split)[1..2];
$_=pop @f;
s/^\D*//g;
s/\D*$//g;
push(@f,$_) if /^(\d{1,3}\.){3}(\d{1,3})$/;
print join(" ", @f)."\n";
'

I'm not fluent enough to translate it but I think awk has all the
required features to do so.

Tom McIntyre




Re: sed(1) regular expression gurus

2003-07-14 Thread Rob
Probably because I'm using FS to throw away all non-hostname
characters - by the time it gets to the sendmail version, there's
nothing to distinguish one group of 4 numbers from another.

The 'one or more' is for lines like this:

  by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03
^^^^^^^  ^
where the hostnames (or IPs) are separated by multiple characters. As
you've discovered, this isn't necessarily the best approach.
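
To make that concrete, here is a rough sketch that mimics the field
splitting with tr(1) and grep(1) rather than awk's FS (an assumption,
made so that it runs without gawk's --posix). The name quad_fields is
hypothetical. Once the slash and parentheses are gone, the tail of the
sendmail version string is a field of its own and passes the same
dotted-quad test as a real address.

```shell
#!/bin/sh
# Illustration only: quad_fields is a hypothetical helper. It breaks a line on
# every run of characters outside [0-9A-Za-z.], roughly what the FS above
# does, then keeps only the fields that pass the dotted-quad test.
quad_fields() {
  tr -cs '0-9A-Za-z.' '\n' | grep -E '^([0-9]+\.){3}[0-9]+$'
}

echo 'by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03' \
  | quad_fields
echo 'by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03' \
  | quad_fields
```

The first call should print 8.11.6.2 and the second 123.4.56.789, which
is why the script cannot tell the two apart.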
