Re: Attn: sed(1) regular expression gurus
On Mon, 14 Jul 2003, D J Hawkey Jr wrote:

> I'm getting really frustrated by a seemingly simple problem. I'm doing
> this under FreeBSD 4.5.
>
> Given these portions of an e-mail's multi-line Received header as tests:
>
>   by some.host.at.a.com (Postfix) with ESMTP id 3A4E07B03
>   by some.host.at.a.com (8.11.6) ESMTP;
>   by some.host.at.a.different.com (8.11.6p2/8.11.6) ESMTP;
>   by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03
>   by some.host.at.yet.another.com (123.4.56.789) id 3A4E07B03

    # tested with sed-4.0.5-1 for RHL 9.0

    # remove junk we don't care about
    s/^.*by \([^ ]*\) (\([^)]*\)).*$/\1 \2/

    # identify valid hostname
    s/^[[:alnum:]][-[:alnum:]]*\(\.[[:alnum:]][-[:alnum:]]*\)*/host:&/

    # identify valid IP address (w/o brackets)
    s/[[:digit:]]\{1,3\}\(\.[[:digit:]]\{1,3\}\)\{3\}$/ipaddr:&/

    # identify valid IP address (w/brackets)
    s/\[\([[:digit:]]\{1,3\}\(\.[[:digit:]]\{1,3\}\)\{3\}\)\]$/ipaddr:\1/

    # discard if no valid hostname or IP address
    /\(^host:\| ipaddr:\)/!d

    # if valid IP address, discard anything else
    s/^.* ipaddr://

    # if valid hostname, discard anything else
    s/^host:\([^ ]*\).*$/\1/

-- 
Steve Coile
Systems Administrator
Nando Media
ph: 919-861-1200
fax: 919-861-1300
e-mail: [EMAIL PROTECTED]
http://www.nandomedia.com

___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: sed(1) regular expression gurus - SOLUTION
First off, thanks to all of you who scratched your heads over this puzzle.
All had the right idea to some extent or another.

Based in part on the replies, and my own work, here's the final result:

    FOLDER=$HOME/Mail/spam
    NAME_RE="[[:alnum:]_.-]+"
    ADDY_RE="([0-9]{1,3}\.){3}[0-9]{1,3}"

    cat $FOLDER \
     |grep -A 5 "^Received:" \
     |egrep "^(Received:|	)" \
     |sed -E \
       -e "s/(^Received:|by|from)[[:space:]]+//g" \
       -e "s/\([HELO]{4}[[:space:]]+($NAME_RE)\)/\1/" \
       -e "s/\(($NAME_RE)[[:space:]]+\[($ADDY_RE)\]\)/\1 \2/g" \
       -e "s/(\(\[?|\[)($ADDY_RE)(\]|\]?\))/\2/g" \
       -e "s/[[:space:]]*(\(|id|via|with|E?SMTP|;).*//" \
       -e "s/(\(envelope-|for|Sun|Mon|Tue|Wed|Thu|Fri|Sat).*//" \
       -e "s/[][(){}]//g"

Note that the whitespace in the second pipe is one tab character.

The first two pipes isolate the multi-line headers. The sed commands then
work as follows:

1. The first strips keywords and any following whitespace.
2. The second returns the name in a parenthetical HELO or EHLO.
3. The third returns the name and address in a (... [...]).
4. The fourth - the one I inquired about - returns the address in any
   of (...), ([...]), or [...].
5. The fifth strips possible whitespace, keywords or an opening
   parenthesis (now that it's of no consequence), and anything after them.
6. The sixth strips more keywords and anything after them (it could be
   merged into the fifth, but what it strips is often on another line).
7. The last strips any errant delimiters; strictly speaking, it's
   redundant, but when I ran a spam file (~12.3Mb) through this, some
   delimiters did leak through.

Just thought those that replied to my plea might like to see this, and
perhaps somebody else will find it useful.

No, I'm not telling what it's for. ;-,

Dave

-- 
D. J. HAWKEY JR.
[EMAIL PROTECTED]
http://www.visi.com/~hawkeyd/
Attn: sed(1) regular expression gurus
Hi all.

I'm getting really frustrated by a seemingly simple problem. I'm doing
this under FreeBSD 4.5.

Given these portions of an e-mail's multi-line Received header as tests:

    by some.host.at.a.com (Postfix) with ESMTP id 3A4E07B03
    by some.host.at.a.com (8.11.6) ESMTP;
    by some.host.at.a.different.com (8.11.6p2/8.11.6) ESMTP;
    by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03
    by some.host.at.yet.another.com (123.4.56.789) id 3A4E07B03

I want to isolate the addresses (one for the 1st through 3rd, two for
the 4th and 5th).

Here's the sed(1) command I'm playing with:

    echo "by nospam.mc.mpls.visi.com (Postfix) with ESMTP id 3A4E07B03" \
     |sed -E \
       -e "s/by[[:space:]]+//" \
       -e "s/(\((\[?([0-9]{1,3}\.){3}[0-9]{1,3}\]?){0}\)|id|with|E?SMTP).*//"

In all cases, the parenthetical word is returned, when only the last two
should return the parenthetical word.

The idea behind the first branch of the second sed(1) command is to match
anything that isn't a digits.digits.digits.digits pattern. I've tried
simpler expressions like \(\[?[^0-9.]+\]?\), but it fails on the third
example.

What the devil am I doing wrong?? Am I exercising known bugs in GNU's
sed(1)?

Can anyone dream up a different solution - please, no Perl, but awk(1)
is fine.

Thanks,
Dave
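[Editor's note] A likely reason the first branch never fires: in POSIX extended regular expressions, an interval of {0} means "exactly zero repetitions", not "anything except this". So \((...){0}\) can only match a literal, empty pair of parentheses. The sketch below reuses the expression from the post unchanged and assumes a sed that accepts -E (GNU sed or a recent BSD sed); the function name strip_tail and the "()" test line are made up for illustration.

```shell
#!/bin/sh
# The second expression from the post, unchanged.
RE='(\((\[?([0-9]{1,3}\.){3}[0-9]{1,3}\]?){0}\)|id|with|E?SMTP).*'

# strip_tail (hypothetical helper): apply both substitutions from the
# post to every line read from standard input.
strip_tail() {
    sed -E -e 's/by[[:space:]]+//' -e "s/$RE//"
}

# "(Postfix)" survives: the {0} branch cannot match it, so the first
# alternative that matches is the keyword "with".
# "()" is removed: empty parentheses are the only thing the {0}
# branch can match.
printf '%s\n' \
    'by some.host.at.a.com (Postfix) with ESMTP id 3A4E07B03' \
    'by some.host.at.a.com () with ESMTP id 3A4E07B03' \
  | strip_tail
```

Since sed regular expressions have no negation operator for a whole sub-pattern, the usual workaround is the reverse: match what you do want first, then discard the rest.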
Re: sed(1) regular expression gurus
OK, here's a solution using awk - it may be possible in sed, but awk has
more control statements for this kind of thing:

    awk --posix -F'[^0-9A-Za-z.]+' '
    $1 ~ /by/ {
        result = $2
        for (i=3; i<=NF; i++) {
            if ($i ~ /^([0-9]+\.){3}[0-9]+$/) {
                result = result " " $i
            }
        }
        print result
    }'

* Use the field separator to throw away anything that isn't a number,
  letter or period - you don't have to worry about brackets anymore
* Match lines starting with 'by' and save the second word (which should
  be a hostname)
* Check the following words - if they match an IP address, they're
  saved too
* Then print the result!

There may be 'neater' ways of doing it, but it's the most concise example
I could come up with.

You need to include the --posix option to get the '{3}' notation to work
(peculiar to GNU awk).
Re: sed(1) regular expression gurus
On Jul 15, at 12:49 AM, Rob wrote:

> awk --posix -F'[^0-9A-Za-z.]+' '
> $1 ~ /by/ {
>     result = $2
>     for (i=3; i<=NF; i++) {
>         if ($i ~ /^([0-9]+\.){3}[0-9]+$/) {
>             result = result " " $i
>         }
>     }
>     print result
> }'
>
> There may be 'neater' ways of doing it, but it's the most concise
> example I could come up with.

This is better than anything I've dreamed up with sed or awk, and is
really close, but it fails on this:

    by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03

The parenthetical is a [hacked] sendmail version. I don't see how the
script fails, though, as you do test for a full/complete dotted quad, and
even test for a BOL and EOL on either side of it. The 8.11.6 shouldn't
match. I changed the '+'es to {1,3}s for even better precision in the
if (...), but it didn't make any difference (nor should it have).

BTW, why the "one or more" flag in the FS assignment?

> You need to include the --posix option to get the '{3}' notation to
> work (peculiar to GNU awk).

Kinda throws portability out the window, but I'll settle for it.

Dave
Re: sed(1) regular expression gurus
On Jul 14, at 11:04 AM, D J Hawkey Jr wrote:

> On Jul 15, at 12:49 AM, Rob wrote:
>
> > awk --posix -F'[^0-9A-Za-z.]+' '
> > $1 ~ /by/ {
> >     result = $2
> >     for (i=3; i<=NF; i++) {
> >         if ($i ~ /^([0-9]+\.){3}[0-9]+$/) {
> >             result = result " " $i
> >         }
> >     }
> >     print result
> > }'
>
> This is better than anything I've dreamed up with sed or awk, and is
> really close, but it fails on this:
>
>   by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03

Another astute fellow offered this:

    sed -E \
      -e "s/by[[:space:]]+//" \
      -e "s/\(\[?(([0-9]{1,3}\.){3}[0-9]{1,3})\]?\)/\1/" \
      -e "s/[[:space:]]*(\(|id|via|with|E?SMTP|;).*//"

The idea is to pull anything that looks like an IP address out of
parentheses first (2nd command), then junk any other parenthetical stuff
along with the other cruft on the line (3rd command).

I did learn a new syntax from your script, nonetheless.

Thanks,
Dave
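[Editor's note] To see what the three-command version does, here is a small harness that feeds it the test lines from the thread plus the sendmail line that tripped up the awk script. It is only a sketch: it assumes a sed that accepts -E, and the function name extract_addr is invented for this example. The expressions themselves are the ones quoted above.

```shell
#!/bin/sh
# extract_addr (hypothetical helper): the three substitutions from the
# message above, applied to every line read from standard input.
extract_addr() {
    sed -E \
        -e 's/by[[:space:]]+//' \
        -e 's/\(\[?(([0-9]{1,3}\.){3}[0-9]{1,3})\]?\)/\1/' \
        -e 's/[[:space:]]*(\(|id|via|with|E?SMTP|;).*//'
}

# Lines with a parenthesised dotted quad keep "host address"; every
# other line is cut back to the host name alone.
printf '%s\n' \
    'by some.host.at.a.com (Postfix) with ESMTP id 3A4E07B03' \
    'by some.host.at.a.com (8.11.6) ESMTP;' \
    'by some.host.at.a.different.com (8.11.6p2/8.11.6) ESMTP;' \
    'by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03' \
    'by some.host.at.yet.another.com (123.4.56.789) id 3A4E07B03' \
    'by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03' \
  | extract_addr
```

One caveat worth checking against real data: the keyword alternation in the third command is not anchored to word boundaries, so a host name that happens to contain "id" or "via" (say, video.example.com) would be cut short.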
Re: sed(1) regular expression gurus
Dave wrote:

> This is better than anything I've dreamed up with sed or awk, and is
> really close, but it fails on this:
>
>   by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03

I know you want to avoid perl, but this kind of problem is its sweet
spot. The following might be incrementally better (though the expression
to recognize a dotted quad is technically incorrect):

    perl -ne '
        next unless /^by/;
        @f=(split)[1..2];
        $_=pop @f;
        s/^\D*//g;
        s/\D*$//g;
        push(@f,$_) if /^(\d{1,3}\.){3}(\d{1,3})$/;
        print join(" ", @f)."\n";
    '

I'm not fluent enough to translate it, but I think awk has all the
required features to do so.

Tom McIntyre
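[Editor's note] For readers who would rather stay with awk, here is one possible translation of the Perl one-liner. It is a sketch, not Tom's code: the function name by_host_addr and the variable names are invented, and the {1,3} intervals are spelled out as [0-9][0-9]?[0-9]? so that no --posix flag should be needed. Like the Perl version, it accepts octets above 255.

```shell
#!/bin/sh
# by_host_addr (hypothetical helper): awk rendering of the Perl logic.
# For lines starting with "by", print the second word, followed by the
# third word if it is a dotted quad once surrounding non-digits are
# stripped.
by_host_addr() {
    awk '
    BEGIN {
        oct  = "[0-9][0-9]?[0-9]?"            # one to three digits
        quad = "^" oct "\\." oct "\\." oct "\\." oct "$"
    }
    /^by/ {
        out  = $2
        cand = $3
        sub(/^[^0-9]*/, "", cand)             # like s/^\D*//
        sub(/[^0-9]*$/, "", cand)             # like s/\D*$//
        if (cand ~ quad)
            out = out " " cand
        print out
    }'
}

printf '%s\n' \
    'by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03' \
    'by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03' \
  | by_host_addr
```

Because the third word is tested as a whole, the "/" inside the sendmail version string keeps it from passing as an address.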
Re: sed(1) regular expression gurus
Probably because I'm using FS to throw away all non-hostname characters -
by the time it gets to the sendmail version, there's nothing to
distinguish one group of 4 numbers from another.

The 'one or more' is for lines like this

    by some.host.at.another.com ([123.4.56.789]) id 3A4E07B03
                               ^^^            ^^^

where the hostnames (or IPs) are separated by multiple characters.

As you've discovered, this isn't necessarily the best approach.
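[Editor's note] The effect Rob describes is easy to see by printing the fields awk produces with that separator. The snippet below is a small illustration only; the function name show_fields is invented, and nothing in it depends on --posix.

```shell
#!/bin/sh
# show_fields (hypothetical helper): split each input line with the same
# field separator as the awk script and print the fields, numbered.
show_fields() {
    awk -F'[^0-9A-Za-z.]+' '{
        for (i = 1; i <= NF; i++)
            printf "%d: %s\n", i, $i
    }'
}

# The "/" in the sendmail version string is a separator too, so
# 8.11.6.2 ends up as a field of its own (field 4 here) and looks
# exactly like a dotted quad to the later test.
printf '%s\n' \
    'by nospam.mc.mpls.visi.com (8.11.6/8.11.6.2) with ESMTP id 3A4E07B03' \
  | show_fields
```

Once the parentheses and slash have been consumed as separators, the script has no way left to tell an address in brackets from a version number.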