Re: Regex help

Adam Katz Thu, 21 Apr 2011 15:33:11 -0700

Before I help you with your shell and regex issues, I should point out
that this is not a very strong rule.  It will hit ham.

On 04/21/2011 02:54 PM, Kevin Miller wrote:
> I'm trying to write a local rule that will scan for 5 or more 
> instances of "<br>" but not having much luck.  I'm testing first on 
> the CLI, just trying to get the syntax down.

> What works:
> I have a file called DomainLiterals.txt with repeating characters
> and it returns expected results:
> mkm@mis-mkm-lnx:~$ egrep \[10.]{3} DomainLiterals.txt 
> you can add a line containing only [10.10.10.10] to
> /etc/mail/local-host-names where 10.10.10.10 is the IP address you

The regex '\[10.]{3}' does not do what you want.  The shell un-escapes
it to '[10.]{3}', which will match any of these:

111
...
000
10.
.01

since it is asking for three of any character matching one, zero, or
dot.  The grouping symbol you are looking for is the parenthesis (curly
brackets only hold the repeat count), and the dot (when outside a square
bracket) must be escaped as it otherwise means "any single character."
That makes the pattern '(10\.){3}'.
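
To make the difference concrete, here is a minimal sketch.  The sample
lines are made up for illustration, and it assumes a grep that supports
-E and -c:

```shell
# Hypothetical sample lines, not taken from DomainLiterals.txt
sample=$(printf '%s\n' '111' '10.' 'abc' '[10.10.10.10]')

# Character class: any three consecutive characters out of 1, 0, or dot
class_hits=$(printf '%s\n' "$sample" | grep -Ec '[10.]{3}')

# Parenthesized group with an escaped dot: the literal "10." three times
group_hits=$(printf '%s\n' "$sample" | grep -Ec '(10\.){3}')

echo "class: $class_hits, group: $group_hits"   # class: 3, group: 1
```

The character class hits every line except 'abc', while the grouped
version only hits the line that really contains 10.10.10.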

> However, doing this fails:
> mxg:/var/spool/MailScanner/quarantine/20110421/nonspam # egrep \[<br>]{5,} 
> p3LJZSnX024470
> -bash: br: No such file or directory
> 
> The file p3LJZSnX024470 is just a plain text file in a quarantine directory.

Again, you have a CLI escaping issue AND a regex issue.  If you are not
quoting that query, you need to escape almost every single punctuation
character listed there.  Alternatively, you could put that query in quotes.

"egrep \[<br>]{5,} p3L..." tells the shell that you are looking for the
query "[" with input redirected from file "br" and output redirected to
a file named after "]{5,}", leaving your email file as the only file
argument.  Since there is no file called "br", bash gives up with the
error you saw.

"egrep '[<br>]{5,}' p3L..." prevents the shell from trying to interpret
your query but still has a bad query, as it looks for five or more
consecutive occurrences of any character listed between the square
brackets, so "<b>brr</b>" will match up to the slash.
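
A quick way to see both points, as a sketch that assumes a grep with -E
and -o support (the variable names and sample strings are only for
illustration):

```shell
line='<b>brr</b>'
five='<br><br><br><br><br>'

# Quoted character class: prints the run of <, b, r, > it matched
loose=$(printf '%s\n' "$line" | grep -Eo '[<br>]{5,}')
echo "$loose"   # <b>brr<

# Quoted group: only five literal "<br>" tags in a row count as a hit
strict_hits=$(printf '%s\n' "$line" | grep -Ec '(<br>){5}' || true)
five_hits=$(printf '%s\n' "$five" | grep -Ec '(<br>){5}' || true)
echo "strict: $strict_hits, five: $five_hits"   # strict: 0, five: 1
```

The single quotes keep the shell away from <, >, and the braces, and the
parentheses make the repeat count apply to the whole tag.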

> What am I missing? I'll turn this into a body rule once I get the
> syntax right then test it for a day or so w/a score of .01. If I'm not
> hitting legitimate mail I'll bump it up.

On top of all of this, egrep does not use Perl-compatible regular
expressions (PCRE) (though the regexps I've used so far are compatible
with POSIX regexps as well as PCRE).  See 'man perlre' (or your favorite
website) for help on PCREs.  Try using either grep -P (requires
libpcre3) or pcregrep (which you may have to install) or else perl
itself, like:

  perl -ne 'print if /whatever/'  < DomainLiterals.txt

As to what that should be searching for, I suspect you want a multi-line
expression (which none of the above shell commands will help you with
since they parse one line at a time).  Try this:

header  LOCAL_10_10_10_10  X-Spam-Relays-Untrusted =~ /^[^\[]+ ip=(?:10\.){3}/

rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi

That second one will also match <br/> and allows for a few spaces, tabs,
or linebreaks in between the <br> tags.  For a more strict version of
what you're looking for, try this:

rawbody LOCAL_5X_BR_TAGS   /(?:<br>){5}/i

Note that you need rawbody since body rules will strip HTML.

Again, this rule will hit some hams.  It is also not terribly CPU-efficient.

Better solution:  put some examples up on a pastebin and link them to us
so we can help you find more diagnostic (and simpler) patterns to nail
them with.
