2012-04-02 12:40:27 -0400, Kris Deugau:
Can anyone point out what bit of stupidity I'm committing in trying
to use this:

rawbody OVERSIZE_COMMENT        m|<!--(?!-->).{32000,}|s

to match messages that are mostly very very long HTML comment(s)?

I've found one way to handle this: use "full" instead of "rawbody". IIRC there is still some chunkifying done to rawbody, so nothing will ever match 32K characters of what's provided for rawbody rules. IIRC the limit is somewhere between 2K and 3K.
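To illustrate why chunking would defeat the rule, here is a minimal Perl sketch. It does not use SpamAssassin at all; it only simulates splitting a body into pieces. The 2048-character chunk size and the synthetic body are assumptions made for illustration, since the limit above is only recalled as being between 2K and 3K.

```perl
use strict;
use warnings;

# Hypothetical message body: one HTML comment holding 40000 filler characters.
my $body = '<!-- ' . ('word ' x 8000) . '-->';

# The rule's regex from the original post.
my $re = qr/<!--(?!-->).{32000,}/s;

# Assumed chunk size, chosen only for illustration.
my $chunk_size = 2048;

# Split the body into consecutive pieces of at most $chunk_size characters.
my @chunks = $body =~ /(.{1,$chunk_size})/gs;

# Match once against the whole body, then against each chunk separately.
my $whole_hit  = $body =~ $re ? 1 : 0;
my $chunk_hits = grep { $_ =~ $re } @chunks;

print "whole body matches: ", ($whole_hit ? "yes" : "no"), "\n";
print "chunks matching:    $chunk_hits of ", scalar(@chunks), "\n";
```

No single chunk holds 32000 characters, so a pattern that needs that many can only succeed against the undivided text, which is what a "full" rule sees.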

On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
Don't know about the spamassassin issue, but that regexp
matches <!-- followed by a sequence of 32000 or more characters
provided that sequence doesn't start with "-->".

ITYM

m|<!--(?:(?!-->).){32000,}|s

That is you need to look ahead at each character of the sequence
to look for the closing comment tag, otherwise you'll match on
<!-- short comment -->  <31982 or more characters>
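The difference between the two patterns can be checked with a small Perl sketch. The input string is a made-up example (a short, closed comment followed by 32000 filler characters), not taken from a real message.

```perl
use strict;
use warnings;

# Hypothetical input: a short, closed comment followed by 32000 filler characters.
my $input = '<!-- short comment -->' . ('x' x 32000);

# Lookahead is tested once, immediately after "<!--".
my $single_lookahead = qr/<!--(?!-->).{32000,}/s;

# Lookahead is tested before every character, so the run stops at the first "-->".
my $per_char_lookahead = qr/<!--(?:(?!-->).){32000,}/s;

my $single_hit   = $input =~ $single_lookahead   ? 1 : 0;
my $per_char_hit = $input =~ $per_char_lookahead ? 1 : 0;

print "single lookahead:   ", ($single_hit   ? "match" : "no match"), "\n";
print "per-char lookahead: ", ($per_char_hit ? "match" : "no match"), "\n";
```

On this input the single-lookahead form matches, because ".{32000,}" runs straight past the closing "-->", while the per-character form does not.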

Actually, no, it works as intended.

If you uncomment the string fragments below, the 320-character versions both match but the 32000-character ones don't. As-is, none of them match.

my $shorty = "<!-- foo bar baxkja safdjwelkj werf kjwlekrjwlekr jlwkerk jawelkj awlekj lakewjflakwjef lakj ".
#"awelkj alkfj awlekfj lawie fjalwief jlawijfe lawiejflfiwj elifj lawiej4lti j34wlit j43wli jliajs lij flisaj ".
#"flsaidfj liasjdf lisdj lijsa fldi fa;slkjf;lask j;lkaj fs; jfsdjf sak hflkshf lksj fhlksaj fhlska fhlkajs ".
"fhlkajshlkjashflkjasdfhlkjsahdflkjas hlkfh lwelif hwli3u fhliuwae fhliuawfheliuhfliu fhwei ufhsd fg/sd ".
"/dsf/g/sdafg /sdf/ gdf/sg/sdf/gds/g/sd/th/ser/h/ser/ghs/rg/srg/ser/gs/erg/ser/g/ser/g/ser/g/ser/g/serg -->";

my @regex = ('<!--(?:(?!-->).){32000,}', '<!--(?:(?!-->).){320,}', '<!--(?!-->).{32000,}', '<!--(?!-->).{320,}');

foreach (@regex) {
  print "$_ shorty ok\n" if $shorty =~ m/$_/s;
}

(And yes, this is almost exactly what I'm seeing in these monster comments, although they're usually at least mostly real words, and they run to roughly 100K characters or more.)

Bowie Bailey wrote:
And you may or may not want to match on a closing comment at the end.

m|<!--(?:(?!-->).){32000,}-->|s

Enh, I don't think it matters.

However, when testing in a minimal Perl script that just tries to match on the whole raw message, my original works fine; I don't need the extra non-capturing parentheses.

Also, because of all of the lookaheads, this may be an expensive
regexp.  If you try it, keep a close eye on your SA.  If it slows down
to a crawl, this is probably the culprit.

None of the variants seem to be *too* nasty on the CPU though; feeding one of these monster messages through a minimal Perl script as above that just runs a handful of regexes showed:

real    0m0.050s
user    0m0.045s
sys     0m0.012s
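For anyone wanting to repeat that kind of check, a rough sketch of such a script might look like the following. The message here is synthetic (a hypothetical stand-in with one comment of about 32500 characters rather than a real ~100K sample), the variant names are invented for this example, and Time::HiRes is used in place of the shell's time command, so the figures it prints are not comparable to the ones above.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical stand-in for a raw message: one comment with about 32500 filler characters.
my $message = "Subject: test\n\n<html><!-- " . ('lorem ' x 5416) . " --></html>\n";

# The three variants discussed in this thread; the hash keys are made-up labels.
my %variants = (
    single_lookahead   => qr/<!--(?!-->).{32000,}/s,
    per_char_lookahead => qr/<!--(?:(?!-->).){32000,}/s,
    with_closing_tag   => qr/<!--(?:(?!-->).){32000,}-->/s,
);

# Run each variant once against the whole message and record the wall-clock time.
my %matched;
for my $name (sort keys %variants) {
    my $start = time;
    $matched{$name} = $message =~ $variants{$name} ? 1 : 0;
    my $elapsed = time - $start;
    printf "%-20s %-8s %.6fs\n", $name, ($matched{$name} ? "matched" : "no match"), $elapsed;
}
```

A single run like this only gives a rough idea; timings against real messages inside SA may well differ.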

-kgd
