2012-04-02 12:40:27 -0400, Kris Deugau:
Can anyone point out what bit of stupidity I'm committing in trying
to use this:
rawbody OVERSIZE_COMMENT m|<!--(?!-->).{32000,}|s
to match messages that are mostly very very long HTML comment(s)?
I've found one way to handle this; use "full" instead of "rawbody".
IIRC there is still some chunkifying done to rawbody, so nothing will
ever match 32K characters of what's provided for rawbody rules. IIRC
the limit is somewhere between 2K and 3K.
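
A rough sketch of why chunking defeats a {32000,} quantifier. This is not SpamAssassin's code: the 2048-character chunk size is an assumed value picked from the "2K to 3K" recollection above, and the message body is a made-up test string. It only shows that a pattern needing 32000+ characters cannot match any single piece that is shorter than that, while the same pattern matches the unsplit text (which is roughly what a "full" rule gets to see).

```perl
use strict;
use warnings;

# Assumption: illustrative chunk size only, not SpamAssassin's real value.
my $chunk_size = 2048;

# Hypothetical body: a single HTML comment holding 32500 characters.
my $body = '<!--' . ('word ' x 6500) . '-->';

my $re = qr/<!--(?:(?!-->).){32000,}/s;

# Cut the body into fixed-size pieces, roughly as a chunked rule would see it.
my @chunks;
for (my $pos = 0; $pos < length($body); $pos += $chunk_size) {
    push @chunks, substr($body, $pos, $chunk_size);
}

# Test the whole body once, then count how many individual chunks match.
my $whole_hit  = $body =~ $re ? 1 : 0;
my $chunk_hits = grep { $_ =~ $re } @chunks;

print "whole body: ", ($whole_hit ? "match" : "no match"), "\n";
print "chunks matching: $chunk_hits of ", scalar(@chunks), "\n";
```

With these assumed sizes the whole body should match and no chunk should, since every chunk is far shorter than 32000 characters.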
On 4/2/2012 12:58 PM, Stephane Chazelas wrote:
Don't know about the spamassassin issue, but that regexp
matches <!-- followed by a sequence of 32000 or more characters
provided that sequence doesn't start with "-->".
ITYM
m|<!--(?:(?!-->).){32000,}|s
That is, you need to look ahead at each character of the sequence
to look for the closing comment tag, otherwise you'll match on
<!-- short comment --> <31982 or more characters>
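
A small sketch of the distinction described above, using {50,} instead of {32000,} so the test string stays short. The input is a hypothetical example (a short, closed comment followed by filler), and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: a short, already-closed comment followed by 100 filler characters.
my $text = '<!-- short comment -->' . ('x' x 100);

# Lookahead is checked once, immediately after "<!--".
my $once = qr/<!--(?!-->).{50,}/s;

# Lookahead is checked before every character of the sequence.
my $tempered = qr/<!--(?:(?!-->).){50,}/s;

my $once_hit     = $text =~ $once     ? 1 : 0;
my $tempered_hit = $text =~ $tempered ? 1 : 0;

print "lookahead once:     ", ($once_hit     ? "match" : "no match"), "\n";
print "lookahead per char: ", ($tempered_hit ? "match" : "no match"), "\n";
```

The first pattern runs straight past the closing "-->" into the filler, so it matches; the second stops at "-->" after only 15 characters and does not. On input that is one long unclosed-until-the-end comment, both patterns behave the same.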
Actually, no, it works as intended.
If you uncomment the string fragments below, the 320-character versions
both match but the 32000-character ones don't. As-is, neither matches.
my $shorty = "<!-- foo bar baxkja safdjwelkj werf kjwlekrjwlekr jlwkerk jawelkj awlekj lakewjflakwjef lakj ".
#"awelkj alkfj awlekfj lawie fjalwief jlawijfe lawiejflfiwj elifj lawiej4lti j34wlit j43wli jliajs lij flisaj ".
#"flsaidfj liasjdf lisdj lijsa fldi fa;slkjf;lask j;lkaj fs; jfsdjf sak hflkshf lksj fhlksaj fhlska fhlkajs ".
"fhlkajshlkjashflkjasdfhlkjsahdflkjas hlkfh lwelif hwli3u fhliuwae fhliuawfheliuhfliu fhwei ufhsd fg/sd ".
"/dsf/g/sdafg /sdf/ gdf/sg/sdf/gds/g/sd/th/ser/h/ser/ghs/rg/srg/ser/gs/erg/ser/g/ser/g/ser/g/ser/g/serg -->";
my @regex = ('<!--(?:(?!-->).){32000,}', '<!--(?:(?!-->).){320,}',
'<!--(?!-->).{32000,}', '<!--(?!-->).{320,}');
foreach (@regex) {
print "$_ shorty ok\n" if $shorty =~ m/$_/s;
}
(And yes, this is almost exactly what I'm seeing in these monster
comments, although they're usually at least mostly real words, and they
are in the ~100K+-characters length range.)
Bowie Bailey wrote:
And you may or may not want to match on a closing comment at the end.
m|<!--(?:(?!-->).){32000,}-->|s
Enh, I don't think it matters.
However, when testing in a minimal Perl script that just tries to match
on the whole raw message, my original works fine; I don't need the
extra non-capturing parentheses.
Also, because of all of the lookaheads, this may be an expensive
regexp. If you try it, keep a close eye on your SA. If it slows down
to a crawl, this is probably the culprit.
None of the variants seem to be *too* nasty on the CPU though; feeding
one of these monster messages through a minimal Perl script as above
that just runs a handful of regexes showed:
real 0m0.050s
user 0m0.045s
sys 0m0.012s
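
For anyone wanting to repeat that kind of check, here is a minimal timing sketch. It is not the script used for the numbers above: the message is a synthetic stand-in (filler text followed by one 32500-character comment), the labels are made up, and Time::HiRes is used for wall-clock timing of each match.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical stand-in for a large message: filler, then one long HTML comment.
my $message = ('filler text ' x 5000) . '<!--' . ('word ' x 6500) . '-->';

# The three variants discussed in this thread.
my %patterns = (
    'lookahead once'     => qr/<!--(?!-->).{32000,}/s,
    'lookahead per char' => qr/<!--(?:(?!-->).){32000,}/s,
    'with closing tag'   => qr/<!--(?:(?!-->).){32000,}-->/s,
);

my %matched;
for my $name (sort keys %patterns) {
    my $start = Time::HiRes::time();
    $matched{$name} = $message =~ $patterns{$name} ? 1 : 0;
    my $elapsed = Time::HiRes::time() - $start;
    printf "%-20s %-8s %.6fs\n",
        $name, ($matched{$name} ? 'match' : 'no match'), $elapsed;
}
```

All three variants should match this input; the elapsed times will vary by machine and Perl version, and a single run on a synthetic string says little about behaviour on real mail.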
-kgd