On 05/29/2012 09:58 PM, David F. Skoll wrote:
> This idea is growing out of a thread I started in which someone pointed me
> to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062
>
> Ignoring the locale under which SA runs and also ignoring the character
> encoding of the message can make body matching rules behave differently
> on different systems and just plain incorrectly for some messages.
>
> I'm thinking of making something (a plugin, maybe?) that canonicalizes
> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> Something like:
>
> body_utf8 __DRUGS_MUSCLE1 /.. proper Unicode regex/...
>
> According to the perlunicode man page:
>
> Regular Expressions
> The regular expression compiler produces polymorphic opcodes. That
> is, the pattern adapts to the data and automatically switches to
> the Unicode character scheme when presented with data that is
> internally encoded in UTF-8 -- or instead uses a traditional byte
> scheme when presented with byte data.
>
> so assuming we present it with proper UTF-8 data, the regexes should Just
> Work.
>
> I'm not sure how easy this will be, but I think it's worthwhile.
> In the long run, I think all body rules should be body_utf8 and another
> rule type should provide access to the body in its original encoding if that
> is needed.
>
> Comments? Suggestions?
It is a nice idea IMHO.
But it is worth remembering:
a) Unicode itself may require canonicalization too.
Some characters may be represented in Unicode either as a single
character or as a composition of several characters
b) some spammers do not declare the encoding properly, so some encoding
guessing would be handy
c) It would be nice to allow access to _both_ raw (bytes) and utf-8
encoded message body
d) many people in "ASCII part of the world" would not need it anyway :-)
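
To make a) to c) a bit more concrete, here is a rough sketch. It is in
Python rather than Perl, purely for illustration, and is not how
SpamAssassin does or would do it. The function names (decode_part,
canonicalize, prepare_body) and the fallback charset order are my own
assumptions; real encoding guessing would need to be smarter than
"try a few charsets in turn".

```python
import re
import unicodedata
from typing import Optional, Tuple


def decode_part(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode a text/* part, falling back when the declared charset
    is missing, unknown, or simply wrong (point b).

    The candidate order below is an assumption for this sketch only.
    """
    for charset in (declared, "utf-8", "windows-1252"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes invalid in that charset.
            continue
    # latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


def canonicalize(text: str) -> str:
    """Normalize to NFC so composed and decomposed forms compare equal
    (point a)."""
    return unicodedata.normalize("NFC", text)


def prepare_body(raw: bytes, declared: Optional[str] = None) -> Tuple[bytes, str]:
    """Return both the untouched bytes and the canonical text (point c)."""
    return raw, canonicalize(decode_part(raw, declared))


# "e with acute" as one code point vs. "e" plus a combining accent.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# A Latin-1 body mislabelled as us-ascii by the sender.
raw_body, text_body = prepare_body("caf\u00e9".encode("latin-1"), "us-ascii")
```

Without the NFC step, a rule written with the precomposed character
would miss a message that uses the decomposed form, even though both
look identical to the reader.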