Re: Canonicalizing text parts to UTF-8 before applying body rules

Andrzej A. Filip Thu, 31 May 2012 00:05:58 -0700
On 05/29/2012 09:58 PM, David F. Skoll wrote:
> This idea is growing out of a thread I started in which someone pointed me
> to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062
>
> Ignoring the locale under which SA runs and also ignoring the character
> encoding of the message can make body matching rules behave differently
> on different systems and just plain incorrectly for some messages.
>
> I'm thinking of making something (a plugin, maybe?) that canonicalizes
> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> Something like:
>
> body_utf8  __DRUGS_MUSCLE1 /.. proper Unicode regex/...
>
> According to the perlunicode man page:
>
>    Regular Expressions
>        The regular expression compiler produces polymorphic opcodes.  That
>        is, the pattern adapts to the data and automatically switches to
>        the Unicode character scheme when presented with data that is
>        internally encoded in UTF-8 -- or instead uses a traditional byte
>        scheme when presented with byte data.
>
> so assuming we present it with proper UTF-8 data, the regexes should Just 
> Work.
>
> I'm not sure how easy this will be, but I think it's worthwhile.
> In the long run, I think all body rules should be body_utf8 and another
> rule type should provide access to the body in its original encoding if that
> is needed.
>
> Comments?  Suggestions?
It is a nice idea, IMHO.
But it is worth remembering:
a) Unicode itself may require canonicalization too.
    Some characters can be represented in Unicode either as a single
character or as a composition of several characters.
b) Some spammers do not declare the encoding properly, so some encoding
guessing would be handy.
c) It would be nice to allow access to _both_ the raw (bytes) and the
UTF-8 encoded message body.
d) Many people in the "ASCII part of the world" would not need it anyway :-)
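
To make points a) through c) concrete, here is a minimal sketch in Python (not SpamAssassin's Perl, and not a proposed implementation). The function name canonicalize and the fallback charset order are assumptions made up for illustration. It tries the declared charset first, falls back to a short guess list when the declaration is missing or wrong, applies Unicode NFC normalization so composed and decomposed forms compare equal, and returns the raw bytes alongside the decoded text.

```python
import unicodedata

# Hypothetical fallback order for guessing; not taken from SpamAssassin.
FALLBACK_CHARSETS = ("utf-8", "cp1252")


def canonicalize(raw, declared=None):
    """Return (raw_bytes, text), where text is decoded and NFC-normalized."""
    candidates = ([declared] if declared else []) + list(FALLBACK_CHARSETS)
    for charset in candidates:
        try:
            text = raw.decode(charset)
            break
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes invalid in it: try the next one.
            continue
    else:
        # Last resort: latin-1 maps every byte to a code point, so it cannot fail.
        text = raw.decode("latin-1")
    # NFC folds "e" + combining acute accent into the single precomposed "é".
    return raw, unicodedata.normalize("NFC", text)


# A decomposed "é" and a mislabeled single-byte "é" end up as the same text.
_, composed = canonicalize("cafe\u0301".encode("utf-8"), "utf-8")
_, guessed = canonicalize(b"caf\xe9", "utf-8")
print(composed == guessed)
```

Returning both values keeps the raw bytes available to rules that need the original encoding, while Unicode-aware rules match against the normalized text.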