On 9/20/2013 2:30 PM, David F. Skoll wrote:
You won't like my answer, but...
You really *have* to normalize everything to Unicode (possibly using UTF-8
as the canonical on-disk format) before trying to apply rules or extract
Bayes tokens. Then you can do nice things like blocking CJK spams
with a rule like:
header CJK_SUBJECT Subject =~ /\p{CJK_Unified_Ideographs}/
and have absolute confidence it will work no matter how the subject is
encoded.
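
To illustrate why normalizing first matters, here is a minimal Python sketch. It is not SpamAssassin's code; the helper names (encoded_word, normalize_subject, has_cjk_ideograph) are made up for this example, and the range check U+4E00..U+9FFF is only an approximation of the base CJK Unified Ideographs block. The same two ideographs are encoded as a Subject in three charsets, and the check only matches once each header has been decoded to Unicode.

```python
import base64
from email.header import decode_header, make_header


def encoded_word(text: str, charset: str) -> str:
    """Build an RFC 2047 base64 encoded-word (hypothetical test helper)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def normalize_subject(raw: str) -> str:
    """Decode any RFC 2047 encoded-words into a single Unicode string."""
    return str(make_header(decode_header(raw)))


def has_cjk_ideograph(text: str) -> bool:
    """True if text contains a code point in U+4E00..U+9FFF."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


# The same two ideographs, encoded three different ways on the wire.
raw_subjects = [encoded_word("中文", cs) for cs in ("utf-8", "gb2312", "big5")]

for raw in raw_subjects:
    # The raw header is pure ASCII, so a Unicode-aware check cannot match it.
    # After normalization every variant is the same string and does match.
    print(raw, has_cjk_ideograph(raw), has_cjk_ideograph(normalize_subject(raw)))
```

The point of the sketch is only that one check suffices once the input is Unicode; without the decoding step, each charset would need its own byte-level pattern.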
I haven't looked extremely closely at the SpamAssassin code so I'm not
sure how its normalization works nor whether it can do the necessary
transformations for a subject rule such as my example to work.
Your answer helps, because I think I'm hitting a bit of a systemic issue
here. I'm still banging away at it.
Regards,
KAM