On Fri, 20 Sep 2013 14:20:58 -0400 "Kevin A. McGrail" <kmcgr...@pccc.com> wrote:
> As of yet, I'm not using normalize_charset and researching what hits
> things the best.

You won't like my answer, but... You really *have* to normalize
everything to Unicode (possibly using UTF-8 as the canonical on-disk
format) before trying to apply rules or extract Bayes tokens.

Then you can do nice things like blocking CJK spam with a rule like:

header CJK_SUBJECT Subject =~ /\p{CJK_Unified_Ideographs}/

and have absolute confidence it will work no matter how the subject
is encoded.

I haven't looked extremely closely at the SpamAssassin code, so I'm not
sure how its normalization works or whether it can do the necessary
transformations for a subject rule such as my example to work.

Regards,

David.
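
P.S. To illustrate the "normalize first, then match" idea outside of SpamAssassin, here is a rough Python sketch. It is not how SpamAssassin does it; the function names are made up for the example, and since Python's built-in re module does not support \p{...} properties, the CJK Unified Ideographs block is spelled out as the range U+4E00 to U+9FFF (an approximation of what the Perl rule above would match).

```python
import re
from email.header import decode_header, make_header

# Python's "re" has no \p{CJK_Unified_Ideographs}, so spell out the
# basic CJK Unified Ideographs block (U+4E00..U+9FFF) explicitly.
CJK_UNIFIED = re.compile(r"[\u4e00-\u9fff]")


def normalize_subject(raw_subject: str) -> str:
    """Decode RFC 2047 encoded-words into a single Unicode string.

    Hypothetical helper for illustration only.
    """
    return str(make_header(decode_header(raw_subject)))


def has_cjk(raw_subject: str) -> bool:
    """Return True if the normalized subject contains a CJK ideograph."""
    return CJK_UNIFIED.search(normalize_subject(raw_subject)) is not None


if __name__ == "__main__":
    # The same two ideographs, once as UTF-8 and once as GB2312.
    subjects = [
        "=?UTF-8?B?5L2g5aW9?=",
        "=?GB2312?B?xOO6ww==?=",
        "Plain ASCII subject",
    ]
    for subject in subjects:
        print(subject, has_cjk(subject))
```

The point is that the pattern is written once against Unicode code points, and the UTF-8 and GB2312 subjects both hit it because the match happens after decoding, not against the raw bytes.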