On Tue, 13 Mar 2012 10:10:41 -0400 David F. Skoll wrote:
> On Tue, 13 Mar 2012 06:42:05 -0700 (PDT)
> John Hardin <jhar...@impsec.org> wrote:
>
> > > PS: I haven't looked at SA's Bayes implementation. Can it handle
> > > words in non-western character sets properly?
>
> > It seems to. All of the Chinese-language spam I get hits BAYES_99.
>
> I took a look at the code, and it does sort-of handle non-Western
> character sets, although I wouldn't say "properly".
>
> It looks like it simply tokenizes without regard to the character set.
> So a word like "français" would be tokenized as "fran\x{c3}\x{a7}ais"
> if the source character set is UTF-8, but as "fran\x{e7}ais" if the
> source character set is ISO-8859-1. Am I misunderstanding?
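The byte-level difference described above is easy to check for yourself; this is just Python's standard str.encode as an illustration, not SA's actual code:

```python
# The same word produces different byte sequences depending on the
# source character set, so Bayes would learn two distinct tokens.
word = "français"

utf8 = word.encode("utf-8")         # ç becomes the two bytes 0xC3 0xA7
latin1 = word.encode("iso-8859-1")  # ç becomes the single byte 0xE7

print(utf8)    # b'fran\xc3\xa7ais'
print(latin1)  # b'fran\xe7ais'
```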
But in some Asian languages, what gets tokenized is going to be a phrase or sentence rather than a word, so learning is going to be slow, and the number of distinct tokens greatly reduced. The exception is long words (>15 bytes) that contain 2 or more sequential bytes with the high bit set, which are tokenized as pairs of bytes.
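A minimal sketch of that byte-pair fallback as I read it (in Python rather than SA's Perl; the 15-byte threshold and the "2 or more sequential high-bit bytes" condition are taken from the description above, not from the actual tokenizer):

```python
import re

MAX_TOKEN_LENGTH = 15  # threshold described above

def tokenize(data: bytes):
    """Whitespace-split tokenizer sketch: long words containing two
    or more sequential high-bit bytes are broken into byte pairs;
    everything else is kept as a single token."""
    tokens = []
    for word in data.split():
        if len(word) > MAX_TOKEN_LENGTH and re.search(rb'[\x80-\xff]{2}', word):
            # e.g. a run of UTF-8 CJK characters: emit 2-byte tokens
            tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
        else:
            tokens.append(word)
    return tokens
```

So a short CJK phrase of 15 bytes or fewer still becomes a single token, while a longer one yields a stream of byte pairs; note the pairs don't necessarily align with character boundaries (UTF-8 CJK characters are 3 bytes each).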