On Tue, 13 Mar 2012 10:10:41 -0400
David F. Skoll wrote:

> On Tue, 13 Mar 2012 06:42:05 -0700 (PDT)
> John Hardin <jhar...@impsec.org> wrote:
> 
> > > PS: I haven't looked at SA's Bayes implementation.  Can it handle
> > > words in non-western character sets properly?
> 
> > It seems to. All of the Chinese-language spam I get hits BAYES_99.
> 
> I took a look at the code, and it does sort-of handle non-Western
> character sets, although I wouldn't say "properly".
> 
> It looks like it simply tokenizes without regard to the character set.
> So a word like "français" would be tokenized as "fran\x{c3}\x{a7}ais"
> if the source character set is UTF-8, but as "fran\x{e7}ais" if the
> source character set is ISO-8859-1.  Am I misunderstanding?
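
Just to make the encoding dependence concrete, here is a quick Python
sketch of the bytes involved (an illustration only, not SA's actual code):

# Show the raw bytes a charset-blind tokenizer would see for the same
# word under two different source encodings.
word = "français"

print(word.encode("utf-8"))       # b'fran\xc3\xa7ais'  -> one token
print(word.encode("iso-8859-1"))  # b'fran\xe7ais'      -> a different token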

But in some Asian languages what is being tokenized is going to be a
phrase or sentence rather than a word, so learning is going to be slow
and the number of tokens greatly reduced.
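
For example (a rough illustration, not SA's actual tokenizer), a
whitespace-based split of unsegmented Chinese text sees the whole
sentence as a single "word":

# CJK text usually contains no ASCII spaces, so a whitespace-based
# split treats an entire phrase or sentence as one token.
english = "this is a test message"
chinese = "这是一封测试邮件。"   # "This is a test message.", no spaces

print(len(english.split()))                  # 5 tokens
print(len(chinese.encode("utf-8").split()))  # 1 token, 27 bytes long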

The exception is long words (>15 bytes) that contain 2 or more
sequential bytes with the high bit set, which are tokenized as pairs of
bytes.
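
Something roughly like this sketch (my approximation of the behaviour
described above, not the actual SA code), using GB2312 where each
character is two high-bit bytes:

import re

MAX_TOKEN_BYTES = 15  # the length threshold mentioned above

def split_long_highbit_word(word: bytes):
    """Approximation: a word longer than 15 bytes that contains two or
    more consecutive high-bit bytes is broken into 2-byte pairs, which
    roughly line up with characters in legacy double-byte CJK encodings."""
    if len(word) > MAX_TOKEN_BYTES and re.search(rb"[\x80-\xff]{2}", word):
        return [word[i:i+2] for i in range(0, len(word), 2)]
    return [word]

# A GB2312-encoded Chinese phrase: 8 characters, 16 bytes, no spaces.
phrase = "这是一封测试邮件".encode("gb2312")
print(split_long_highbit_word(phrase))   # eight 2-byte tokens, one per character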
