The Bayes algorithms favor sparse data with a large number of potential features. Text is one example of such data.
Using Naive Bayes with Unicode text should be fine. The simplest way to process CJK text is to use character unigrams and bigrams as features. This works very well in retrieval systems. I haven't heard whether it works for classification, but I expect it would.

On Sun, Jan 1, 2012 at 7:02 PM, Lingxiang Cheng <[email protected]> wrote:

> It's interesting that the Bayes algorithms in Mahout strongly favor text
> data over numeric data. I am thinking about using them to categorize
> Chinese websites. Has anyone used them to process Unicode text?
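
To make the character unigram and bigram suggestion concrete, here is a rough Python sketch of the tokenization step. It is not Mahout code: the function name `char_ngrams` and its `n_values` parameter are made up for illustration, and splitting on whitespace first is my own assumption so that n-grams do not span spaces. The resulting tokens would then be fed to whatever feature vectorizer the classifier uses.

```python
def char_ngrams(text, n_values=(1, 2)):
    """Return character n-grams of `text` (hypothetical helper, not a Mahout API).

    Assumption: the text is split on whitespace first, so no n-gram
    crosses a space. Unigrams come before bigrams within each chunk.
    """
    tokens = []
    for chunk in text.split():
        for n in n_values:
            # Slide a window of width n across the chunk.
            for i in range(len(chunk) - n + 1):
                tokens.append(chunk[i:i + n])
    return tokens


if __name__ == "__main__":
    # "机器学习" means "machine learning"; no word segmentation is needed.
    print(char_ngrams("机器学习"))
```

Because Chinese text has no spaces between words, overlapping bigrams stand in for a word segmenter: most two-character words show up as a bigram, at the cost of some spurious ones.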
