You will need to make sure that the tokenization is done reasonably. There is an example program for a sequential classifier in org.apache.mahout.classifier.sgd.TrainNewsGroups.
It assumes data in the 20 news groups format and uses a Lucene tokenizer. The NaiveBayes code also uses a Lucene tokenizer that you can specify on the command line.

Can you say which languages? Are they easy to tokenize (like French)? Or medium (like German/Turkish)? Or hard (like Chinese/Japanese)?

Can you say how much data?

On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <[email protected]> wrote:

> Dear All,
>
> I have a requirement where I need to classify text in a non-English
> language. I have heard that Mahout supports multi-language. Can anyone
> please tell me how do I achieve this? Some documents/links where I can
> get some examples on this, would be really really helpful.
>
> Regards
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
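
P.S. On the tokenization point: here is a rough sketch of what "reasonable" tokenization looks like for an easy language such as French. It is not the Mahout or Lucene code path. It uses only the JDK's java.text.BreakIterator as a stand-in, and the class name TokenizeSketch and its tokenize method are hypothetical names made up for this illustration. For the medium and hard cases (German compounds, Turkish morphology, Chinese/Japanese segmentation) a word-boundary split like this is unlikely to be enough, and a language-specific analyzer would probably be needed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Mahout or Lucene.
final class TokenizeSketch {

    // Splits text at locale-aware word boundaries, drops pieces that contain
    // no letter or digit (spaces, punctuation), and lowercases the rest.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // Keep only pieces that look like words, not whitespace or punctuation.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece.toLowerCase(locale));
            }
        }
        return tokens;
    }
}
```

Calling TokenizeSketch.tokenize("Le chat dort.", Locale.FRENCH) returns the three tokens le, chat, dort. Tokens of this kind are what a classifier's feature encoder would then consume.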
