You will need to make sure that the tokenization is done reasonably for your language.

There is an example program for a sequential classifier in
org.apache.mahout.classifier.sgd.TrainNewsGroups.

It assumes data in the 20 Newsgroups format and uses a Lucene tokenizer.

The NaiveBayes code also uses a Lucene tokenizer that you can specify on the
command line.
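
To make the tokenization point concrete, here is a minimal sketch of
locale-aware word splitting. It deliberately uses the JDK's
java.text.BreakIterator instead of a Lucene tokenizer so that it runs with no
extra dependencies; that substitution is my assumption for illustration, and
the class TokenizeSketch and its tokenize helper are hypothetical names, not
part of Mahout or Lucene.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not Mahout or Lucene code.
class TokenizeSketch {

    // Split text at the word boundaries the JDK reports for the given locale,
    // keep only pieces that contain a letter or digit, and lower-case them.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // Whitespace and punctuation runs are reported as pieces too; skip them.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(piece.toLowerCase(locale));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Space-delimited text splits cleanly into words.
        System.out.println(tokenize("Bonjour le monde", Locale.FRENCH));
        // A German compound stays a single token; splitting it needs extra logic.
        System.out.println(tokenize("Donaudampfschifffahrt ist lang", Locale.GERMAN));
    }
}
```

For space-delimited text this kind of splitting is usually enough to get
started. Compounding or agglutinative languages, and languages written without
spaces, generally need a language-specific analyzer, which is why the
questions below matter.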

Can you say which languages?  Are they easy to tokenize (like French),
medium (like German or Turkish), or hard (like Chinese or Japanese)?

Can you say how much data?

On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <[email protected]> wrote:

> Dear All,
>
> I have a requirement where I need to classify text in a non-English
> language. I have heard that Mahout supports multi-language. Can anyone
> please tell me how do I achieve this? Some documents/links where I can
> get some examples on this, would be really really helpful.
>
> Regards
> Bhaskar Ghosh
> Hyderabad, India
>
> http://www.google.com/profiles/bjgindia
>
> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>
>
>
