Ted, I think this is the latest tokenizer: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html
Do you have any suggestions on how I can see the intermediate tokens that get generated, so that I can verify them against a Hindi text string? (A minimal token-dumping sketch follows below the quoted thread.)

Thanks
Neil

On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning <[email protected]> wrote:

> Hindi should be pretty good to go with the default Lucene analyzer. You
> should look at the tokens to be sure they are reasonable. Punctuation and
> some other word-breaking characters in Hindi may not be handled well, but
> if the first five sentences work well, you should be OK.
>
> On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <[email protected]> wrote:
>
> > Hi Ted,
> >
> > I need to tokenize Hindi, an Indian language. I learnt from Robin earlier
> > that "the classifier supports non-English tokens (it assumes the string is
> > UTF-8 encoded)". Does that mean that the classifier would just tokenize
> > based on the Unicode encoding, so that we do not need to worry about the
> > language? Or do we need to make some configuration changes?
> >
> > I do not have knowledge of the factors that make a language harder to
> > tokenize, but I have learnt from earlier conversations on this mailing
> > list that languages in which a word is represented as a sequence of words
> > are hard to tokenize. In that sense, I can assume that words in Hindi
> > would be single words.
> >
> > Thanks
> > Bhaskar Ghosh
> > Hyderabad, India
> >
> > http://www.google.com/profiles/bjgindia
> >
> > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> >
> >
> > ________________________________
> > From: Ted Dunning <[email protected]>
> > To: [email protected]
> > Sent: Sun, 3 October, 2010 12:53:37 AM
> > Subject: Re: How to get multi-language support for training/classifying
> > text into classes through Mahout?
> >
> > You will need to make sure that the tokenization is done reasonably.
> >
> > There is an example program for a sequential classifier in
> > org.apache.mahout.classifiers.sgd.TrainNewsGroups
> >
> > It assumes data in the 20 news groups format and uses a Lucene tokenizer.
> >
> > The NaiveBayes code also uses a Lucene tokenizer that you can specify on
> > the command line.
> >
> > Can you say which languages? Are they easy to tokenize (like French)?
> > Or medium (like German/Turkish)? Or hard (like Chinese/Japanese)?
> >
> > Can you say how much data?
> >
> > On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <[email protected]>
> > wrote:
> >
> > > Dear All,
> > >
> > > I have a requirement where I need to classify text in a non-English
> > > language. I have heard that Mahout supports multiple languages. Can
> > > anyone please tell me how I achieve this? Some documents/links where I
> > > can find examples of this would be really helpful.
> > >
> > > Regards
> > > Bhaskar Ghosh
> > > Hyderabad, India
> > >
> > > http://www.google.com/profiles/bjgindia
> > >
> > > "Ignorance is Bliss... Knowledge never brings Peace!!!"

--
Thanks and Regards
Neil
http://neilghosh.com
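A minimal sketch of how the intermediate tokens could be dumped, assuming the Lucene 2.4-era TokenStream API linked above. StandardAnalyzer is used here only as a stand-in for whichever analyzer the classifier is actually configured with, and the class name, field name, and Hindi sample sentence are placeholders:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Prints every token the analyzer produces for a piece of Hindi text,
// so the word boundaries can be checked by eye.
public class DumpHindiTokens {
    public static void main(String[] args) throws Exception {
        // Placeholder sample ("This is a test."); substitute a few real
        // Hindi sentences. Assumes the source file is compiled as UTF-8.
        String hindiText = "यह एक परीक्षण है।";

        // Stand-in analyzer; swap in the one the classifier will use.
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("text", new StringReader(hindiText));

        // Lucene 2.4-style iteration: next(Token) reuses the Token instance.
        Token token = new Token();
        while ((token = stream.next(token)) != null) {
            System.out.println(token.term()
                    + " [" + token.startOffset() + "," + token.endOffset() + "]");
        }
        stream.close();
    }
}

If the printed tokens split Hindi words in the wrong places, for example around the danda "।" or other punctuation that Ted mentioned may not be handled well, that would be a sign a different analyzer is needed.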
