Ted, I think this is the latest tokenizer.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html

Do you have any suggestions on how I can see the intermediate tokens that are
generated, so that I can verify them against a Hindi text string?
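
Something like this is what I had in mind for dumping the tokens (just a rough
sketch against the Lucene 2.4 TokenStream API, using StandardAnalyzer and a
made-up sample string; please correct me if the API has changed):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ShowTokens {
      public static void main(String[] args) throws Exception {
        // Made-up Hindi sample text; any UTF-8 string should do.
        String text = "नमस्ते दुनिया";
        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
        // Lucene 2.4-style iteration; newer releases use the attribute-based API.
        Token token = new Token();
        while ((token = stream.next(token)) != null) {
          // term() holds the token text in 2.4; older code used termText().
          System.out.println(token.term());
        }
        stream.close();
      }
    }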

Thanks
Neil

On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning <[email protected]> wrote:

> Hindi should be pretty good to go with the default Lucene analyzer.  You
> should look at the tokens to be sure they are reasonable.  Punctuation and
> some other word-breaking characters in Hindi may not be handled well, but
> if the first five sentences work well, you should be OK.
>
> On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <[email protected]>
> wrote:
>
> > Hi Ted,
> >
> > I need to tokenize Hindi, an Indian language. I learnt from Robin earlier
> > that "the classifier supports non-English tokens (it assumes the string is
> > UTF-8 encoded)". Does that mean that the classifier would just tokenize
> > based on the Unicode encoding, so that we do not need to worry about the
> > language? Or do we need to do some configuration?
> >
> > I do not know which factors make a language harder to tokenize, but I have
> > learnt from earlier conversations on this mailing list that languages in
> > which a word is written as a sequence of words are hard to tokenize. In
> > that sense, I can assume that words in Hindi would be single words.
> >
> >  Thanks
> > Bhaskar Ghosh
> > Hyderabad, India
> >
> > http://www.google.com/profiles/bjgindia
> >
> > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> >
> >
> >
> >
> > ________________________________
> > From: Ted Dunning <[email protected]>
> > To: [email protected]
> > Sent: Sun, 3 October, 2010 12:53:37 AM
> > Subject: Re: How to get multi-language support for training/classifying
> > text into classes through Mahout?
> >
> > You will need to make sure that the tokenization is done reasonably.
> >
> > There is an example program for a sequential classifier in
> > org.apache.mahout.classifier.sgd.TrainNewsGroups
> >
> > It assumes data in the 20 newsgroups format and uses a Lucene tokenizer.
> >
> > The NaiveBayes code also uses a Lucene tokenizer that you can specify on
> > the command line.
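> > Roughly like this, for the 20 newsgroups Bayes example (I am going from
> > memory here, so double-check the exact class and flag names against your
> > Mahout version's help output):
> >
> >   bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
> >     -p 20news-bydate-train \
> >     -o bayes-train-input \
> >     -a org.apache.lucene.analysis.standard.StandardAnalyzer \
> >     -c UTF-8
> >
> > The -a option is where a different Lucene analyzer class can be plugged in.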
> >
> > Can you say which languages?  Are they easy to tokenize (like French)?
> > Or medium (like German/Turkish)?  Or hard (like Chinese/Japanese)?
> >
> > Can you say how much data?
> >
> > On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <[email protected]>
> > wrote:
> >
> > > Dear All,
> > >
> > > I have a requirement where I need to classify text in a non-English
> > > language. I have heard that Mahout supports multiple languages. Can
> > > anyone please tell me how to achieve this? Some documents/links where I
> > > can find examples of this would be really helpful.
> > >  Regards
> > > Bhaskar Ghosh
> > > Hyderabad, India
> > >
> > > http://www.google.com/profiles/bjgindia
> > >
> > > "Ignorance is Bliss... Knowledge never brings Peace!!!"
> > >
> > >
> > >
> >
> >
> >
>



-- 
Thanks and Regards
Neil
http://neilghosh.com
