Hi Neil,
That's really a Lucene question, not something for Mahout.
If you post to the Lucene list, you're also likely to get some useful feedback from the community about whether there are issues with tokenizing Hindi.
For example, there was an email from last summer about this same topic. A snippet:
Apart from using WhitespaceAnalyzer, which will tokenize words based on spaces, you can try writing a simple custom analyzer which will do a bit more. I did the following for handling Indic languages intermingled with English content:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

/** Analyzer for Indian languages. */
public class IndicAnalyzerIndex extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Whitespace tokenization handles Devanagari and Latin text alike;
    // the original snippet was truncated here, so further filters may follow.
    TokenStream ts = new WhitespaceTokenizer(reader);
    return ts;
  }
}
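If you want to eyeball the tokens (your question below about the intermediate tokens), something like this should work against the Lucene 3.0.x attribute API. This is a rough sketch, with a placeholder class name, field name, and sample text:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class DumpTokens {
  public static void main(String[] args) throws Exception {
    // Run some mixed Hindi/English sample text through the analyzer.
    TokenStream ts = new IndicAnalyzerIndex()
        .tokenStream("text", new StringReader("नमस्ते दुनिया hello world"));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    // Print one token per line so the word breaks can be inspected.
    while (ts.incrementToken()) {
      System.out.println(term.term());
    }
  }
}

The same loop works with any Analyzer, so you can swap in StandardAnalyzer and compare what the two produce on the same Hindi text.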
-- Ken
PS - the latest released Lucene is 3.0.2, not the 2.4.0 you reference below.
On Oct 3, 2010, at 12:10am, Neil Ghosh wrote:
Ted, I think this is the latest tokenizer.
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html
Do you have any suggestion on how I can see the intermediate tokens that are generated, so that I can verify them with Hindi text as the input string?
Thanks
Neil
On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning <[email protected]>
wrote:
Hindi should be pretty good to go with the default Lucene analyzer. You should look at the tokens to be sure they are reasonable. Punctuation and some other word-breaking characters in Hindi may not be handled well, but if the first five sentences work well, you should be OK.
On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <[email protected]>
wrote:
Hi Ted,
I need to tokenize Hindi, an Indian language. I learnt from Robin earlier that the "classifier supports non-English tokens (it assumes the string is UTF-8 encoded)". Does that mean that the classifier would just tokenize based on the Unicode encoding, so that we do not need to worry about the language? Or do we need to make some configuration?
I do not know the factors that make a language harder to tokenize, but I have learnt from earlier conversations on this mailing list that languages in which a word is represented as a sequence of words are hard to tokenize. In that sense, I can assume that words in Hindi would be single words.
Thanks
Bhaskar Ghosh
Hyderabad, India
http://www.google.com/profiles/bjgindia
"Ignorance is Bliss... Knowledge never brings Peace!!!"
________________________________
From: Ted Dunning <[email protected]>
To: [email protected]
Sent: Sun, 3 October, 2010 12:53:37 AM
Subject: Re: How to get multi-language support for training/classifying text into classes through Mahout?
You will need to make sure that the tokenization is done reasonably.
There is an example program for a sequential classifier in org.apache.mahout.classifiers.sgd.TrainNewsGroups. It assumes data in the 20 newsgroups format and uses a Lucene tokenizer. The NaiveBayes code also uses a Lucene tokenizer that you can specify on the command line.
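For example, the 20 newsgroups Bayes example lets you pass an analyzer class when preparing the data. This is a sketch with placeholder paths; check the flags against your Mahout version's help output:

$MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p /path/to/20news-bydate-train \
  -o /path/to/bayes-train-input \
  -a org.apache.lucene.analysis.WhitespaceAnalyzer \
  -c UTF-8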
Can you say which languages? Are they easy to tokenize (like French)? Or medium (like German/Turkish)? Or hard (like Chinese/Japanese)? Can you say how much data?
On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <[email protected]>
wrote:
Dear All,
I have a requirement where I need to classify text in a non-English language. I have heard that Mahout supports multiple languages. Can anyone please tell me how I can achieve this? Some documents or links where I can find examples of this would be really helpful.
Regards
Bhaskar Ghosh
Hyderabad, India
http://www.google.com/profiles/bjgindia
"Ignorance is Bliss... Knowledge never brings Peace!!!"
--
Thanks and Regards
Neil
http://neilghosh.com
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g