Yes. Post to the Lucene list and if you get an answer from Robert Muir,
listen especially carefully.
To answer the question, this code snippet could be adapted to print out the
tokens in your data (don't assume it works exactly as it stands!):

for (String line : Files.readLines(new File("my/file/here"), Charsets.UTF_8)) {
  TokenStream ts = analyzer.tokenStream("text", new StringReader(line));
  // Grab the attribute once, before the loop; it is updated in place
  // on each call to incrementToken().
  TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
  while (ts.incrementToken()) {
    words.add(termAtt.term());
  }
  ts.close();
}
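As a quick sanity check before wiring up Lucene, plain whitespace splitting plus
punctuation trimming approximates what a WhitespaceTokenizer-based analyzer
would emit. This JDK-only sketch (the class name and the mixed Hindi/English
sample line are made up for illustration, and it is not the Lucene tokenizer
itself) prints one token per line:

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenDemo {

    // Rough, JDK-only approximation of whitespace tokenization plus
    // punctuation trimming; just a way to eyeball what the tokens
    // from a whitespace-based analyzer would look like.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : line.split("\\s+")) {
            // Trim leading/trailing punctuation; \p{P} covers the
            // Devanagari danda U+0964 as well as ASCII punctuation,
            // but leaves combining vowel signs (category Mn) alone.
            String token = raw.replaceAll("^\\p{P}+|\\p{P}+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample line; "।" is the sentence-ending danda.
        for (String token : tokenize("Mahout से text classification करना है।")) {
            System.out.println(token);
        }
    }
}
```

One subtlety: the trimming must remove only punctuation (\p{P}), not every
non-letter character, because stripping Devanagari combining vowel signs off
the ends of words would mangle the Hindi tokens.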
On Sun, Oct 3, 2010 at 9:34 AM, Ken Krugler <[email protected]> wrote:
> Hi Neil,
>
> That's really a Lucene question, not something for Mahout.
>
> If you post to the Lucene list, you're also likely to get some useful
> feedback from the community about whether there are issues with tokenizing
> Hindi.
>
> E.g. there was an email from last summer about this same topic. Snippet is:
>
>> Apart from using WhitespaceAnalyzer, which will tokenize words based on
>> spaces, you can try writing a simple custom analyzer which'll do a bit
>> more. I did the following for handling Indic languages intermingled with
>> English content:
>>
>> /**
>>  * Analyzer for Indian language.
>>  */
>> public class IndicAnalyzerIndex extends Analyzer {
>>     public TokenStream tokenStream(String fieldName, Reader reader) {
>>         TokenStream ts = new WhitespaceTokenizer(reader);
>>         /**
>>
>
>
> -- Ken
>
> PS - the latest released Lucene is 3.0.2, not the 2.4.0 you reference
> below.
>
>
> On Oct 3, 2010, at 12:10am, Neil Ghosh wrote:
>
>> Ted, I think this is the latest tokenizer:
>>
>>
>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/Token.html
>>
>> Do you have any suggestions on how I can see the intermediate tokens
>> generated, so that I can verify them with Hindi text as the string?
>>
>> Thanks
>> Neil
>>
>> On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning <[email protected]>
>> wrote:
>>
>>> Hindi should be pretty good to go with the default Lucene analyzer.
>>> You should look at the tokens to be sure they are reasonable.
>>> Punctuation and some other word-breaking characters in Hindi may not
>>> be handled well, but if the first five sentences work well, you
>>> should be OK.
>>>
>>> On Sat, Oct 2, 2010 at 9:31 PM, Bhaskar Ghosh <[email protected]>
>>> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> I need to tokenize Hindi, an Indian language. I learnt from Robin
>>>> earlier that "the classifier supports non-English tokens (it assumes
>>>> the string is UTF-8 encoded)". Does that mean that the classifier
>>>> would just tokenize based on the Unicode encoding, so that we do not
>>>> need to worry about the language? Or do we need to make some
>>>> configuration?
>>>>
>>>> I do not have knowledge of the factors that make a language harder to
>>>> tokenize. But I have learnt from earlier conversations on this mailing
>>>> list that languages in which a word is represented as a sequence of
>>>> words are hard to tokenize. In that sense, I can assume that words in
>>>> Hindi would be single words.
>>>>
>>>> Thanks
>>>> Bhaskar Ghosh
>>>> Hyderabad, India
>>>>
>>>> http://www.google.com/profiles/bjgindia
>>>>
>>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Ted Dunning <[email protected]>
>>>> To: [email protected]
>>>> Sent: Sun, 3 October, 2010 12:53:37 AM
>>>> Subject: Re: How to get multi-language support for training/classifying
>>>> text
>>>> into classes through Mahout?
>>>>
>>>> You will need to make sure that the tokenization is done reasonably.
>>>>
>>>> There is an example program for a sequential classifier in
>>>> org.apache.mahout.classifier.sgd.TrainNewsGroups
>>>>
>>>> It assumes data in the 20 news groups format and uses a Lucene
>>>> tokenizer.
>>>>
>>>> The NaiveBayes code also uses a Lucene tokenizer that you can specify on
>>>> the
>>>> command line.
>>>>
>>>> Can you say which languages? Are they easy to tokenize (like French)?
>>>> Or medium (like German/Turkish)? Or hard (like Chinese/Japanese)?
>>>>
>>>> Can you say how much data?
>>>>
>>>> On Sat, Oct 2, 2010 at 8:46 AM, Bhaskar Ghosh <[email protected]>
>>>> wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I have a requirement where I need to classify text in a non-English
>>>>> language. I have heard that Mahout supports multiple languages. Can
>>>>> anyone please tell me how I can achieve this? Some documents/links
>>>>> where I can get some examples on this would be really helpful.
>>>>> Regards
>>>>> Bhaskar Ghosh
>>>>> Hyderabad, India
>>>>>
>>>>> http://www.google.com/profiles/bjgindia
>>>>>
>>>>> "Ignorance is Bliss... Knowledge never brings Peace!!!"
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks and Regards
>> Neil
>> http://neilghosh.com
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
>