Re: Language recognition

Hannes Carl Meyer Mon, 08 Dec 2008 01:53:50 -0800

Hi Tommaso,

one common method for language recognition is based on character n-grams.
There are also some Java implementations out there, for example NGramJ:
http://ngramj.sourceforge.net/
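
To give a rough idea of how the n-gram approach works, here is a minimal
sketch in Python. It is not the NGramJ or Nutch code. It assumes character
trigrams and a simple rank-based "out-of-place" distance between the
document's n-gram profile and one profile per language. The function names
and the tiny training samples are made up for illustration; real profiles
need far more text per language.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    # Normalise whitespace and pad with spaces so word boundaries
    # contribute n-grams as well.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles.

    N-grams that do not occur in lang_profile get a fixed maximum penalty.
    """
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[g]) if g in rank else max_penalty
        for i, g in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog runs after the fox through the field",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft der hund dem fuchs durch das feld nach",
    "it": "la volpe veloce salta sopra il cane pigro e poi "
          "il cane corre dietro alla volpe attraverso il campo",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Calling guess_language(some_text, PROFILES) returns the key of the closest
profile. With profiles built from larger corpora this tends to work without
any tokenizer or dictionary, which is the main attraction of the method.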


Nutch (the crawler from the Lucene project) also uses the n-gram approach;
you can find some information about it here:
http://wiki.apache.org/nutch/LanguageIdentifier
and here:
http://wiki.apache.org/nutch/LanguageIdentifierPlugin

I wouldn't suggest reinventing the wheel unless it is a bigger, faster one!

Regards

Hannes
---
http://mimblog.de

On Mon, Dec 8, 2008 at 10:23 AM, Tommaso Teofili
<[EMAIL PROTECTED]> wrote:

> Hello,
> I am writing an AE pipeline and I need to recognize which language the
> starting document is written in.
> My idea is to use the Whitespace Tokenizer and the HMM Tagger together to
> analyze the extracted tokens, calculate the percentage of known tokens
> for each language (against a dictionary) and then select the language
> with the highest percentage...
> Do you know other (better) language recognition methods?
> Thanks.
> Tommaso
>
