Hi Tommaso,

one common method for language recognition is based on n-grams. There are also some Java implementations out there, for example NGramJ: http://ngramj.sourceforge.net/
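To make the n-gram idea a bit more concrete, here is a rough sketch of one well-known variant: build a ranked list of the most frequent character trigrams for each language, do the same for the document, and pick the language whose ranking is closest ("out-of-place" distance). This is only an illustration under my own assumptions, not the NGramJ or Nutch API: the class name, the method names, the trigram size, the profile size and the tiny training sentences are all made up for the example. A real profile would be trained on far more text.

```java
import java.util.*;

// Hypothetical illustration of n-gram language guessing; not NGramJ or Nutch code.
class NGramLanguageGuesser {
    private static final int N = 3;              // assumed n-gram length (trigrams)
    private static final int PROFILE_SIZE = 300; // assumed maximum profile length

    private final Map<String, List<String>> profiles = new HashMap<>();

    // Returns the text's character n-grams ordered from most to least frequent.
    static List<String> profile(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + N <= s.length(); i++) {
            counts.merge(s.substring(i, i + N), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count; break ties alphabetically so the order is stable.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> ranked = new ArrayList<>();
        for (int i = 0; i < Math.min(PROFILE_SIZE, entries.size()); i++) {
            ranked.add(entries.get(i).getKey());
        }
        return ranked;
    }

    // Sums, over the document's n-grams, how far each one's rank is from its rank
    // in the language profile; n-grams missing from the profile get a fixed penalty.
    static int distance(List<String> doc, List<String> lang) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < lang.size(); i++) {
            rank.put(lang.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer j = rank.get(doc.get(i));
            total += (j == null) ? PROFILE_SIZE : Math.abs(i - j);
        }
        return total;
    }

    void train(String language, String sample) {
        profiles.put(language, profile(sample));
    }

    // Returns the trained language with the smallest distance, or null if none is trained.
    String guess(String text) {
        List<String> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser();
        // Toy training samples, far too small for real use.
        guesser.train("en", "the quick brown fox jumps over the lazy dog and then the other dog");
        guesser.train("it", "la volpe veloce salta sopra il cane pigro e poi sopra l'altro cane");
        System.out.println(guesser.guess("the dog and the fox"));
    }
}
```

Unlike the dictionary idea, this needs no tokenizer or tagger and no word list per language, only a sample of text for each one, which is a large part of why the approach is popular.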
Nutch (the crawler from the Lucene project) also uses the n-gram approach; you can find some information about it here:
http://wiki.apache.org/nutch/LanguageIdentifier
and here:
http://wiki.apache.org/nutch/LanguageIdentifierPlugin

I wouldn't suggest reinventing the wheel unless it is a bigger, faster one!

Regards
Hannes
---
http://mimblog.de

On Mon, Dec 8, 2008 at 10:23 AM, Tommaso Teofili <[EMAIL PROTECTED]> wrote:
> Hello,
> I am writing an AE pipeline and I need to recognize in which language the
> starting document is written.
> My idea is to use the Whitespace Tokenizer and the HMM Tagger together in
> order to analyze the extracted tokens, calculate the percentage of well
> known tokens for each language (against a dictionary) and then select the
> highest percentage value language...
> Do you know other (better) language recognition methods?
> Thanks.
> Tommaso
