Hi Tommaso, you could use TextCat http://odur.let.rug.nl/~vannoord/TextCat/
or one of its competitors:
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

-Torsten

> -----Original Message-----
> From: Tommaso Teofili [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 08, 2008 10:23 AM
> To: [email protected]
> Subject: Language recognition
>
> Hello,
> I am writing an AE pipeline and I need to recognize in which language the
> starting document is written.
> My idea is to use the Whitespace Tokenizer and the HMM Tagger together in
> order to analyze the extracted tokens, calculate the percentage of
> well-known tokens for each language (against a dictionary) and then select
> the language with the highest percentage...
> Do you know other (better) language recognition methods?
> Thanks.
> Tommaso
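
For comparison, here is a rough sketch of the dictionary idea from the quoted message: split on whitespace, compute the fraction of tokens found in each language's word list, and pick the language with the highest fraction. It is plain Python rather than a UIMA annotator, and the word lists, the language codes and the function names (tokenize, language_scores, guess_language) are all made up for illustration. Real dictionaries would be far larger, and the HMM Tagger step is left out.

```python
# Hypothetical, deliberately tiny word lists; a real setup would load
# full dictionaries for each language from files.
DICTIONARIES = {
    "en": {"the", "and", "is", "in", "of", "to", "a", "document", "language"},
    "it": {"il", "la", "e", "di", "che", "in", "un", "documento", "lingua", "è"},
    "de": {"der", "die", "das", "und", "ist", "in", "ein", "dokument", "sprache"},
}


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    stripped = (t.strip(".,;:!?\"'()[]").lower() for t in text.split())
    return [t for t in stripped if t]


def language_scores(text, dictionaries=DICTIONARIES):
    """Return, per language, the fraction of tokens found in its word list."""
    tokens = tokenize(text)
    if not tokens:
        return {lang: 0.0 for lang in dictionaries}
    return {
        lang: sum(t in words for t in tokens) / len(tokens)
        for lang, words in dictionaries.items()
    }


def guess_language(text, dictionaries=DICTIONARIES):
    """Pick the language with the highest fraction, or None if nothing matched."""
    scores = language_scores(text, dictionaries)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("Il documento è in lingua italiana"))
```

One thing this makes visible: short words shared between languages (like "in" above) count for every language that lists them, so short inputs can easily produce near-ties. That is part of why character n-gram tools such as TextCat may be the more robust choice.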
