Hi Tommaso,

You could use TextCat:
http://odur.let.rug.nl/~vannoord/TextCat/

or one of its competitors:
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
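
TextCat is based on character n-gram frequency profiles (the Cavnar and
Trenkle approach) rather than on dictionary lookups. To give a rough idea
of how that works, here is a minimal Python sketch. It is not TextCat's code
or API: the function names, the parameters (max_n, top_k) and the two toy
training sentences are made up for illustration, and real profiles need far
more text per language.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the most frequent character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language profile get the maximum penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[g]) if g in rank else max_penalty
        for r, g in enumerate(doc_profile)
    )


def detect_language(text, profiles):
    """Pick the language whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples (hypothetical); replace with large corpora per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft weg",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling detect_language(document_text, PROFILES) returns the key of the
closest profile. Unlike the dictionary idea, this needs no tokenizer or
word list per language, only some sample text for each one.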

-Torsten 

> -----Original Message-----
> From: Tommaso Teofili [mailto:[EMAIL PROTECTED] 
> Sent: Monday, December 08, 2008 10:23 AM
> To: [email protected]
> Subject: Language recognition
> 
> Hello,
> I am writing an AE pipeline and I need to recognize which language the
> source document is written in.
> My idea is to use the Whitespace Tokenizer and the HMM Tagger together to
> analyze the extracted tokens, calculate the percentage of known tokens for
> each language (against a dictionary), and then select the language with
> the highest percentage.
> Do you know of other (better) language recognition methods?
> Thanks.
> Tommaso
> 
