Hi,

I tried both NGramJ and LanguageWare for automatic language recognition in text documents.

NGramJ does not work reliably on Italian documents, though it gets the job done for French and English (technical documents included). LanguageWare is a little harder to configure, but it works much better across many languages, Italian included. It also has some interesting features, such as a "language candidates" collection listing the possible languages for a document, which is useful when uncertainty is high.

Bye,
Tommaso
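P.S. For anyone who wants to compare against a baseline, here is a rough sketch of the dictionary-coverage idea from my first message, extended to return a ranked list of candidates for the uncertain cases. It is not the NGramJ or LanguageWare API: the word lists are tiny made-up samples, and the names WORDLISTS, language_candidates, detect and min_margin are my own assumptions for illustration.

```python
import re

# Hypothetical, tiny function-word lists; a real setup would load full dictionaries.
WORDLISTS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "for", "with"},
    "it": {"il", "la", "che", "di", "e", "è", "per", "non", "un", "una", "con"},
    "fr": {"le", "la", "les", "et", "est", "de", "des", "un", "une", "pour", "dans"},
}


def language_candidates(text):
    """Return (language, coverage) pairs sorted by coverage, best first.

    Coverage is the fraction of tokens found in that language's word list.
    """
    # Split into runs of letters (Unicode-aware), dropping digits and punctuation.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return []
    scores = {
        lang: sum(1 for t in tokens if t in words) / len(tokens)
        for lang, words in WORDLISTS.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def detect(text, min_margin=0.05):
    """Return the best language, or None when the result is too uncertain."""
    ranked = language_candidates(text)
    if not ranked or ranked[0][1] == 0:
        return None  # no tokens, or no token matched any word list
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None  # top two candidates are too close to call
    return ranked[0][0]


if __name__ == "__main__":
    sample = "Il gatto è sul tavolo e non vuole scendere per la cena"
    print(language_candidates(sample))
    print(detect(sample))
```

When detect returns None, the caller can still inspect language_candidates and decide what to do with the ranking. With short texts or closely related languages the margin will often be small, so the threshold would need tuning on real documents.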
2008/12/9 Tommaso Teofili <[email protected]>

> Hi,
> I think I'll give IBM LanguageWare a look because it seems very interesting
> and I can easily plugin it into my existing annotator pipeline.
> I'll also try NGramJ and see which one has better performance.
> My goal is to recognize English, Italian and French.
> Thanks to all, I'll let you know here my results.
> Tommaso
>
> 2008/12/8 D.J. McCloskey <[email protected]>
>
>> Hi Tommaso,
>>
>> I saw the mail below on MarkMail and thought you might find what you need
>> at http://www.alphaworks.ibm.com/tech/lrw.
>> There's a new improved version coming soon but as it stands you will find
>> automatic language identification annotator there which is fast and easy to
>> improve. It also classifies languages when a sufficient confidence is not
>> reached into complex text or simple text, essentially indicating whether
>> ngramming or whitespace tokenization would be appropriate for further
>> interrogation. Which languages are you interested in?
>>
>> The technology is available for evaluation and if you have further interest
>> and would like to know more I'd be happy to help you.
>>
>> Subject: Language recognition
>> From: Tommaso Teofili ([email protected])
>> Date: 12/08/2008 01:22:52 AM
>> List: org.apache.incubator.uima-user
>>
>> Hello,
>>
>> I am writing an AE pipeline and i need to recognize in which language the
>> starting document is written. My idea is to use the Whitespace Tokenizer
>> and the HMM Tagger together in order to analyze the extracted tokens,
>> calculate the percentage of well known tokens for each language (against a
>> dictionary) and then select the highest percentage value language... Do you
>> know other (better) language recognition methods? Thanks.
>> Tommaso
>>
>> Regards,
>> -DJ
>> -------------------
>> D.J McCloskey
>> IBM LanguageWare Architect
>> Email: [email protected]
>>
>> ... our external website:
>> http://www-306.ibm.com/software/globalization/topics/languageware/index.jsp
>> ... our Alphaworks: http://www.alphaworks.ibm.com/tech/lrw
>> ... our Wikipedia: http://en.wikipedia.org/wiki/Languageware
>>
>> IBM Ireland Product Distribution Limited registered in Ireland with number
>> 92815. Registered office: Oldbrook House, 24-32 Pembroke Road,
>> Ballsbridge, Dublin 4
