Hi,

if n-gram based language recognition gives you poor results for a specific language, try excluding the profiles of the languages you don't need to recognize!

Regards,
Hannes
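To illustrate the suggestion above, here is a minimal sketch of rank-based character n-gram matching in which the candidate profiles can be restricted to the languages of interest. It is a toy illustration, not NGramJ's API: the function names (`ngram_profile`, `out_of_place`, `identify`), the `allowed` parameter, and the tiny training samples are all assumptions made for this example, and real profiles would be built from far larger corpora.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (length 1..n_max), most frequent first."""
    # Normalize case and whitespace, and pad so word boundaries become n-grams too.
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    )
    # Sort by descending count, then alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )

def identify(text, profiles, allowed=None):
    """Return the closest language, considering only the codes in `allowed` (all if None)."""
    candidates = {
        lang: prof for lang, prof in profiles.items()
        if allowed is None or lang in allowed
    }
    if not candidates:
        raise ValueError("no candidate profiles left after filtering")
    doc = ngram_profile(text)
    return min(candidates, key=lambda lang: out_of_place(doc, candidates[lang]))

# Hypothetical, deliberately tiny training samples (assumption for this sketch).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs through the forest",
    "it": "la volpe veloce salta sopra il cane pigro e corre attraverso la foresta",
    "fr": "le renard rapide saute par-dessus le chien paresseux et court dans la forêt",
    "es": "el zorro veloz salta sobre el perro perezoso y corre por el bosque",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

With these names, a call such as `identify(document_text, PROFILES, allowed={"en", "it", "fr"})` never returns `"es"`, so a closely related language that is not needed cannot be picked by mistake.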
On Sun, Dec 21, 2008 at 6:55 PM, Tommaso Teofili <[email protected]> wrote:

> Hi,
> I tried both NGramJ and LanguageWare for automatic language recognition in
> text documents.
> NGramJ does not work very well with all Italian-language documents, while it
> gets the job done for French and English (tech docs too).
> LanguageWare is a little more difficult to configure, but it works much
> better with many languages (Italian included). Furthermore, it has some
> interesting features, such as a "language candidates" collection of possible
> languages for the document, which is useful in case of high uncertainty.
> Bye,
> Tommaso
>
> 2008/12/9 Tommaso Teofili <[email protected]>
>
> > Hi,
> > I think I'll give IBM LanguageWare a look because it seems very interesting
> > and I can easily plug it into my existing annotator pipeline.
> > I'll also try NGramJ and see which one performs better.
> > My goal is to recognize English, Italian and French.
> > Thanks to all, I'll post my results here.
> > Tommaso
> >
> > 2008/12/8 D.J. McCloskey <[email protected]>
> >
> >> Hi Tommaso,
> >>
> >> I saw the mail below on MarkMail and thought you might find what you need
> >> at http://www.alphaworks.ibm.com/tech/lrw.
> >> There's a new, improved version coming soon, but as it stands you will find
> >> an automatic language identification annotator there, which is fast and
> >> easy to improve. When sufficient confidence is not reached, it also
> >> classifies the language as complex text or simple text, essentially
> >> indicating whether n-gramming or whitespace tokenization would be
> >> appropriate for further interrogation. Which languages are you interested in?
> >>
> >> The technology is available for evaluation, and if you have further
> >> interest and would like to know more, I'd be happy to help you.
> >>
> >> Subject: Language recognition
> >> From: Tommaso Teofili ([email protected])
> >> Date: 12/08/2008 01:22:52 AM
> >> List: org.apache.incubator.uima-user
> >>
> >> Hello,
> >>
> >> I am writing an AE pipeline and I need to recognize the language in which
> >> the starting document is written. My idea is to use the Whitespace Tokenizer
> >> and the HMM Tagger together to analyze the extracted tokens, calculate the
> >> percentage of well-known tokens for each language (against a dictionary),
> >> and then select the language with the highest percentage... Do you know
> >> other (better) language recognition methods? Thanks. Tommaso
> >>
> >> Regards,
> >> -DJ
> >> -------------------
> >> D.J McCloskey
> >> IBM LanguageWare Architect
> >> Email: [email protected]
> >>
> >> ... our external website:
> >> http://www-306.ibm.com/software/globalization/topics/languageware/index.jsp
> >> ... our Alphaworks: http://www.alphaworks.ibm.com/tech/lrw
> >> ... our Wikipedia: http://en.wikipedia.org/wiki/Languageware
> >>
> >> IBM Ireland Product Distribution Limited registered in Ireland with number
> >> 92815. Registered office: Oldbrook House, 24-32 Pembroke Road,
> >> Ballsbridge, Dublin 4
