Hi Otis,

But this is not freeware, right?

On 2/17/09, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
>
> Hi,
>
> No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak for
> its accuracy or speed (but I know the code has been around for
> years). Another LangID implementation is at the URL below my name.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ________________________________
> From: revathy arun <revas...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, February 17, 2009 6:39:40 PM
> Subject: Re: Multilanguage
>
> Does Apache Tika help find the language of the given document?
>
> On 2/17/09, Till Kinstler <kinst...@gbv.de> wrote:
> >
> > Paul Libbrecht wrote:
> >
> >> Clearly, then, something that matches words in a dictionary and decides
> >> on the language based on the language of the majority could do a decent
> >> job to decide the analyzer.
> >>
> >> Does such a tool exist?
> >
> > I once played around with http://ngramj.sourceforge.net/ for language
> > guessing. It did a good job. It doesn't use dictionaries for language
> > identification but a statistical approach using ngrams.
> > I don't have any precise numbers, but out of about 10000 documents in
> > different languages (most in English, German and French, few in other
> > European languages like Polish) there were only some 10 not identified
> > correctly.
> >
> > Till
> >
> > --
> > Till Kinstler
> > Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> > Platz der Göttinger Sieben 1, D 37073 Göttingen
> > kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
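
For readers wondering what "a statistical approach using ngrams" can look like in practice, here is a minimal Python sketch of one well-known technique: rank the most frequent character n-grams of a text and compare that ranking against per-language profiles with an "out-of-place" distance. This is not ngramj's code or API, and I am not claiming ngramj works exactly this way. The function names (`ngram_profile`, `out_of_place`, `guess_language`), the parameters, and the tiny training sentences are all made up for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the `top` most frequent character n-grams (lengths 1..n_max)."""
    # Normalise case and whitespace, and pad so word boundaries become n-grams.
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    are charged a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the key of the language profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy training data; real profiles need far larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is on the table",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist auf dem tisch",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et le chat est sur la table",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("die katze und der hund", PROFILES))
```

With profiles built from a few sentences like these, results on short inputs are unreliable; the accuracy Till reports would depend on profiles trained on much more text per language.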