Re: Multilanguage

revathy arun Tue, 17 Feb 2009 08:58:04 -0800

Hi Otis,

But this is not freeware, right?





On 2/17/09, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
>
> Hi,
>
> No, Tika doesn't do LangID.  I haven't used ngramj, so I can't speak to
> its accuracy or speed (but I know the code has been around for
> years).  Another LangID implementation is at the URL below my name.
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
>
> ________________________________
> From: revathy arun <revas...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, February 17, 2009 6:39:40 PM
> Subject: Re: Multilanguage
>
> Does Apache Tika help find the language of the given document?
>
>
>
> On 2/17/09, Till Kinstler <kinst...@gbv.de> wrote:
> >
> > Paul Libbrecht wrote:
> >
> >> Clearly, then, something that matches words in a dictionary and decides
> >> on the language based on the language of the majority could do a decent
> >> job to decide the analyzer.
> >>
> >> Does such a tool exist?
> >>
> >
> > I once played around with http://ngramj.sourceforge.net/ for language
> > guessing. It did a good job. It doesn't use dictionaries for language
> > identification but a statistical approach using ngrams.
> > I don't have any precise numbers, but out of about 10000 documents in
> > different languages (most in English, German and French, a few in other
> > European languages like Polish) there were only some 10 not identified
> > correctly.
> >
> > Till
> >
> > --
> > Till Kinstler
> > Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> > Platz der Göttinger Sieben 1, D 37073 Göttingen
> > kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
> >
>
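
For readers wondering what "a statistical approach using ngrams" looks like in practice, here is a minimal sketch in Python. It is not ngramj's code and makes no claim about how ngramj works internally. It assumes a rank-order comparison of character n-gram profiles, and the function names, the tiny SAMPLES texts and the parameter defaults are all made up for illustration. Real profiles would be trained on far more text per language.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Map the most frequent character n-grams (1..n_max) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so ranks are stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy training data, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist die sprache des hauses",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et ceci est la langue de la maison",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("the dog is in the house", PROFILES))
```

The guessed language could then be used to pick the analyzer (or the target field) at index time. No dictionary is needed, only one sample text per language to build the profiles.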
