RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
October 2017 15:25 > To: user@nutch.apache.org > Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin > > Hello, > > Not sure what the problem is but, buried deep in our parser we also use > Optimaize, previously lang-detect. We load models once, inside a s
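
The snippet above is cut off, so how the models are actually loaded is not visible. As a rough illustration of the "load models once" idea only, here is a minimal sketch using the initialization-on-demand holder idiom from plain Java. The class name `LanguageModels`, its toy word table, and the `guess` method are hypothetical stand-ins; none of this is the Optimaize, lang-detect, or Tika API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a set of language models that is expensive to load.
final class LanguageModels {
    // Counts how many times the (simulated) expensive load has run.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private final Map<String, String> wordToLanguage = new HashMap<>();

    private LanguageModels() {
        LOAD_COUNT.incrementAndGet();
        // Toy data in place of real model files (assumption for illustration).
        wordToLanguage.put("the", "en");
        wordToLanguage.put("der", "de");
        wordToLanguage.put("le", "fr");
    }

    // The JVM initializes Holder lazily, exactly once, and in a thread-safe way,
    // so every caller shares one loaded instance.
    private static final class Holder {
        static final LanguageModels INSTANCE = new LanguageModels();
    }

    static LanguageModels get() {
        return Holder.INSTANCE;
    }

    // Returns the language of the first known token, or "unknown".
    String guess(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String language = wordToLanguage.get(token);
            if (language != null) {
                return language;
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(LanguageModels.get().guess("Der Hund"));
        System.out.println("loads: " + LOAD_COUNT.get());
    }
}
```

Repeated calls to `LanguageModels.get()` return the same object, so the cost of loading is paid once per JVM rather than once per parsed document.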

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Markus Jelsma
e.org > Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin > > Hi Markus, > > Can you please explain what you mean by "our parser", because I'm pretty > sure the language-identifier plugin is not using Optimaize. > > Thanks, >

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Markus Jelsma
- > From:Sebastian Nagel <wastl.na...@googlemail.com> > Sent: Tuesday 24th October 2017 14:11 > To: user@nutch.apache.org > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin > > Hi Yossi, > > > does not separate the Detector object, wh

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Sebastian Nagel
at the project has not seen a single commit in > the last 4 years, and the usage numbers are also quite low, gives me pause... > > >> -Original Message- >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] >> Sent: 24 October 2017 13:18 >> To: u

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
stl.na...@googlemail.com] > Sent: 24 October 2017 13:18 > To: user@nutch.apache.org > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin > > Hi Yossi, > > sorry, while fast-reading I thought it was about the old LanguageIdentifier. > >

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Sebastian Nagel
> Yossi. > > >> -Original Message- >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] >> Sent: 24 October 2017 12:41 >> To: user@nutch.apache.org >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin >>

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
stl.na...@googlemail.com] > Sent: 24 October 2017 12:41 > To: user@nutch.apache.org > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin > > Hi Yossi, > > why not port it to use > > http://tika.apache.org/1.16/api/org/apache/tika/languag

Re: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Sebastian Nagel
Hi Yossi, why not port it to use http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html The upgrade to Tika 1.16 is already in progress (NUTCH-2439). Sebastian On 10/24/2017 11:26 AM, Yossi Tamari wrote: > Hi > > > > The language-identifier plugin uses >

Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Hi The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier for extracting the language from the document text. There are two issues with that: 1. LanguageIdentifier is deprecated in Tika. 2. It does not support CJK languages (and I suspect a lot of other