Hi Sebastian,

Please reread the second paragraph of my email 😊.
In short, it is not possible to initialize the detector in setConf and then 
reuse it, and initializing it per call would be extremely slow.
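
To make the trade-off concrete, here is a toy sketch. All class names below (ToyModel, StatefulDetector, SharedDetector) are made up for illustration and are not the Tika or Optimaize API; the "model" is just a tiny word table standing in for the real language profiles. It only shows why a detector that accumulates text inside the instance has to be created per call, while one that takes the text as an argument can be built once and reused.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive language model (NOT a real Tika/Optimaize class).
final class ToyModel {
    // Counts how often the "expensive" load happens.
    static final AtomicInteger LOADS = new AtomicInteger();
    private final Map<String, String> markerToLang = new HashMap<>();

    ToyModel() {
        LOADS.incrementAndGet(); // stands in for reading all language profiles
        markerToLang.put("the", "en");
        markerToLang.put("der", "de");
        markerToLang.put("il", "it");
    }

    // Returns the language of the first known marker word, or "unknown".
    String classify(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String lang = markerToLang.get(token);
            if (lang != null) {
                return lang;
            }
        }
        return "unknown";
    }
}

// Stateful style: text is accumulated inside the instance, then detect() is called.
// Sharing one instance between concurrent callers would mix their text, so each
// call needs its own instance, and here each instance loads its own model.
final class StatefulDetector {
    private final ToyModel model = new ToyModel();
    private final StringBuilder buffer = new StringBuilder();

    void addText(String text) {
        buffer.append(text).append(' ');
    }

    String detect() {
        return model.classify(buffer.toString());
    }
}

// Stateless style: the model is loaded once and the text is passed per call,
// so one instance can be created up front and reused.
final class SharedDetector {
    private final ToyModel model = new ToyModel();

    String detect(String text) {
        return model.classify(text);
    }
}
```

With the stateless shape, a single instance created once (for example during plugin configuration) can serve every document; with the stateful shape, the load counter goes up once per document.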

        Yossi.


> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: 24 October 2017 12:41
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> why not port it to use
> 
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
> 
> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> 
> Sebastian
> 
> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > Hi
> >
> >
> >
> > The language-identifier plugin uses
> > org.apache.tika.language.LanguageIdentifier for extracting the
> > language from the document text. There are two issues with that:
> >
> > 1.  LanguageIdentifier is deprecated in Tika.
> > 2.  It does not support CJK languages (and I suspect a lot of other
> > languages -
> > https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> > and it doesn't even fail gracefully with them - in my experience
> > Chinese was recognized as Italian.
> >
> >
> >
> > Since in Tika LanguageIdentifier was superseded by
> > org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> > make that change in the plugin as well. However, the design of
> > LanguageDetector is terrible: it makes the implementation
> > non-reentrant, meaning the full language model would have to be
> > reloaded on each call to the detector.
> >
> >
> >
> > For my needs, I have modified the plugin to use
> > com.optimaize.langdetect.LanguageDetector directly, which is what
> > Tika's LanguageDetector uses internally (at least by default). My
> > question is whether that is a change that should be made to the official 
> > plugin.
> >
> >
> >
> > Thanks,
> >
> >                Yossi.
> >
> >

