Hi Sebastian,

Please reread the second paragraph of my email 😊. In short, it is not possible to initialize the detector in setConf and then reuse it, and initializing it per call would be extremely slow.

Yossi.

> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: 24 October 2017 12:41
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
>
> Hi Yossi,
>
> why not port it to use
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDetector.html
>
> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
>
> Sebastian
>
> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > Hi
> >
> > The language-identifier plugin uses
> > org.apache.tika.language.LanguageIdentifier for extracting the
> > language from the document text. There are two issues with that:
> >
> > 1. LanguageIdentifier is deprecated in Tika.
> > 2. It does not support CJK language (and I suspect a lot of other
> >    languages -
> >    https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> >    and it doesn't even fail gracefully with them - in my experience
> >    Chinese was recognized as Italian.
> >
> > Since in Tika LanguageIdentifier was superseded by
> > org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> > make that change in the plugin as well. However, because the design of
> > LanguageDetector is terrible, it makes the implementation not
> > reentrant, meaning the full language model would have to be reloaded
> > on each call to the detector.
> >
> > For my needs, I have modified the plugin to use
> > com.optimaize.langdetect.LanguageDetector directly, which is what
> > Tika's LanguageDetector uses internally (at least by default). My
> > question is whether that is a change that should be made to the official
> > plugin.
> >
> > Thanks,
> >
> > Yossi.