Hi,

We see this too with Japanese, where just a few kanji can be enough to spoil the detection. The only solution I see is creating a better model.
Markus

-----Original message-----
> From:ar...@codec.ai <ar...@codec.ai>
> Sent: Friday 6th April 2018 12:51
> To: user@tika.apache.org
> Subject: Re: Tika detects short Japanese sentences as Chinese
>
> Hi Ken, yes it's OptimaizeLangDetector.
> Should I post it to optimaize mailing list?
>
> On 2018/04/05 18:42:25, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> > Hi Artur,
> >
> > Is the detector that you get back from getDefaultLanguageDetector the
> > OptimaizeLangDetector?
> >
> > — Ken
> >
> >
> > > On Apr 3, 2018, at 2:55 AM, Artur Rashitov <ar...@codec.ai> wrote:
> > >
> > > Given the following code:
> > >
> > > val japanese = "私はガラスを食べられます。それは私を傷つけません。"
> > > LanguageDetector.getDefaultLanguageDetector.loadModels().detectAll(japanese)
> > >
> > > it produces [zh-CN: MEDIUM (0.579961), zh-TW: MEDIUM (0.405015)]
> > > And the same thing for many short Japanese sentences.
> > >
> > > Apache Tika 1.17
> >
> > --------------------------------------------
> > http://about.me/kkrugler
> > +1 530-210-6378
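
For the short-sentence case discussed in this thread, one possible stopgap until a better model exists is a script check applied after detection: hiragana and katakana do not normally occur in Chinese text, so their presence is a strong hint that a "zh" result is really Japanese. The sketch below is only an illustration under that assumption. It uses the JDK's `Character.UnicodeScript` and does not call Tika or Optimaize at all; the object `KanaCheck` and the helpers `containsKana` and `preferJapanese` are hypothetical names, not part of either library. It will not help with Japanese text written entirely in kanji, and it would mislabel Chinese text that quotes kana.

```scala
import java.lang.Character.UnicodeScript

// Hypothetical helper, not part of Tika or Optimaize.
object KanaCheck {

  // True if the text contains at least one hiragana or katakana code point.
  def containsKana(text: String): Boolean =
    text.codePoints().toArray.exists { cp =>
      val script = UnicodeScript.of(cp)
      script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA
    }

  // Overrides a Chinese language tag (e.g. "zh-CN", "zh-TW") with "ja"
  // when the text contains kana; any other tag is returned unchanged.
  def preferJapanese(detectedTag: String, text: String): String =
    if (detectedTag.toLowerCase.startsWith("zh") && containsKana(text)) "ja"
    else detectedTag
}
```

With the sentence from the original report, `KanaCheck.preferJapanese("zh-CN", japanese)` would return `"ja"`, because the text contains both hiragana and katakana. The tag passed in is assumed to be the language string of the top detection result.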