Hi - We see this too with Japanese, where just a few kanji can spoil the 
detection. The only solution I see is creating a better model.

Markus

 
 
-----Original message-----
> From: ar...@codec.ai <ar...@codec.ai>
> Sent: Friday 6th April 2018 12:51
> To: user@tika.apache.org
> Subject: Re: Tika detects short Japanese sentences as Chinese
> 
> Hi Ken, yes it's OptimaizeLangDetector.
> Should I post it to optimaize mailing list?
> 
> On 2018/04/05 18:42:25, Ken Krugler <kkrugler_li...@transpac.com> wrote: 
> > Hi Artur,
> > 
> > Is the detector that you get back from getDefaultLanguageDetector the 
> > OptimaizeLangDetector?
> > 
> > — Ken
> > 
> > 
> > > On Apr 3, 2018, at 2:55 AM, Artur Rashitov <ar...@codec.ai> wrote:
> > > 
> > > Given the following code:
> > > 
> > > val japanese = "私はガラスを食べられます。それは私を傷つけません。"
> > > LanguageDetector.getDefaultLanguageDetector.loadModels().detectAll(japanese)
> > > 
> > > it produces [zh-CN: MEDIUM (0.579961), zh-TW: MEDIUM (0.405015)]
> > > And the same thing for many short Japanese sentences.
> > > 
> > > Apache Tika 1.17
> > 
> > --------------------------------------------
> > http://about.me/kkrugler
> > +1 530-210-6378
> > 
> > 
