RE: Tika detects short Japanese sentences as Chinese

Markus Jelsma Fri, 06 Apr 2018 05:52:21 -0700

Hi - We see this too with Japanese, where just a few kanji can spoil the 
detection. The only solution I see is creating a better model.
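
Until a better model exists, a possible stopgap is a script check in front of the detector: hiragana and katakana are used in Japanese but not in Chinese, so any kana in the input is strong evidence for Japanese. Below is a minimal sketch in Scala under that assumption. The names `KanaCheck`, `containsKana` and `detectWithKanaOverride` are hypothetical helpers, not part of Tika, and the detector is passed in as a plain function so the snippet has no Tika dependency.

```scala
import java.lang.Character.UnicodeScript

object KanaCheck {
  // True if the text contains at least one hiragana or katakana code point.
  // Kanji (script HAN) and punctuation such as "。" (script COMMON) do not count.
  def containsKana(text: String): Boolean =
    text.codePoints().toArray.exists { cp =>
      val script = UnicodeScript.of(cp)
      script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA
    }

  // Hypothetical wrapper: report "ja" whenever kana is present, otherwise
  // defer to whatever detector the caller supplies (assumed to return a language code).
  def detectWithKanaOverride(text: String, fallback: String => String): String =
    if (containsKana(text)) "ja" else fallback(text)
}
```

With Tika, the fallback could be something like `text => detector.detect(text).getLanguage`, where `detector` is the loaded `LanguageDetector`. This will not help with Japanese text written entirely in kanji, and it would mislabel Chinese text that quotes kana, so it is a workaround and not a replacement for a better model.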


Markus

 
 
-----Original message-----
> From:[email protected] <[email protected]>
> Sent: Friday 6th April 2018 12:51
> To: [email protected]
> Subject: Re: Tika detects short Japanese sentences as Chinese
> 
> Hi Ken, yes it's OptimaizeLangDetector.
> Should I post it to the optimaize mailing list?
> 
> On 2018/04/05 18:42:25, Ken Krugler <[email protected]> wrote: 
> > Hi Artur,
> > 
> > Is the detector that you get back from getDefaultLanguageDetector the 
> > OptimaizeLangDetector?
> > 
> > — Ken
> > 
> > 
> > > On Apr 3, 2018, at 2:55 AM, Artur Rashitov <[email protected]> wrote:
> > > 
> > > Given the following code:
> > > 
> > > val japanese = "私はガラスを食べられます。それは私を傷つけません。"
> > > LanguageDetector.getDefaultLanguageDetector.loadModels().detectAll(japanese)
> > > 
> > > it produces [zh-CN: MEDIUM (0.579961), zh-TW: MEDIUM (0.405015)]
> > > The same thing happens for many short Japanese sentences.
> > > 
> > > Apache Tika 1.17
> > 
> > --------------------------------------------
> > http://about.me/kkrugler
> > +1 530-210-6378
> > 
> >
