Re: LanguageIdentifier.isReasonablyCertain is always false

Nick Burch Wed, 04 Mar 2015 22:32:12 -0800

On Wed, 4 Mar 2015, Wilm Schumacher wrote:

I want to use the language detector for choosing the stemming in my fulltext search engine. My plan was to use the specific stemmer (e.g. "german2") if getLanguage returns "de". However, as getLanguage always returns something, e.g. "lt" for the content "abc", my plan was to stem with the specific stemmer if Tika is certain, and if not, not stem at all.
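
The fallback described above could be sketched roughly as follows. This is only an illustration: StemmerChooser and its mapping table are hypothetical names, not part of Tika, and only the "de" to "german2" pair comes from this thread.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika: picks a stemmer name only when
// the language detector reports that it is reasonably certain.
final class StemmerChooser {
    // Only the "de" -> "german2" pair is taken from the thread;
    // any further entries would be assumptions.
    private static final Map<String, String> STEMMERS = Map.of("de", "german2");

    // Returns the stemmer for the detected language, or empty when the
    // detector is unsure or no stemmer is configured for that language.
    static Optional<String> choose(String language, boolean reasonablyCertain) {
        if (!reasonablyCertain || language == null) {
            return Optional.empty(); // uncertain: do not stem at all
        }
        return Optional.ofNullable(STEMMERS.get(language));
    }
}
```

With Tika on the classpath, the two arguments would presumably come from identifier.getLanguage() and identifier.isReasonablyCertain() on the LanguageIdentifier shown further down.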

Generally, short phrases are hard to identify, as too many languages look similar when there is only a short bit of content to go on. Normally you need to give it a few KB of text.

From the javadocs of isReasonablyCertain():

WARNING: Will never return true for small amount of input texts.

I used the "declaration of human rights" in German, as suggested in the
book "Tika in Action". isReasonablyCertain = false.

I even used the book "Tika in Action" itself ;). getLanguage = en, but
isReasonablyCertain = false.


Hmm, I would've expected those two to work.

LanguageIdentifier identifier = new LanguageIdentifier( content );

Can you try stepping into that with a debugger, and see how the various standard language profiles it compares your text against come out for distance?
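
To make "distance" a bit more concrete, here is a simplified sketch of comparing two character-trigram frequency profiles. It is an illustration only: TrigramProfile is a hypothetical class, not Tika's own code, and Tika's real profiles, n-gram handling and certainty threshold may well differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not Tika's implementation: a profile of
// character trigram counts and a Euclidean distance between two profiles.
final class TrigramProfile {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    TrigramProfile(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        // Slide a three-character window over the text and count each trigram.
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
            total++;
        }
    }

    // Euclidean distance between the relative trigram frequencies of the
    // two profiles; 0.0 means the frequency distributions are identical.
    double distance(TrigramProfile other) {
        Set<String> all = new HashSet<>(counts.keySet());
        all.addAll(other.counts.keySet());
        double thisTotal = Math.max(total, 1);        // avoid dividing by zero
        double otherTotal = Math.max(other.total, 1);
        double sumOfSquares = 0.0;
        for (String gram : all) {
            double a = counts.getOrDefault(gram, 0) / thisTotal;
            double b = other.counts.getOrDefault(gram, 0) / otherTotal;
            sumOfSquares += (a - b) * (a - b);
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

In a scheme like this, an input such as "abc" contributes a single trigram, so its frequencies sit far away from any profile built from a large corpus. That would be consistent with short texts never being reported as reasonably certain.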


Thanks
Nick
