On Wed, 4 Mar 2015, Wilm Schumacher wrote:
I want to use the language detector to choose the stemmer in my full text search engine. My plan was to use the specific stemmer (e.g. "german2") if getLanguage returns "de". However, as getLanguage always returns something, e.g. "lt" for the content "abc", my plan was to stem with the specific stemmer only if Tika is certain, and otherwise not to stem at all.
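For illustration, the selection logic described above could look roughly like the sketch below. The class name, method name and the language-to-stemmer table are hypothetical (only "german2" comes from this thread); in real use the two arguments would be the results of getLanguage() and isReasonablyCertain().

```java
import java.util.Map;
import java.util.Optional;

final class StemmerChooser {
    // Hypothetical mapping from language codes to stemmer names.
    // Only "german2" is mentioned in the thread; "english" is an assumption.
    private static final Map<String, String> STEMMERS = Map.of(
            "de", "german2",
            "en", "english");

    // Returns a stemmer name only when the detector was certain and a
    // stemmer is configured for that language; otherwise returns empty,
    // meaning "do not stem at all".
    public static Optional<String> choose(String language, boolean reasonablyCertain) {
        if (!reasonablyCertain || language == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(STEMMERS.get(language));
    }
}
```

With this, choose("de", true) yields "german2", while an uncertain guess such as "lt" for "abc" yields an empty result and the text is left unstemmed.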
Generally, short phrases are hard to identify, as there are too many languages that look similar when you only have short bits of content. Normally you need to give it a few KB of text.
From the javadocs of isReasonablyCertain():
WARNING: Will never return true for small amount of input texts.
I used the "declaration of human rights" in German, as suggested in the book "Tika in Action". isReasonablyCertain = false. I even used the book "Tika in Action" itself ;). getLanguage = en, but isReasonablyCertain = false.
Hmm, I would've expected those two to work.
LanguageIdentifier identifier = new LanguageIdentifier( content );
Can you try stepping into that with a debugger, and see how the various standard language profiles it compares your text against come out for distance?
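To make "distance" a bit more concrete, here is a rough, self-contained sketch of comparing two character-trigram profiles. This is not Tika's actual code; the class and method names are made up, and the Euclidean distance over relative trigram frequencies is just an assumption about the general shape of such a comparison. A smaller value means the two texts look more alike.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TrigramDistance {
    // Count overlapping character trigrams in the lower-cased text.
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Total number of trigrams, at least 1 so that an empty profile
    // does not cause a division by zero.
    private static double total(Map<String, Integer> p) {
        int sum = 0;
        for (int c : p.values()) {
            sum += c;
        }
        return Math.max(1, sum);
    }

    // Euclidean distance between the relative trigram frequencies.
    public static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = total(a);
        double totalB = total(b);
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String k : keys) {
            double diff = a.getOrDefault(k, 0) / totalA - b.getOrDefault(k, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

A three-letter input like "abc" produces a profile with a single trigram, so its distance to any profile built from a large corpus stays large. That is one way to see why very short inputs do not come out as certain.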
Thanks Nick
