Hi Wilm,

Sorry for the long delay in following up - I finally got around to working on the issue of language identification in Tika.

Most of the work is happening as part of https://issues.apache.org/jira/browse/TIKA-1723, which integrates a third-party language identification package (language-detector).

This will solve the issue of isReasonablyCertain() always returning false... and I've added tests to confirm :)

Regards,

-- Ken

> From: Wilm Schumacher
> Sent: March 4, 2015 2:32:19pm PST
> To: [email protected]
> Subject: LanguageIdentifier.isReasonablyCertain is always false
>
> Hi,
>
> I'm very new to Tika and just started using it ... AND I LOVE IT!
>
> I want to use the language detector for choosing the stemming in my full
> text search engine. My plan was to use the specific stemmer (e.g.
> "german2") if getLanguage returns "de". However, as getLanguage always
> returns something, e.g. "lt" for the content "abc", my plan was to stem
> with the specific stemmer if Tika is certain, and if not, not to stem at all.
>
> However, with my first tests I found that
> LanguageIdentifier.isReasonablyCertain always returns false. I found
> some JIRA issues and comments about that, e.g.
> https://issues.apache.org/jira/browse/TIKA-568, but no real explanation
> or solution.
>
> I used some German "lorem ipsum" => isReasonablyCertain = false.
>
> I used the "declaration of human rights" in German, as suggested in the
> book "Tika in Action". isReasonablyCertain = false.
>
> I even used the book "Tika in Action" itself ;). getLanguage = en, but
> isReasonablyCertain = false.
>
> The latter two bug me, as both texts are well written in their resp.
> language and are reasonably big. Below is the code snippet I used for
> testing. Is something wrong with that? Or should I ignore
> isReasonablyCertain and find another way of detecting whether the
> getLanguage output should be trusted? Or should I always index stemmed
> and not stemmed as this question is not really answerable? Any insight
> is appreciated.
>
> Best wishes,
>
> Wilm
>
> ps: code snippet I used:
>
> ==
>
> String fileName = ...
>
> Tika tika = new Tika();
>
> InputStream is = new FileInputStream( fileName );
> String content = tika.parseToString( is );
>
> LanguageIdentifier identifier = new LanguageIdentifier( content );
>
> System.out.println( identifier.getLanguage() );
> System.out.println( identifier.isReasonablyCertain() );
>

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
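
For reference, the stemming decision described in the quoted message (use a language-specific stemmer only when the detector is certain, otherwise do not stem) can be sketched roughly as below. This is only an illustration, not Tika code: the StemmerChooser class and its choose method are hypothetical names, and the only mapping taken from the thread is "de" -> "german2". The two arguments would come from identifier.getLanguage() and identifier.isReasonablyCertain().

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration; it is not part of Tika.
final class StemmerChooser {

    // Language code -> stemmer name. Only "de" -> "german2" comes from the
    // thread; any further entries depend on the search engine being used.
    private static final Map<String, String> STEMMERS = Map.of("de", "german2");

    // Returns a stemmer name only when the detector was certain AND a stemmer
    // is configured for that language; an empty result means "do not stem".
    static Optional<String> choose(String language, boolean reasonablyCertain) {
        if (!reasonablyCertain || language == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(STEMMERS.get(language));
    }

    public static void main(String[] args) {
        System.out.println(choose("de", true));   // prints "Optional[german2]"
        System.out.println(choose("lt", false));  // prints "Optional.empty"
    }
}
```

Returning an empty Optional for both the "not certain" and the "no stemmer configured" cases keeps the caller's logic to a single check before falling back to unstemmed indexing.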
