Hi, I'm very new to Tika and have just started using it ... AND I LOVE IT!
I want to use the language detector to choose the stemmer in my full-text search engine. My first plan was to use the specific stemmer (e.g. "german2") if getLanguage returns "de". However, getLanguage always returns something, e.g. "lt" for the content "abc", so my revised plan was to use the specific stemmer only if Tika is certain, and otherwise not to stem at all.

In my first tests, though, I found that LanguageIdentifier.isReasonablyCertain always returns false. I found some JIRA issues and comments about that, e.g. https://issues.apache.org/jira/browse/TIKA-568, but no real explanation or solution.

I used some German "lorem ipsum" => isReasonablyCertain = false.
I used the "Declaration of Human Rights" in German, as suggested in the book "Tika in Action" => isReasonablyCertain = false.
I even used the book "Tika in Action" itself ;) => getLanguage = en, but isReasonablyCertain = false.

The latter two bug me, as both texts are well written in their respective languages and are reasonably long.

Below is the code snippet I used for testing. Is something wrong with it? Or should I ignore isReasonablyCertain and find another way of detecting whether the getLanguage output can be trusted? Or should I always index both stemmed and unstemmed, as this question is not really answerable?

Any insight is appreciated.

Best wishes,
Wilm

ps: code snippet I used:
==
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

String fileName = ...
Tika tika = new Tika();
// Extract the plain text of the file
InputStream is = new FileInputStream( fileName );
String content = tika.parseToString( is );
// Detect the language of the extracted text
LanguageIdentifier identifier = new LanguageIdentifier( content );
System.out.println( identifier.getLanguage() );
System.out.println( identifier.isReasonablyCertain() );
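
To make the plan concrete, the fallback I have in mind looks roughly like the sketch below. It is only an illustration: StemmerChooser and chooseStemmer are made-up names, not part of Tika, and the boolean stands for the result of isReasonablyCertain (or whatever check ends up replacing it). Only the "de" -> "german2" mapping comes from my setup; other languages would need their own entries.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika: picks a stemmer name only when
// the language detection is considered trustworthy.
class StemmerChooser {

    // Language code -> stemmer name. Only "de" -> "german2" is taken from
    // the post; further entries would be added per supported language.
    private static final Map<String, String> STEMMERS = Map.of("de", "german2");

    // Returns the stemmer for the detected language, or empty if the
    // detection is uncertain or no stemmer is configured (i.e. do not stem).
    static Optional<String> chooseStemmer(String language, boolean certain) {
        if (!certain || language == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(STEMMERS.get(language));
    }

    public static void main(String[] args) {
        System.out.println(chooseStemmer("de", true));   // certain German
        System.out.println(chooseStemmer("de", false));  // uncertain: no stemming
        System.out.println(chooseStemmer("lt", true));   // no stemmer configured
    }
}
```

With isReasonablyCertain always returning false, this would of course never stem anything, which is exactly my problem.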
