Hi,

I'm very new to Tika and just started using it ... AND I LOVE IT!

I want to use the language detector to choose the stemmer in my full
text search engine. My plan was to use the specific stemmer (e.g.
"german2") if getLanguage returns "de". However, as getLanguage always
returns something, e.g. "lt" for the content "abc", my plan was to stem
with the specific stemmer only if Tika is certain, and otherwise not to
stem at all.
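
To make the plan concrete, here is a rough sketch of the selection logic
I have in mind. It is only an illustration: the class StemmerChooser, the
method chooseStemmer and the "english" entry are made-up placeholders, and
the boolean stands in for whatever certainty check ends up being usable.

```java
import java.util.Map;
import java.util.Optional;

class StemmerChooser {

    // Hypothetical mapping from detected language code to stemmer name.
    // Only "de" -> "german2" comes from my setup; "english" is a placeholder.
    private static final Map<String, String> STEMMERS = Map.of(
            "de", "german2",
            "en", "english");

    // Returns a stemmer name only if the detection is trusted and a stemmer
    // is known for that language; otherwise empty, meaning "do not stem".
    static Optional<String> chooseStemmer(String language, boolean certain) {
        if (!certain) {
            return Optional.empty();
        }
        return Optional.ofNullable(STEMMERS.get(language));
    }

    public static void main(String[] args) {
        System.out.println(chooseStemmer("de", true));
        System.out.println(chooseStemmer("lt", false));
    }
}
```

With this, a guess like "lt" for the content "abc" would simply fall
through to "no stemming", provided the certainty flag can be trusted.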

However, in my first tests I found that
LanguageIdentifier.isReasonablyCertain always returns false. I found
some JIRA issues and comments about that, e.g.
https://issues.apache.org/jira/browse/TIKA-568, but no real explanation
or solution.

I used some German "lorem ipsum" => isReasonablyCertain = false.

I used the "Declaration of Human Rights" in German, as suggested in the
book "Tika in Action". isReasonablyCertain = false.

I even used the book "Tika in Action" itself ;). getLanguage = en, but
isReasonablyCertain = false.

The latter two bug me, as both texts are well written in their
respective languages and are reasonably large. Below is the code snippet
I used for testing. Is something wrong with it? Or should I ignore
isReasonablyCertain and find another way of detecting whether the
getLanguage output should be trusted? Or should I always index both
stemmed and unstemmed text, as this question is not really answerable?
Any insight is appreciated.

Best wishes,

Wilm

PS: the code snippet I used:

==

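import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.language.LanguageIdentifier;

String fileName = ...

Tika tika = new Tika();

// Extract the plain text; try-with-resources closes the stream.
String content;
try ( InputStream is = new FileInputStream( fileName ) ) {
    content = tika.parseToString( is );
}

// Detect the language and ask whether the result is considered certain.
LanguageIdentifier identifier = new LanguageIdentifier( content );

System.out.println( identifier.getLanguage() );
System.out.println( identifier.isReasonablyCertain() );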
