Hi Wilm,

Sorry for the long delay in following up - I finally got around to working on 
the issue of language identification in Tika.

Most of the work is happening as part of 
https://issues.apache.org/jira/browse/TIKA-1723, which integrates a third-party 
language identification package (language-detector).

This will solve the issue of isReasonablyCertain() always returning false, and 
I've added tests to confirm :)
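
For the stemming use case in the original question quoted below, here is a
minimal sketch of the "stem only when the detector is certain" fallback. It
deliberately does not call Tika: the class StemmerChooser, the method
chooseStemmer, and the "en" -> "english" entry are hypothetical names made up
for illustration. Only "de" -> "german2" comes from the question. The two
arguments stand in for the results of getLanguage() and isReasonablyCertain().

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika: picks a stemmer only when the
// language detector reports a confident result.
final class StemmerChooser {

    // Assumed mapping from language codes to stemmer names. Only
    // "de" -> "german2" is taken from the original question.
    private static final Map<String, String> STEMMERS = Map.of(
            "de", "german2",
            "en", "english");

    // Returns a stemmer name, or empty when the text should not be stemmed
    // (uncertain detection, missing language, or no stemmer for the language).
    static Optional<String> chooseStemmer(String language, boolean reasonablyCertain) {
        if (!reasonablyCertain || language == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(STEMMERS.get(language));
    }

    public static void main(String[] args) {
        // Confident German detection: use the German stemmer.
        System.out.println(chooseStemmer("de", true).orElse("(no stemming)"));
        // Uncertain guess such as "lt" for the content "abc": skip stemming.
        System.out.println(chooseStemmer("lt", false).orElse("(no stemming)"));
    }
}
```

With this shape, an uncertain result and an unsupported language are handled
the same way, so the indexing code only needs to check whether a stemmer name
came back.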

Regards,

-- Ken

> From: Wilm Schumacher
> Sent: March 4, 2015 2:32:19pm PST
> To: [email protected]
> Subject: LanguageIdentifier.isReasonablyCertain is always false
> 
> Hi,
> 
> I'm very new to Tika and just started using it ... AND I LOVE IT!
> 
> I want to use the language detector to choose the stemmer in my full
> text search engine. My plan was to use the specific stemmer (e.g.
> "german2") if getLanguage returns "de". However, as getLanguage always
> returns something, e.g. "lt" for the content "abc", my plan was to stem
> with the specific stemmer if Tika is certain, and otherwise not stem at all.
> 
> However, in my first tests I found that
> LanguageIdentifier.isReasonablyCertain always returns false. I found
> some JIRA issues and comments about that, e.g.
> https://issues.apache.org/jira/browse/TIKA-568, but no real explanation
> or solution.
> 
> I used some German "lorem ipsum" => isReasonablyCertain = false.
> 
> I used the "Declaration of Human Rights" in German, as suggested in the
> book "Tika in Action". isReasonablyCertain = false.
> 
> I even used the book "Tika in Action" itself ;). getLanguage = en, but
> isReasonablyCertain = false.
> 
> The latter two bug me, as both texts are well written in their respective
> languages and are reasonably big. Below is the code snippet I used for
> testing. Is something wrong with it? Or should I ignore
> isReasonablyCertain and find another way of detecting whether the
> getLanguage output should be trusted? Or should I always index both stemmed
> and unstemmed text, as this question is not really answerable? Any insight
> is appreciated.
> 
> Best wishes,
> 
> Wilm
> 
> PS: the code snippet I used:
> 
> ==
> 
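> import java.io.FileInputStream;
> import java.io.InputStream;
> import org.apache.tika.Tika;
> import org.apache.tika.language.LanguageIdentifier;
> 
> String fileName = ...
> 
> Tika tika = new Tika();
> 
> // Extract the plain text; parseToString closes the stream when done.
> InputStream is = new FileInputStream(fileName);
> String content = tika.parseToString(is);
> 
> // Build a language profile from the extracted text.
> LanguageIdentifier identifier = new LanguageIdentifier(content);
> 
> System.out.println(identifier.getLanguage());
> System.out.println(identifier.isReasonablyCertain());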
> 




--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




