[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702953#action_12702953 ]

Jukka Zitting commented on TIKA-209:
------------------------------------

The getConfidence() method in CharsetMatch is for the confidence level of the 
character encoding detection, not of the language detection.

I'm not sure if ICU4J has an easy way to determine the confidence level of 
language detection.
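
For illustration, the threshold check requested in the issue could be sketched roughly as below. This is only a minimal sketch: the Match class is a hypothetical stand-in for the three values read from ICU4J's CharsetMatch (charset name, language, confidence), and LanguageGate, languageIfConfident and the threshold value are made-up names, not Tika or ICU4J API. Note that the confidence being tested is still the charset confidence, so this filters out weak charset guesses rather than measuring how certain the language is.

```java
import java.util.Optional;

public class LanguageGate {

    /** Hypothetical stand-in for the values read from a CharsetMatch. */
    public static final class Match {
        final String charset;    // detected charset name
        final String language;   // language implied by the charset, may be null
        final int confidence;    // 0-100, confidence of the charset detection

        public Match(String charset, String language, int confidence) {
            this.charset = charset;
            this.language = language;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the implied language only when the charset confidence is
     * strictly above the threshold; otherwise returns an empty Optional.
     */
    public static Optional<String> languageIfConfident(Match match, int threshold) {
        if (match == null || match.language == null) {
            return Optional.empty();
        }
        return match.confidence > threshold
                ? Optional.of(match.language)
                : Optional.empty();
    }

    public static void main(String[] args) {
        Match weak = new Match("ISO-8859-1", "fr", 15);
        Match strong = new Match("ISO-8859-1", "fr", 80);
        System.out.println(languageIfConfident(weak, 50));   // Optional.empty
        System.out.println(languageIfConfident(strong, 50)); // Optional[fr]
    }
}
```

With a gate like this, a caller would set the language metadata only when the returned Optional is present, so a short input such as "hello" with a low-confidence match would be left untagged.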

Robert: Do you know how LanguageIdentifier differs from the language 
detection in ICU4J?

> Language detection is weak.
> ---------------------------
>
>                 Key: TIKA-209
>                 URL: https://issues.apache.org/jira/browse/TIKA-209
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.3
>            Reporter: Robert Newson
>
> In org.apache.tika.utils.Utils, the getUTF8Reader method assigns a language 
> determination without checking the confidence rating from ICU's 
> CharsetDetector.
> Please add a configurable level (0-100):
> if (language != null && match.getConfidence() > THRESHOLD) {
>   metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
>   metadata.set(Metadata.LANGUAGE, match.getLanguage());
> }
> Obviously, using charset to imply language is generally weak, but it would be 
> sufficient if the confidence threshold were configurable. Today, the text 
> "hello" is tagged as French, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
