[ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688660#action_12688660 ]
Robert Newson commented on TIKA-209:
------------------------------------

FYI: In my project (couchdb-lucene) I've pulled in the ngram-based LanguageIdentifier from Nutch 0.9. Since it's Apache 2 licensed, it might be something worth integrating with Tika directly?

> Language detection is weak.
> ---------------------------
>
>                 Key: TIKA-209
>                 URL: https://issues.apache.org/jira/browse/TIKA-209
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.3
>            Reporter: Robert Newson
>
> in org.apache.tika.utils.Utils the getUTF8Reader method assigns a language
> determination without checking the confidence rating from ICU's
> CharsetDetector.
> Please add a configurable level (0-100);
> if (language != null && match.getConfidence() > THRESHOLD) {
>   metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
>   metadata.set(Metadata.LANGUAGE, match.getLanguage());
> }
> Obviously using charset to imply language is generally weak but it would be
> sufficient if the confidence threshold was controlled. Today, the text
> "hello" is tagged as French, for example.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
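
For illustration, the threshold check requested in the issue description could look roughly like the sketch below. This is not Tika code: the class LanguageThreshold, the Detection type, the applyLanguage method and the metadata key strings are all hypothetical stand-ins, and a plain Map replaces Tika's Metadata. Detection only mirrors the two values the issue reads from ICU's match object (getLanguage() and getConfidence(), the latter on a 0-100 scale). The confidence numbers in main are made up.

```java
import java.util.HashMap;
import java.util.Map;

public class LanguageThreshold {

    /** Hypothetical stand-in for a detector result: a language code plus a 0-100 confidence. */
    public static final class Detection {
        private final String language;
        private final int confidence;

        public Detection(String language, int confidence) {
            this.language = language;
            this.confidence = confidence;
        }

        public String getLanguage() { return language; }
        public int getConfidence() { return confidence; }
    }

    // Placeholder metadata keys (assumptions, standing in for the Metadata constants).
    public static final String CONTENT_LANGUAGE = "Content-Language";
    public static final String LANGUAGE = "language";

    /**
     * Records the detected language only when the confidence is strictly
     * above the configured threshold, as in the snippet from the issue.
     * Returns true if the metadata was updated.
     */
    public static boolean applyLanguage(Map<String, String> metadata,
                                        Detection match,
                                        int threshold) {
        if (threshold < 0 || threshold > 100) {
            throw new IllegalArgumentException("threshold must be in 0-100: " + threshold);
        }
        if (match == null || match.getLanguage() == null
                || match.getConfidence() <= threshold) {
            return false; // too uncertain: leave the language unset
        }
        metadata.put(CONTENT_LANGUAGE, match.getLanguage());
        metadata.put(LANGUAGE, match.getLanguage());
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // A weak guess (made-up confidence 10) is ignored at threshold 50.
        System.out.println(applyLanguage(metadata, new Detection("fr", 10), 50)); // false
        // A strong guess (made-up confidence 80) is recorded.
        System.out.println(applyLanguage(metadata, new Detection("en", 80), 50)); // true
    }
}
```

With a check like this, a short input such as "hello" would simply get no language tag when the detector's confidence is low, rather than a wrong one.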