Language detection is weak.
---------------------------

Key: TIKA-209
URL: https://issues.apache.org/jira/browse/TIKA-209
Project: Tika
Issue Type: Bug
Affects Versions: 0.3
Reporter: Robert Newson

In org.apache.tika.utils.Utils, the getUTF8Reader method assigns a language without checking the confidence rating from ICU's CharsetDetector. Please add a configurable threshold (0-100):

    if (language != null && match.getConfidence() > THRESHOLD) {
        metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
        metadata.set(Metadata.LANGUAGE, match.getLanguage());
    }

Obviously, using the charset to imply the language is generally weak, but it would be sufficient if the confidence threshold were configurable. Today, for example, the text "hello" is tagged as French.
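
For illustration only, the requested check could be sketched as below. This is not Tika's code: DetectedMatch is a hypothetical stand-in for ICU's CharsetMatch (only its getConfidence() and getLanguage() accessors are mirrored), a plain Map stands in for Tika's Metadata, and the key strings, the LanguageTagger class and the match() helper are assumptions made so the sketch compiles without ICU or Tika on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ICU's CharsetMatch; only the two accessors
// used in this issue are mirrored here.
interface DetectedMatch {
    int getConfidence();   // 0-100, higher means more certain
    String getLanguage();  // language code, or null when unknown
}

final class LanguageTagger {
    // Key names are assumptions for this sketch, not taken from Tika.
    static final String CONTENT_LANGUAGE = "Content-Language";
    static final String LANGUAGE = "language";

    private final int threshold;

    // The threshold is the configurable level (0-100) requested in the issue.
    LanguageTagger(int threshold) {
        if (threshold < 0 || threshold > 100) {
            throw new IllegalArgumentException("threshold must be in 0-100");
        }
        this.threshold = threshold;
    }

    // Sets the language keys only when the match is confident enough.
    // Returns true if the metadata was updated.
    boolean tag(DetectedMatch match, Map<String, String> metadata) {
        String language = match.getLanguage();
        if (language != null && match.getConfidence() > threshold) {
            metadata.put(CONTENT_LANGUAGE, language);
            metadata.put(LANGUAGE, language);
            return true;
        }
        return false;
    }

    // Hypothetical helper that builds a fixed match for demonstration.
    static DetectedMatch match(final String language, final int confidence) {
        return new DetectedMatch() {
            public int getConfidence() { return confidence; }
            public String getLanguage() { return language; }
        };
    }

    public static void main(String[] args) {
        LanguageTagger tagger = new LanguageTagger(50);
        Map<String, String> metadata = new HashMap<>();
        // The confidence values below are made up for the example.
        tagger.tag(match("fr", 10), metadata);  // below threshold, ignored
        tagger.tag(match("en", 90), metadata);  // above threshold, recorded
        System.out.println(metadata.get(LANGUAGE));
    }
}
```

With a threshold of 50, a low-confidence guess leaves the metadata untouched, while a high-confidence one sets both keys. In a real patch the threshold would presumably come from Tika's configuration rather than a constructor argument.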