[ https://issues.apache.org/jira/browse/TIKA-339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler updated TIKA-339:
-----------------------------

    Attachment: TIKA-339.patch

Patch that uses metadata.add(CharsetDetector language) in the TXTParser, and skips this entirely in the HtmlParser. Plus tests.

> HtmlParser & TXTParser should not use language returned by CharsetDetector if language hint has been provided
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-339
>                 URL: https://issues.apache.org/jira/browse/TIKA-339
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: TIKA-339.patch
>
>
> Currently, the code that calls CharsetDetector in both TXTParser and
> HtmlParser replaces any incoming language in the metadata map whenever
> the detector returns a language.
> Given the low reliability of the detector's language result, it should be
> used only when no language has been provided. A provided language typically
> comes from the HTTP response header or (for the HtmlParser) a meta tag or
> some other tag attribute. In all those cases, the incoming language is more
> accurate than the guess by the CharsetDetector.
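
The rule proposed in the description (use the detector's language only when no hint was provided) could be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: a plain Map stands in for the metadata map, and the class name, the applyDetectedLanguage helper, and the LANGUAGE_KEY constant are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for the parser's metadata map.
class LanguageHintSketch {
    // Assumed metadata key, chosen for this illustration only.
    static final String LANGUAGE_KEY = "Content-Language";

    // Stores the detector's guess only when the caller supplied no language hint.
    static void applyDetectedLanguage(Map<String, String> metadata, String detectedLanguage) {
        if (detectedLanguage == null || detectedLanguage.isEmpty()) {
            return; // the detector had no guess, so there is nothing to store
        }
        String provided = metadata.get(LANGUAGE_KEY);
        if (provided == null || provided.isEmpty()) {
            metadata.put(LANGUAGE_KEY, detectedLanguage);
        }
        // Otherwise keep the provided hint; it is assumed to be more reliable.
    }

    public static void main(String[] args) {
        Map<String, String> withHint = new HashMap<>();
        withHint.put(LANGUAGE_KEY, "fr"); // e.g. taken from a response header
        applyDetectedLanguage(withHint, "en");
        System.out.println(withHint.get(LANGUAGE_KEY)); // prints "fr"

        Map<String, String> noHint = new HashMap<>();
        applyDetectedLanguage(noHint, "en");
        System.out.println(noHint.get(LANGUAGE_KEY)); // prints "en"
    }
}
```

The check treats an empty string the same as a missing value, which is an assumption about how an absent hint would be represented.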