[ https://issues.apache.org/jira/browse/TIKA-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-344.
--------------------------------

    Resolution: Duplicate

Resolving this as a duplicate of all the related and more specific issues filed by Ken. It looks like after applying all his patches we've pretty much covered the use case expressed here.

> Charset hint in metadata
> ------------------------
>
>                 Key: TIKA-344
>                 URL: https://issues.apache.org/jira/browse/TIKA-344
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.6
>            Reporter: Piotr B.
>            Priority: Minor
>
> It would be nice if TextParser and HtmlParser supported the
> Metadata.CONTENT_ENCODING hint.
> In my application I always prefer that hint (if it is present) over the
> charset detector result, because the charset detector is often wrong on short
> inputs (even if match.confidence is 100) and I know that the hint, if present,
> is right 99% of the time.
> To be more general, the user might be able to change the default behaviour by
> overriding a function F(hint, detectorResults) -> charset.
> Another solution is to create some standard strategies and let the user choose
> one of them:
> a) hint is most important
> b) charset detector result is most important
> c) create some heuristic using detectorResult.confidence, hint and maybe
> input length
> Maybe the last heuristic method would be good enough for most cases.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
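
For illustration only, the F(hint, detectorResults) -> charset hook and the three strategies (a), (b) and (c) from the quoted description could be sketched roughly as below. This is not Tika code: CharsetChoice, Detection, Strategy and the two threshold parameters are hypothetical names made up for this sketch, and the confidence value is assumed to be on a 0-100 scale as the description suggests.

```java
import java.nio.charset.Charset;

// Hypothetical sketch; none of these types exist in Tika.
final class CharsetChoice {

    /** Assumed detector output: a charset guess plus a 0-100 confidence. */
    static final class Detection {
        final Charset charset;
        final int confidence;

        Detection(Charset charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    /** The F(hint, detectorResults) -> charset hook; hint or detection may be null. */
    interface Strategy {
        Charset choose(Charset hint, Detection detection, long inputLength);
    }

    // (a) The hint wins whenever it is present.
    static final Strategy PREFER_HINT = (hint, d, len) ->
            hint != null ? hint : (d != null ? d.charset : null);

    // (b) The detector wins whenever it produced a result.
    static final Strategy PREFER_DETECTOR = (hint, d, len) ->
            d != null ? d.charset : hint;

    // (c) Trust the detector only on long enough input with high enough
    //     confidence; otherwise fall back to the hint if there is one.
    static Strategy heuristic(int minConfidence, long minLength) {
        return (hint, d, len) -> {
            if (d == null) {
                return hint;
            }
            boolean trusted = d.confidence >= minConfidence && len >= minLength;
            return (trusted || hint == null) ? d.charset : hint;
        };
    }

    private CharsetChoice() {}
}
```

A caller would pick one Strategy and pass it the hint and the detector result. With the heuristic, a short input keeps the hint even when the detector reports a confidence of 100, which is the case the reporter describes.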