[ https://issues.apache.org/jira/browse/TIKA-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789806#action_12789806 ]
Ken Krugler commented on TIKA-344:
----------------------------------

It would be useful for various detectors of charset & language to be able to (a) use different metadata keys for their results, and (b) include a confidence level. That way you could have a top-level resolver that combined the results with all knowledge, including incoming hints, to pick the best result.

Though note that for HTML pages, there's a patch to use the charset found in meta tags, which is usually pretty good (and definitely better than the server response header charset or auto-detected charset). See https://issues.apache.org/jira/browse/TIKA-332, as well as:

https://issues.apache.org/jira/browse/TIKA-333
https://issues.apache.org/jira/browse/TIKA-334
https://issues.apache.org/jira/browse/TIKA-335
https://issues.apache.org/jira/browse/TIKA-341

> Charset hint in metadata
> ------------------------
>
>                 Key: TIKA-344
>                 URL: https://issues.apache.org/jira/browse/TIKA-344
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.6
>            Reporter: Piotr B.
>            Priority: Minor
>
> It would be nice if TextParser and HtmlParser supported the
> Metadata.CONTENT_ENCODING hint.
> In my application I always prefer that hint (if it is present) over the
> charset detector result, because the charset detector is often wrong on short
> inputs (even if match.confidence is 100) and I know that the hint, if present,
> is right in 99% of cases.
> To be more general, the user might be able to change the default behaviour by
> overriding a function F(hint, detectorResults) -> charset.
> Another solution is to create some standard strategies and let the user choose
> one of them:
> a) hint is most important
> b) charset detector result is most important
> c) create some heuristic using detectorResult.confidence, hint and maybe
> input length
> Maybe the last heuristic method would be good enough for most cases.
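
For illustration, the F(hint, detectorResults) -> charset idea and the three strategies a) to c) from the issue description could look roughly like the sketch below. Nothing in it is Tika API: `CharsetResolver`, `Candidate`, `Strategy`, and the `minLength` / `minConfidence` thresholds are hypothetical names introduced only for this example, and confidence is assumed to be an integer from 0 to 100 as in the `match.confidence` value mentioned above.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch only; none of these types exist in Tika. */
final class CharsetResolver {

    /** One detector's guess: a charset name plus a confidence in [0, 100] (assumed scale). */
    static final class Candidate {
        final String charset;
        final int confidence;

        Candidate(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    /** The overridable F(hint, detectorResults) -> charset hook, with input length added for (c). */
    interface Strategy {
        Optional<String> resolve(Optional<String> hint, List<Candidate> detected, int inputLength);
    }

    /** Highest-confidence candidate, or empty if the detectors returned nothing. */
    static Optional<Candidate> top(List<Candidate> detected) {
        return detected.stream().max(Comparator.comparingInt((Candidate c) -> c.confidence));
    }

    /** (a) The hint wins whenever it is present; otherwise fall back to the best detector guess. */
    static final Strategy HINT_FIRST = (hint, detected, len) ->
            hint.isPresent() ? hint : top(detected).map(c -> c.charset);

    /** (b) The best detector guess wins; the hint is used only if detection found nothing. */
    static final Strategy DETECTOR_FIRST = (hint, detected, len) -> {
        Optional<String> best = top(detected).map(c -> c.charset);
        return best.isPresent() ? best : hint;
    };

    /**
     * (c) Trust the detector only when the input is long enough and the guess is
     * confident enough; otherwise prefer the hint. Both thresholds are assumptions.
     */
    static Strategy heuristic(int minLength, int minConfidence) {
        return (hint, detected, len) -> {
            Optional<Candidate> best = top(detected);
            boolean trusted = best.isPresent()
                    && len >= minLength
                    && best.get().confidence >= minConfidence;
            if (trusted || !hint.isPresent()) {
                return best.map(c -> c.charset);
            }
            return hint;
        };
    }
}
```

With a strategy such as `CharsetResolver.heuristic(1024, 80)` (arbitrary thresholds), a 12-byte input with a hint resolves to the hint even if a detector reports confidence 100, which matches the short-input concern in the description; the same detector result on a 5000-byte input would win over the hint.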