Use http-equiv meta tag charset info when processing HTML documents -------------------------------------------------------------------
                 Key: TIKA-332
                 URL: https://issues.apache.org/jira/browse/TIKA-332
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.5
            Reporter: Ken Krugler
            Priority: Critical


Currently Tika doesn't use the charset info that's optionally present in HTML documents via the <meta http-equiv="Content-Type" content="text/html; charset=xxx"> tag.

If the mime-type is detected as one that's handled by the HtmlParser, then the first 4-8K of the document should be converted from bytes to us-ascii and then scanned using a regex something like:

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type\\s*['\"]\\s+content\\s*=\\s*"
        + "['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]",
        Pattern.CASE_INSENSITIVE);

If a charset is detected this way, it should take precedence over a charset in the HTTP response headers, and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.

In a test I did of 100 random HTML pages, roughly 15% contained charset info in the meta tag that wound up being different from the detected or HTTP-response-header charset, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.

I believe one of the reasons why ICU4J doesn't do a good job of detecting the charset for HTML pages is that the first 2K+ of HTML text is often all us-ascii markup rather than real content. I'll file a separate issue about improving charset detection for HTML pages.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
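The detection step described above can be sketched as a small, self-contained helper. This is not actual Tika code; the class name, method name, and the 8K sniff length are illustrative, and the regex follows the pattern proposed in the issue:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {

    // Only scan the start of the document, where the meta tag should appear.
    private static final int SNIFF_LENGTH = 8192;

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type\\s*['\"]\\s+content\\s*=\\s*"
            + "['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset declared in an http-equiv meta tag,
     * or null if no such declaration is found in the first 8K.
     */
    public static String sniffCharset(byte[] content) {
        int n = Math.min(content.length, SNIFF_LENGTH);
        // Decode the prefix as us-ascii; the markup we scan for is ASCII-safe.
        String head = new String(content, 0, n, StandardCharsets.US_ASCII);
        Matcher m = HTTP_EQUIV_CHARSET_PATTERN.matcher(head);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\"></head></html>")
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniffCharset(html)); // prints ISO-8859-1
    }
}
```

A caller would then prefer this result over the HTTP Content-Type header's charset (if any) when choosing the decoder for the full document.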