[ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler updated TIKA-332:
-----------------------------

Description:

Currently Tika doesn't use the charset info that's optionally present in HTML documents via the <meta http-equiv="Content-type" content="text/html; charset=xxx"> tag.

If the mime type is detected as one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii and then scanned using a regex something like:

private static final Pattern HTTP_EQUIV_CHARSET_PATTERN =
    Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)\"");

If a charset is detected, it should take precedence over any charset in the HTTP response headers and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.

In a test I ran on 100 random HTML pages, roughly 15% declared a charset in the meta tag that differed from the detected charset or the one in the HTTP response headers, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.

A separate problem is that the HtmlParser code doesn't use the CharsetDetector, which is another source of incorrect text. I'll file a separate issue about that.

was:

Currently Tika doesn't use the charset info that's optionally present in HTML documents via the <meta http-equiv="Content-type" content="text/html; charset=xxx"> tag.

If the mime type is detected as one that's handled by the HtmlParser, then the first 4-8K of text should be converted from bytes to us-ascii and then scanned using a regex something like:

private static final Pattern HTTP_EQUIV_CHARSET_PATTERN =
    Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)\"");

If a charset is detected, it should take precedence over any charset in the HTTP response headers and (obviously) be used to convert the bytes to text before the actual parsing of the document begins.

In a test I ran on 100 random HTML pages, roughly 15% declared a charset in the meta tag that differed from the detected charset or the one in the HTTP response headers, so this is a pretty important improvement to make. Without it, Tika isn't that useful for processing HTML pages.

I believe one of the reasons ICU4J doesn't do a good job of detecting the charset for HTML pages is that the first 2K+ of HTML text is often all us-ascii markup rather than real content. I'll file a separate issue about how to improve charset detection for HTML pages.
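To make the proposal concrete, here is a rough sketch of what such a sniffer could look like. This is illustrative only, not Tika API: the class and method names are hypothetical, the pattern is the one above lightly adapted (compiled case-insensitively so "Content-type" also matches, and accepting either quote style around the charset value), and it assumes the input stream supports mark/reset so the real parse can restart at byte 0.

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Tika API: scans the head of an
// HTML byte stream for a <meta http-equiv="Content-Type" ...> charset.
public class MetaCharsetSniffer {

    // Proposed pattern, adapted: case-insensitive, and the charset value
    // may be closed by either a single or a double quote.
    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content"
            + "\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]",
            Pattern.CASE_INSENSITIVE);

    // How much of the document head to scan (the 4-8K mentioned above).
    private static final int SNIFF_LENGTH = 8192;

    /**
     * Returns the charset name declared in a meta http-equiv tag, or
     * null if none is found in the first SNIFF_LENGTH bytes.
     * The stream must support mark/reset.
     */
    public static String sniffCharset(InputStream stream) throws IOException {
        stream.mark(SNIFF_LENGTH);
        byte[] buffer = new byte[SNIFF_LENGTH];
        int n = 0;
        // Fill the buffer; read() may return short counts.
        while (n < buffer.length) {
            int count = stream.read(buffer, n, buffer.length - n);
            if (count == -1) {
                break; // end of stream before the buffer filled up
            }
            n += count;
        }
        stream.reset(); // rewind so parsing can start over at byte 0

        // Decode as us-ascii: the markup of interest is pure ASCII, and
        // non-ASCII bytes simply won't take part in a match.
        String head = new String(buffer, 0, n, "US-ASCII");
        Matcher matcher = HTTP_EQUIV_CHARSET_PATTERN.matcher(head);
        return matcher.find() ? matcher.group(1).trim() : null;
    }
}

A caller would then compare the result against any charset from the HTTP response headers and prefer the meta value when the two disagree, per the precedence rule described above.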
> Use http-equiv meta tag charset info when processing HTML documents
> -------------------------------------------------------------------
>
>                 Key: TIKA-332
>                 URL: https://issues.apache.org/jira/browse/TIKA-332
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Critical
>

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.