[jira] Commented: (TIKA-332) Use http-equiv meta tag charset info when processing HTML documents

Ken Krugler (JIRA) Wed, 25 Nov 2009 10:39:04 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782550#action_12782550
 ]


Ken Krugler commented on TIKA-332:
----------------------------------

It turns out the HtmlParser code doesn't even use the CharsetDetector support - 
this is only being used by the TXTParser, as far as I can tell (and incorrectly 
at that).


> Use http-equiv meta tag charset info when processing HTML documents
> -------------------------------------------------------------------
>
>                 Key: TIKA-332
>                 URL: https://issues.apache.org/jira/browse/TIKA-332
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Critical
>
> Currently Tika doesn't use the charset info that's optionally present in HTML 
> documents, via the <meta http-equiv="Content-type" content="text/html; 
> charset=xxx"> tag.
> If the mime-type is detected as being one that's handled by the HtmlParser, 
> then the first 4-8K of text should be converted from bytes to us-ascii, and 
> then scanned using a regex something like:
>     private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = 
> Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]");
> If a charset is detected, this should take precedence over a charset in the 
> HTTP response headers, and (obviously) be used to convert the bytes to text 
> before the actual parsing of the document begins.
> In a test I did of 100 random HTML pages, roughly 15% contained charset info 
> in the meta tag that wound up being different from the detected or HTTP 
> response header charset, so this is a pretty important improvement to make. 
> Without it, Tika isn't that useful for processing HTML pages.
> I believe one of the reasons why ICU4J doesn't do a good job in detecting the 
> charset for HTML pages is that the first 2K+ of HTML text is often all 
> us-ascii markup, versus real content. I'll file a separate issue about how to 
> improve charset detection for HTML pages.
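
For illustration only, the approach described in the issue could be sketched roughly as below. This is not Tika code: the class name MetaCharsetSniffer, the method sniff, and the 8192-byte prefix size are all hypothetical choices made for this sketch. The pattern is a variant of the one quoted above, assuming case-insensitive matching is wanted and that either quote style may close the content attribute.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika.
final class MetaCharsetSniffer {

    // Assumed prefix size; the issue suggests scanning the first 4-8K.
    private static final int PREFIX_BYTES = 8192;

    // Variant of the pattern from the issue: case-insensitive, and the
    // content attribute may be closed by either quote character.
    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type\\s*['\"]\\s+"
            + "content\\s*=\\s*['\"][^;'\"]+;\\s*charset\\s*=\\s*([^'\"\\s]+)\\s*['\"]",
            Pattern.CASE_INSENSITIVE);

    private MetaCharsetSniffer() {
    }

    // Returns the charset named in the http-equiv meta tag, if one is found
    // in the document prefix and the JVM supports it; otherwise empty.
    static Optional<Charset> sniff(byte[] bytes) {
        int length = Math.min(bytes.length, PREFIX_BYTES);
        // Decode the prefix as us-ascii; non-ASCII bytes become U+FFFD,
        // which is harmless because the markup of interest is plain ASCII.
        String prefix = new String(bytes, 0, length, StandardCharsets.US_ASCII);
        Matcher matcher = HTTP_EQUIV_CHARSET_PATTERN.matcher(prefix);
        if (!matcher.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(matcher.group(1)));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // The page declared a charset name this JVM cannot use.
            return Optional.empty();
        }
    }
}
```

A caller would presumably try MetaCharsetSniffer.sniff(bytes) first and fall back to the HTTP header charset or a detector only when the result is empty, which matches the precedence proposed in the issue.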

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
