[ https://issues.apache.org/jira/browse/TIKA-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-332:
-----------------------------

    Description: 
Currently Tika doesn't use the charset info that's optionally present in HTML 
documents, via the <meta http-equiv="Content-type" content="text/html; 
charset=xxx"> tag.

If the mime-type is detected as one that's handled by the HtmlParser, then 
the first 4-8K of the document should be decoded from bytes to text as 
US-ASCII, and then scanned using a regex something like:

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content"
        + "\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]",
        Pattern.CASE_INSENSITIVE);

If a charset is detected, it should take precedence over a charset in the 
HTTP response headers, and (obviously) be used to convert the bytes to text 
before the actual parsing of the document begins.
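A minimal sketch of this sniffing step might look like the following (the 
class and method names are hypothetical and the 8K sniff length is an 
assumption; this is not the eventual Tika implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {

    // Only look at the beginning of the document; the meta tag, if
    // present, lives in the <head>. 8K is an assumed upper bound.
    private static final int SNIFF_LENGTH = 8192;

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = Pattern.compile(
        "<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content"
        + "\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)['\"]",
        Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset declared in the http-equiv meta tag,
     * or null if no such declaration is found.
     */
    public static String sniffCharset(byte[] content) {
        int length = Math.min(content.length, SNIFF_LENGTH);
        // Decode as US-ASCII: the markup we're matching is pure ASCII,
        // and any non-ASCII bytes become replacement chars we ignore.
        String prefix = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = HTTP_EQUIV_CHARSET_PATTERN.matcher(prefix);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        byte[] html = ("<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=ISO-8859-1\"></head></html>")
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniffCharset(html));  // prints: ISO-8859-1
    }
}
```

A caller would then prefer a non-null result from sniffCharset() over the 
charset from the HTTP response headers when choosing the decoder.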

In a test I did of 100 random HTML pages, roughly 15% declared a charset in 
the meta tag that differed from the detected charset or the one in the HTTP 
response headers, so this is an important improvement to make. Without it, 
Tika isn't that useful for processing HTML pages.

A related problem is that the HtmlParser code doesn't use the 
CharsetDetector at all, which is another source of incorrectly decoded text. 
I'll file a separate issue about that.

  was:
Currently Tika doesn't use the charset info that's optionally present in HTML 
documents, via the <meta http-equiv="Content-type" content="text/html; 
charset=xxx"> tag.

If the mime-type is detected as being one that's handled by the HtmlParser, 
then the first 4-8K of text should be converted from bytes to us-ascii, and 
then scanned using a regex something like:

    private static final Pattern HTTP_EQUIV_CHARSET_PATTERN = 
Pattern.compile("<meta\\s+http-equiv\\s*=\\s*['\"]\\s*Content-Type['\"]\\s+content\\s*=\\s*['\"][^;]+;\\s*charset\\s*=\\s*([^'\"]+)\"");

If a charset is detected, this should take precedence over a charset in the 
HTTP response headers, and (obviously) used to convert the bytes to text before 
the actual parsing of the document begins.

In a test I did of 100 random HTML pages, roughly 15% contained charset info in 
the meta tag that wound up being different from the detected or HTTP response 
header charset, so this is a pretty important improvement to make. Without it, 
Tika isn't that useful for processing HTML pages.

I believe one of the reasons why ICU4J doesn't do a good job in detecting the 
charset for HTML pages is that the first 2K+ of HTML text is often all us-ascii 
markup, versus real content. I'll file a separate issue about how to improve 
charset detection for HTML pages.


> Use http-equiv meta tag charset info when processing HTML documents
> -------------------------------------------------------------------
>
>                 Key: TIKA-332
>                 URL: https://issues.apache.org/jira/browse/TIKA-332
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Critical
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
