[jira] Closed: (TIKA-333) Improve accuracy of charset detection for HTML pages

Ken Krugler (JIRA) Wed, 25 Nov 2009 10:39:05 -0800

     [ https://issues.apache.org/jira/browse/TIKA-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ken Krugler closed TIKA-333.
----------------------------

    Resolution: Not A Problem

After walking through the parse code, I see that the real problem is that the 
HtmlParser code doesn't use the CharsetDetector at all. If no charset is passed 
in, it just calls TagSoup, which by default uses the platform encoding. See 
http://home.ccil.org/~cowan/XML/tagsoup/.
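
To illustrate why falling back to the platform encoding is a problem, here is a 
minimal sketch in plain Java. It is not Tika or TagSoup code; the class and 
method names are hypothetical. It only shows that the same bytes decode to 
different text depending on which charset is assumed, which is what happens 
when a page's real encoding differs from the JVM default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Tika or TagSoup).
final class PlatformEncodingDemo {

    // Decode raw page bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Decode with whatever the JVM default happens to be, which is
    // roughly what a parser does when no charset is supplied.
    static String decodeWithPlatformDefault(byte[] raw) {
        return new String(raw, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(raw, StandardCharsets.UTF_8));
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        System.out.println(decodeWithPlatformDefault(raw));
    }
}
```

The third line of output depends on the machine the code runs on, which is the 
point: the result is only correct when the platform default happens to match 
the page.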

So I'll open another issue for the HtmlParser.

> Improve accuracy of charset detection for HTML pages
> ----------------------------------------------------
>
>                 Key: TIKA-333
>                 URL: https://issues.apache.org/jira/browse/TIKA-333
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Minor
>
> Charset detection for HTML pages is currently unreliable, because much of 
> the text at the beginning of the document is HTML markup.
> A simple solution would be to skip over the first 2K (assuming the document 
> is long enough) before passing bytes to ICU4J.
> A more complex solution would be to scan for title and body tags, and pass 
> the bytes found in each.
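
The two proposals in the issue description can be sketched as follows. This is 
an illustration under stated assumptions, not the approach Tika adopted: the 
class name, method names, and the minSample parameter are hypothetical, and the 
tag scan assumes an ASCII-compatible encoding (it would not find tags in 
UTF-16). The returned bytes are what one would then hand to a detector, for 
example ICU4J's CharsetDetector.setText(byte[]).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Tika or ICU4J).
final class DetectionSample {

    // Simple proposal: drop the first `skip` bytes, but only if at least
    // `minSample` bytes remain; otherwise return the whole document.
    static byte[] skipPrefix(byte[] doc, int skip, int minSample) {
        if (skip > 0 && doc.length - skip >= minSample) {
            return Arrays.copyOfRange(doc, skip, doc.length);
        }
        return doc.clone();
    }

    // More complex proposal: collect the bytes inside <title> and <body>.
    // Falls back to the whole document if neither tag is found.
    static byte[] titleAndBody(byte[] doc) {
        // ISO-8859-1 maps each byte to exactly one char, so char offsets
        // in this string are also byte offsets into `doc`.
        String ascii = new String(doc, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendElementBytes(doc, ascii, "title", out);
        appendElementBytes(doc, ascii, "body", out);
        return out.size() > 0 ? out.toByteArray() : doc.clone();
    }

    private static void appendElementBytes(byte[] doc, String ascii,
                                           String tag, ByteArrayOutputStream out) {
        int open = ascii.indexOf("<" + tag);
        if (open < 0) {
            return;
        }
        int start = ascii.indexOf('>', open);
        if (start < 0) {
            return;
        }
        start++; // first byte after the opening tag
        int end = ascii.indexOf("</" + tag, start);
        if (end < 0) {
            end = doc.length; // unclosed element: take the rest
        }
        out.write(doc, start, end - start);
    }
}
```

For example, DetectionSample.skipPrefix(bytes, 2048, 512) skips the first 2K 
only when at least 512 bytes are left over. Two caveats: a cut at an arbitrary 
byte offset can land in the middle of a multi-byte character, and the body 
bytes returned by titleAndBody still contain whatever markup sits inside the 
body element.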

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
