Improve accuracy of charset detection for HTML pages
----------------------------------------------------

                 Key: TIKA-333
                 URL: https://issues.apache.org/jira/browse/TIKA-333
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 0.5
            Reporter: Ken Krugler
            Priority: Minor


Charset detection for HTML pages is currently unreliable, because much of the 
text at the beginning of a document is HTML markup rather than content.

A simple solution would be to skip over the first 2K (assuming the document is 
long enough) before passing bytes to ICU4J.
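
As a rough illustration of the simple approach, the sketch below picks the bytes that would be handed to the detector. It is not Tika's implementation. The class name, the method name, and the MIN_SAMPLE_BYTES threshold used to decide whether a document is "long enough" are all assumptions made for this example. The returned array is what would then be passed to ICU4J, for example via CharsetDetector.setText(byte[]).

```java
import java.util.Arrays;

public class HtmlSampleSkipper {
    // Prefix to skip, taken from the "first 2K" in the description above.
    static final int SKIP_BYTES = 2048;

    // Hypothetical threshold: only skip when at least this many bytes
    // would remain for the detector. The issue does not specify a value.
    static final int MIN_SAMPLE_BYTES = 512;

    /**
     * Returns the bytes to hand to the charset detector: the document
     * without its first SKIP_BYTES when it is long enough, otherwise the
     * whole document unchanged.
     */
    static byte[] sampleForDetection(byte[] document) {
        if (document.length < SKIP_BYTES + MIN_SAMPLE_BYTES) {
            return document;
        }
        return Arrays.copyOfRange(document, SKIP_BYTES, document.length);
    }
}
```

One caveat with a fixed offset is that it can split a multi-byte character, so the first few bytes of the sample may not be valid in the real encoding.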

A more complex solution would be to scan for the title and body tags, and pass 
only the bytes found inside each to the detector.
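
A minimal sketch of the tag-scanning idea follows, again not Tika's implementation. The class and method names are made up for this example. It assumes an ASCII-compatible encoding, so that the tag names can be located by viewing the bytes as ISO-8859-1 (one char per byte, so string offsets equal byte offsets). That assumption does not hold for UTF-16 or UTF-32 input. The tag matching is deliberately naive and does not handle comments or scripts.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class HtmlTagSampler {

    /**
     * Returns the bytes inside the first title element followed by the
     * bytes inside the first body element, each followed by a space.
     * Falls back to the whole document when neither tag is found.
     */
    static byte[] sampleForDetection(byte[] document) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes in
        // this lowercased view are also indexes into the byte array.
        String view = new String(document, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendElement(view, document, "title", out);
        appendElement(view, document, "body", out);
        return out.size() > 0 ? out.toByteArray() : document;
    }

    private static void appendElement(String view, byte[] document,
                                      String tag, ByteArrayOutputStream out) {
        int open = view.indexOf("<" + tag);
        if (open < 0) {
            return; // tag not present
        }
        int start = view.indexOf('>', open);
        if (start < 0) {
            return; // unterminated opening tag
        }
        start++; // first byte after the opening tag
        int end = view.indexOf("</" + tag, start);
        if (end < 0) {
            end = document.length; // no closing tag: take the rest
        }
        out.write(document, start, end - start);
        out.write(' '); // keep title and body text apart
    }
}
```

Note that the body bytes still contain whatever inline markup the page uses; stripping those tags as well would be a further refinement beyond what is described here.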

