Improve accuracy of charset detection for HTML pages
----------------------------------------------------
Key: TIKA-333
URL: https://issues.apache.org/jira/browse/TIKA-333
Project: Tika
Issue Type: Improvement
Affects Versions: 0.5
Reporter: Ken Krugler
Priority: Minor

Charset detection for HTML pages currently doesn't work very well, because much of the text at the beginning of the document is HTML markup rather than content. A simple solution would be to skip over the first 2K (assuming the document is long enough) before passing bytes to ICU4J. A more complex solution would be to scan for the title and body tags, and pass the bytes found in each.
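
To make the two proposals concrete, here is a minimal sketch of the byte selection only. It is not Tika's implementation. The class and method names (HtmlCharsetSample, skipMarkupPrefix, titleAndBodyBytes) and the minRemaining parameter are hypothetical, and "long enough" is assumed to mean that at least minRemaining bytes are left after the skip. The tag scan matches ASCII bytes, so it assumes an ASCII-compatible encoding (it would not find tags in UTF-16), and it leaves any markup nested inside the body in place.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper: chooses which bytes of an HTML page to hand to a charset detector.
final class HtmlCharsetSample {

    // Simple approach: drop the first `skip` bytes (e.g. 2048), but only when
    // at least `minRemaining` bytes would be left (assumed meaning of "long enough").
    static byte[] skipMarkupPrefix(byte[] data, int skip, int minRemaining) {
        if (data.length - skip < minRemaining) {
            return data; // too short: keep everything
        }
        return Arrays.copyOfRange(data, skip, data.length);
    }

    // More complex approach: concatenate the bytes inside <title> and <body>.
    static byte[] titleAndBodyBytes(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendElement(data, "title", out);
        appendElement(data, "body", out);
        return out.toByteArray();
    }

    // Appends the bytes between the first "<name ...>" and the next "</name".
    // If the closing tag is missing, everything up to the end of input is used.
    private static void appendElement(byte[] data, String name, ByteArrayOutputStream out) {
        int open = indexOfIgnoreCase(data, "<" + name, 0);
        if (open < 0) {
            return;
        }
        int start = indexOf(data, (byte) '>', open);
        if (start < 0) {
            return;
        }
        start++; // first byte after the opening tag
        int end = indexOfIgnoreCase(data, "</" + name, start);
        if (end < 0) {
            end = data.length;
        }
        out.write(data, start, end - start);
    }

    // Finds a lowercase ASCII needle in the byte array, ignoring ASCII case.
    private static int indexOfIgnoreCase(byte[] data, String lowerAsciiNeedle, int from) {
        int n = lowerAsciiNeedle.length();
        for (int i = from; i + n <= data.length; i++) {
            int j = 0;
            while (j < n && toLower(data[i + j]) == lowerAsciiNeedle.charAt(j)) {
                j++;
            }
            if (j == n) {
                return i;
            }
        }
        return -1;
    }

    private static int indexOf(byte[] data, byte target, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == target) {
                return i;
            }
        }
        return -1;
    }

    private static int toLower(byte b) {
        return (b >= 'A' && b <= 'Z') ? b + 32 : b;
    }
}
```

The resulting array would then be passed to the detector in place of the raw page bytes, for example via ICU4J's CharsetDetector.setText(byte[]) followed by detect(). Whether either variant actually improves accuracy would need to be checked against real pages.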