Content encoding in HtmlParser
------------------------------

                 Key: TIKA-273
                 URL: https://issues.apache.org/jira/browse/TIKA-273
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
            Reporter: Piotr B.

Sometimes the content encoding of an HTML document is specified outside the document itself, for instance in a MIME mail message with an HTML attachment. This is a problem for text/html documents that have no http-equiv meta tag. Currently there is no way to pass this information to the parser.

My fix for the parse method in HtmlParser.java:

-        parser.parse(new InputSource(stream));
+        InputSource source = new InputSource(stream);
+        String encoding = metadata.get(Metadata.CONTENT_ENCODING);
+        if (encoding != null) {
+            source.setEncoding(encoding);
+        }
+        parser.parse(source);
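
The idea behind the proposed fix can be tried in isolation. The following is only a minimal sketch, not the actual Tika code: it uses a plain Map as a hypothetical stand-in for Tika's Metadata object, and the class name EncodingHintSketch, the helper buildSource, and the "Content-Encoding" key are assumptions made for illustration. Only org.xml.sax.InputSource and its setEncoding/getEncoding methods come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.InputSource;

final class EncodingHintSketch {

    // Assumed key name, standing in for Metadata.CONTENT_ENCODING.
    static final String CONTENT_ENCODING = "Content-Encoding";

    /**
     * Wraps the stream in an InputSource and, if the caller supplied an
     * encoding hint in the metadata map, passes it on to the SAX parser.
     * Without a hint the InputSource is left untouched, so the parser
     * falls back to its own detection.
     */
    static InputSource buildSource(InputStream stream, Map<String, String> metadata) {
        InputSource source = new InputSource(stream);
        String encoding = metadata.get(CONTENT_ENCODING);
        if (encoding != null) {
            source.setEncoding(encoding);
        }
        return source;
    }

    public static void main(String[] args) {
        // An HTML fragment with no http-equiv declaration, encoded as ISO-8859-1.
        byte[] html = "<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Encoding known from outside the document, e.g. from a mail header.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_ENCODING, "ISO-8859-1");

        InputSource hinted = buildSource(new ByteArrayInputStream(html), metadata);
        InputSource unhinted = buildSource(new ByteArrayInputStream(html), new HashMap<>());

        System.out.println("with hint:    " + hinted.getEncoding());
        System.out.println("without hint: " + unhinted.getEncoding());
    }
}
```

In this sketch the caller decides whether a hint exists. A document that declares its own encoding may still disagree with the external hint, and how such a conflict should be resolved is left open here.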