[ https://issues.apache.org/jira/browse/TIKA-273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-273.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Thanks! Fixed as suggested in revision 813626.

> Content encoding in HtmlParser
> ------------------------------
>
>                 Key: TIKA-273
>                 URL: https://issues.apache.org/jira/browse/TIKA-273
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4, 0.5
>            Reporter: Piotr B.
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>
> Sometimes the content encoding is specified outside the HTML document, for
> instance in a MIME mail with an HTML attachment.
> This is a problem for text/html documents without an http-equiv section:
> currently there is no way to pass this information to the parser.
> My fix for the parse method in HtmlParser.java:
> -        parser.parse(new InputSource(stream));
> +        InputSource source = new InputSource(stream);
> +        String encoding = metadata.get(Metadata.CONTENT_ENCODING);
> +        if (encoding != null) {
> +            source.setEncoding(encoding);
> +        }
> +        parser.parse(source);

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
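
For readers who want to try the idea behind the patch outside of Tika, here is a minimal sketch. It is not the committed HtmlParser code: the class name EncodingHintSketch and the helper toInputSource are hypothetical, a plain Map stands in for Tika's Metadata object, and the key "Content-Encoding" is assumed to match Metadata.CONTENT_ENCODING. Only org.xml.sax.InputSource and its setEncoding/getEncoding methods come from the standard SAX API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.InputSource;

public class EncodingHintSketch {

    // Assumed to mirror the value of Tika's Metadata.CONTENT_ENCODING.
    public static final String CONTENT_ENCODING = "Content-Encoding";

    // Hypothetical helper: wraps the stream in an InputSource and applies
    // the externally supplied encoding, if the caller provided one.
    public static InputSource toInputSource(InputStream stream,
                                            Map<String, String> metadata) {
        InputSource source = new InputSource(stream);
        String encoding = metadata.get(CONTENT_ENCODING);
        if (encoding != null) {
            source.setEncoding(encoding);
        }
        return source;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>text</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Encoding known from outside the document, e.g. a MIME part header.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_ENCODING, "ISO-8859-2");

        InputSource hinted =
                toInputSource(new ByteArrayInputStream(html), metadata);
        System.out.println("with hint: " + hinted.getEncoding());

        // Without a hint the encoding stays unset and the parser must guess.
        InputSource unhinted =
                toInputSource(new ByteArrayInputStream(html), new HashMap<>());
        System.out.println("without hint: " + unhinted.getEncoding());
    }
}
```

The null check matters: when the caller supplies no encoding, the InputSource is left untouched, so the parser's own detection (for example from an http-equiv meta tag) still applies.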