Content encoding in HtmlParser
------------------------------

                 Key: TIKA-273
                 URL: https://issues.apache.org/jira/browse/TIKA-273
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
            Reporter: Piotr B.


Sometimes the content encoding is specified outside the HTML document, for
instance in a MIME mail with an HTML attachment.
This is a problem for text/html documents that have no http-equiv meta tag:
currently there is no way to pass this information to the parser.

My fix for the parse method in HtmlParser.java:

-        parser.parse(new InputSource(stream));
+        InputSource source = new InputSource(stream);
+        String encoding = metadata.get(Metadata.CONTENT_ENCODING);
+        if (encoding != null) {
+            source.setEncoding(encoding);
+        }
+        parser.parse(source);
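
For illustration, here is a minimal standalone sketch of the same idea that runs without Tika on the classpath. It is not the actual HtmlParser code: the class name EncodingHintSketch, the method toInputSource, and the use of a plain Map in place of Tika's Metadata are all hypothetical, and the key "Content-Encoding" is assumed to match the value of Metadata.CONTENT_ENCODING.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.InputSource;

public class EncodingHintSketch {

    // Assumed to correspond to Metadata.CONTENT_ENCODING in Tika.
    static final String CONTENT_ENCODING = "Content-Encoding";

    // Wraps the stream in an InputSource and, if the caller supplied an
    // encoding in the metadata, passes it on as a hint to the SAX parser.
    static InputSource toInputSource(InputStream stream, Map<String, String> metadata) {
        InputSource source = new InputSource(stream);
        String encoding = metadata.get(CONTENT_ENCODING);
        if (encoding != null) {
            source.setEncoding(encoding);
        }
        return source;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>test</body></html>".getBytes(StandardCharsets.ISO_8859_1);

        // The encoding comes from outside the document, e.g. from a MIME part header.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_ENCODING, "ISO-8859-2");

        InputSource source = toInputSource(new ByteArrayInputStream(html), metadata);
        System.out.println("Encoding hint: " + source.getEncoding());
    }
}
```

When the metadata carries no encoding, the InputSource is left untouched, so the parser falls back to whatever detection it did before the patch.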

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
