Some parsers produce glued words
--------------------------------

                 Key: TIKA-343
                 URL: https://issues.apache.org/jira/browse/TIKA-343
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.5, 0.6
            Reporter: Piotr B.


Some parsers ignore word/line delimiters.

Document:
"<html><head></head><body>test<br>test</body></html>"
is decoded by HtmlParser to "testtest".

I think the HtmlParser.mapSafeElement method should be extended with:

        if ("BR".equals(name)) return "br";
        if ("DIV".equals(name)) return "div";
        if ("HR".equals(name)) return "hr";
        if ("ADDRESS".equals(name)) return "address";
        if ("FIELDSET".equals(name)) return "fieldset";
        if ("FORM".equals(name)) return "form";
        if ("NOSCRIPT".equals(name)) return "noscript";
        if ("NOFRAMES".equals(name)) return "noframes";
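
The same mapping could also be written as a set lookup. This is only a
sketch: the class name SafeElementSketch, the null return for unmapped
elements, and the assumption that names arrive upper-cased are hypothetical
and not taken from Tika's source. It needs Java 9 or later for Set.of.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not Tika's actual HtmlParser code.
final class SafeElementSketch {

    // Elements proposed in this report; each one marks a word or line boundary.
    private static final Set<String> BOUNDARY_ELEMENTS = Set.of(
            "BR", "DIV", "HR", "ADDRESS", "FIELDSET", "FORM", "NOSCRIPT", "NOFRAMES");

    // Returns the lower-case safe name, or null if the element is not mapped.
    // Assumption: names arrive upper-cased, as in the snippet above.
    static String mapSafeElement(String name) {
        return BOUNDARY_ELEMENTS.contains(name) ? name.toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(mapSafeElement("BR"));     // a mapped element
        System.out.println(mapSafeElement("BLINK"));  // an unmapped element
    }
}
```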

Also, application/xml documents are parsed by removing unknown tags instead of 
replacing them with spaces.
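
A minimal sketch of the behaviour I would expect for XML, using only the
JDK SAX parser rather than Tika: every element boundary contributes a space,
and runs of whitespace are collapsed at the end. The class name
SpacedTextSketch and the method extractText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code: tags become spaces instead of vanishing.
final class SpacedTextSketch {

    static String extractText(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                out.append(' ');  // an opening tag separates words
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append(' ');  // so does a closing tag
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        // Collapse the extra spaces introduced at element boundaries.
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // prints "test test" rather than "testtest"
        System.out.println(extractText("<doc><a>test</a><b>test</b></doc>"));
    }
}
```

Inserting a space at every boundary is the bluntest option. It would also
split text at inline elements, so a real fix might limit it to unknown or
block-level tags.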



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
