HtmlParser calls characters() with post-body data before processing the 
terminating body element.
-------------------------------------------------------------------------------------------------

                 Key: TIKA-286
                 URL: https://issues.apache.org/jira/browse/TIKA-286
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Ken Krugler
            Priority: Minor


Using this example data:

{noformat}
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
       "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8">
        <title>Untitled</title>
        <base href="http://newdomain.com">
</head>
<body>

<a href="link" target="_blank">link1</a>
<a href="http://domain.com/link" target="_blank">link2</a>

</body>
</html>
{noformat}

The handler's characters() method gets called with the following text, one
call per line:

Untitled
\n\n
link1
\n
link2
\n\n
\n
\n

The first six calls make sense to me.

The last two calls (with a single \n) happen just before endElement("body") is 
called, and this is unexpected.

From the offset in the buffer passed to characters(), these are the returns
_after_ the </body> tag. If I put any number of returns between the </body>
and </html> tags, they all get passed to characters() before the
endElement("body") call.
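
One way to make the call ordering visible is a handler that logs every characters() and endElement() call in sequence. The sketch below is only an illustration: the EventRecorder class and its record() helper are hypothetical names, not part of Tika, and to stay self-contained it drives the handler with the JDK's own SAX parser on a small well-formed snippet rather than with HtmlParser. To check the behaviour reported here, the same handler would be passed to HtmlParser's parse method instead (that call is left out because its signature depends on the Tika version).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records SAX events in the order they arrive.
public class EventRecorder extends DefaultHandler {
    // Ordered log of the events seen so far.
    public final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Escape newlines so whitespace-only calls are readable in the log.
        String text = new String(ch, start, length).replace("\n", "\\n");
        events.add("characters(" + text + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement(" + qName + ")");
    }

    // Parses a well-formed snippet with the JDK SAX parser (not Tika)
    // and returns the recorded event log.
    public static List<String> record(String xml) throws Exception {
        EventRecorder recorder = new EventRecorder();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            recorder);
        return recorder.events;
    }

    public static void main(String[] args) throws Exception {
        // A newline sits between </body> and </html>, as in the report.
        for (String event : record("<html><body>x</body>\n</html>")) {
            System.out.println(event);
        }
    }
}
```

In the log, the position of endElement(body) relative to the trailing characters(\n) entries shows whether the post-body whitespace was delivered before or after the body element was closed.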

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
