[ https://issues.apache.org/jira/browse/TIKA-286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler closed TIKA-286.
----------------------------

    Resolution: Won't Fix

Thanks for the info, Uwe.

I filed
https://sourceforge.net/tracker/?func=detail&aid=2868326&group_id=195122&atid=952178
against CyberNeko. It's a minor issue, and I can easily fix up my parser
comparison code to ignore trailing returns/newlines.
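
As a rough illustration of that workaround, here is a minimal sketch of a SAX handler that collects character data and ignores trailing returns/newlines when comparing output. The class and method names (TrailingNewlineIgnoringHandler, normalizedText, sameText) are hypothetical and are not taken from the actual parser comparison code; only the JDK's org.xml.sax classes are assumed.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not the actual comparison code): collects all text
// passed to characters() and exposes it with trailing CR/LF removed.
class TrailingNewlineIgnoringHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append exactly the slice of the buffer the parser handed us.
        text.append(ch, start, length);
    }

    // Returns the collected text without any trailing '\n' or '\r'.
    String normalizedText() {
        int end = text.length();
        while (end > 0 && (text.charAt(end - 1) == '\n' || text.charAt(end - 1) == '\r')) {
            end--;
        }
        return text.substring(0, end);
    }

    // Two handlers match if their text differs only in trailing newlines.
    static boolean sameText(TrailingNewlineIgnoringHandler a, TrailingNewlineIgnoringHandler b) {
        return a.normalizedText().equals(b.normalizedText());
    }
}
```

Each parser under comparison would be given its own handler instance, and sameText() would then be used in place of a strict string comparison.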


> HtmlParser calls characters() with post-body data before processing the 
> terminating body element.
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-286
>                 URL: https://issues.apache.org/jira/browse/TIKA-286
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Priority: Minor
>
> Using this example data:
> {noformat}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
>        "http://www.w3.org/TR/html4/strict.dtd">
> <html lang="en">
> <head>
>       <meta http-equiv="content-type" content="text/html; charset=utf-8">
>       <title>Untitled</title>
>       <base href="http://newdomain.com">
> </head>
> <body>
> <a href="link" target="_blank">link1</a>
> <a href="http://domain.com/link" target="_blank">link2</a>
> </body>
> </html>
> {noformat}
> The handler's characters() method gets called with the following text:
> Untitled
> \n\n
> link1
> \n
> link2
> \n\n
> \n
> \n
> The first six calls make sense to me.
> The last two calls (with a single \n) happen just before endElement("body") 
> is called, and this is unexpected.
> From the offset in the buffer passed to characters(), these are the returns 
> _after_ the </body> tag. If I put any number of returns between the 
> </body> and </html> tags, they all get passed to characters() before the 
> endElement("body") call.
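
For comparison, the event order the report expects can be sketched with a small recording handler. This is only an illustration under stated assumptions: it uses the JDK's built-in XML SAX parser on a well-formed fragment, not Tika's HtmlParser or CyberNeko, and the EventRecorder class and its record() method are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical recorder: logs SAX callbacks in the order they arrive.
class EventRecorder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Make newlines visible and merge adjacent chunks, since a parser
        // may split one run of text across several characters() calls.
        String chunk = new String(ch, start, length).replace("\n", "\\n");
        int last = events.size() - 1;
        if (last >= 0 && events.get(last).startsWith("chars:")) {
            events.set(last, events.get(last) + chunk);
        } else {
            events.add("chars:" + chunk);
        }
    }

    // Parses well-formed XML with the JDK parser (an assumption of this
    // sketch; the issue itself concerns an HTML parser) and returns the log.
    static List<String> record(String xml) {
        EventRecorder recorder = new EventRecorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return recorder.events;
    }
}
```

With input such as "<html><body>link2\n</body>\n\n</html>", the returns that follow </body> should be logged after the "end:body" entry, which is the ordering the report expected from HtmlParser.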

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
