[ https://issues.apache.org/jira/browse/TIKA-286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler closed TIKA-286.
----------------------------

    Resolution: Won't Fix

Thanks for the info, Uwe. I filed https://sourceforge.net/tracker/?func=detail&aid=2868326&group_id=195122&atid=952178 against CyberNeko.

Minor issue, and I can easily fix up my parser comparison code to ignore trailing returns/newlines.

> HtmlParser calls characters() with post-body data before processing the
> terminating body element.
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-286
>                 URL: https://issues.apache.org/jira/browse/TIKA-286
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Priority: Minor
>
> Using this example data:
> {noformat}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
>     "http://www.w3.org/TR/html4/strict.dtd">
> <html lang="en">
> <head>
> <meta http-equiv="content-type" content="text/html; charset=utf-8">
> <title>Untitled</title>
> <base href="http://newdomain.com">
> </head>
> <body>
> <a href="link" target="_blank">link1</a>
> <a href="http://domain.com/link" target="_blank">link2</a>
> </body>
> </html>
> {noformat}
> The handler's characters() method gets called with the following text
> Untitled
> \n\n
> link1
> \n
> link2
> \n\n
> \n
> \n
> The first six calls make sense to me.
> The last two calls (with a single \n) happen just before endElement("body")
> is called, and this is unexpected.
> From the offset in the buffer, passed to characters(), these are the return
> _after_ the </body> tag. If I put any number of returns in between the
> </body> and </html>, they all get passed to characters() before the
> endElement("body") call.
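
For illustration only, here is a minimal sketch of the kind of workaround mentioned in the closing comment: comparing parser output while ignoring trailing returns/newlines. It is not the reporter's actual comparison code and not part of Tika. The class name TrailingNewlineTolerantHandler and the method normalizedText() are hypothetical. It assumes the comparison only needs the concatenated character data, so it appends every characters() chunk and strips trailing '\r' and '\n' on read.

```java
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical helper (not part of Tika): collects SAX character data and
 * exposes it with trailing line breaks removed, so that extra newlines
 * reported after </body> do not affect a comparison.
 */
class TrailingNewlineTolerantHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in arbitrary chunks, so accumulate it
        // rather than comparing individual characters() calls.
        text.append(ch, start, length);
    }

    /** Returns the collected text without any trailing '\r' or '\n'. */
    public String normalizedText() {
        int end = text.length();
        while (end > 0
                && (text.charAt(end - 1) == '\n' || text.charAt(end - 1) == '\r')) {
            end--;
        }
        return text.substring(0, end);
    }
}
```

Two parsers that differ only in how many trailing newlines they report would then yield equal normalizedText() values. Newlines between elements inside the body are left untouched.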