Hi all,

[Resending with an image instead of the HTML example - previous attempt was rejected by Apache.org as being spam...weird]

I'm doing a comparison of the Tika HtmlParser with the original Nutch HTML parsing code.

I've run into some issues, and wanted input before filing any Jira requests/bugs.

As an example of a test document:



1. The handler's startElement() never gets called with the <base> tag. I'm assuming this is because <base> isn't part of the SAFE_ELEMENTS set.

But without the base tag, you can't correctly resolve relative URLs in anchor tags.

Seems like <base> should be part of the SAFE_ELEMENTS set.

How was this set of tags derived?
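
To make the relative-URL problem concrete, here is a minimal sketch using only java.net.URI. The class name, the resolve() helper, and the example.com/example.org URLs are all made up for illustration; they are not taken from Tika, Nutch, or my test document.

```java
import java.net.URI;

public class BaseHrefDemo {
    // Resolve an anchor's href against the <base href> if the handler saw one,
    // otherwise against the URL the document was fetched from.
    // (Hypothetical helper, not part of Tika or Nutch.)
    static String resolve(String documentUrl, String baseHref, String href) {
        URI context = URI.create(documentUrl);
        if (baseHref != null) {
            // The base href may itself be relative to the document URL.
            context = context.resolve(baseHref);
        }
        return context.resolve(href).toString();
    }

    public static void main(String[] args) {
        String doc = "http://example.com/docs/page.html";
        // No <base> reported: the link resolves against the document URL.
        System.out.println(resolve(doc, null, "link1.html"));
        // <base href="http://example.org/other/"> reported: a different result.
        System.out.println(resolve(doc, "http://example.org/other/", "link1.html"));
    }
}
```

The two calls give different absolute URLs for the same href, which is why a handler that never sees <base> can end up with wrong outlinks.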

2. The handler's characters() method gets called with the following text, one line per call:

Untitled
\n\n
link1
\n
link2
\n\n
\n
\n

The first six calls make sense to me.

The last two calls (with a single \n) happen just before endElement("body") is called, and this is unexpected.

From the offset in the buffer passed to characters(), these are the returns _after_ the </body> tag. If I put any number of returns between </body> and </html>, they all get passed to characters() before the endElement("body") call. This seems like a bug.
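
For anyone who wants to check the event ordering, below is a rough sketch of the kind of logging handler I mean. As a stand-in it runs the JDK's own SAX parser over a small well-formed snippet, so it only shows the ordering I would expect from a plain XML parser; it does not go through Tika's HtmlParser at all. The class name and the record() helper are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventLog extends DefaultHandler {
    // Events in the order the parser reported them.
    final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Escape newlines so whitespace-only calls stay visible in the log.
        String text = new String(ch, start, length).replace("\n", "\\n");
        events.add("characters(start=" + start + ", text=\"" + text + "\")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // qName is used because the default JDK factory is not namespace-aware.
        events.add("endElement(" + qName + ")");
    }

    // Hypothetical helper: parse a well-formed snippet and return the event log.
    static List<String> record(String xml) {
        SaxEventLog handler = new SaxEventLog();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        // Two returns between </body> and </html>, as in my test.
        for (String event : record("<html><body>link2</body>\n\n</html>")) {
            System.out.println(event);
        }
    }
}
```

With a plain XML parser I would expect the trailing returns to be reported after endElement(body), not before it. Since DefaultHandler implements ContentHandler, the same handler should be usable with the Tika parser to compare the two orderings.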

Has anybody else noticed this?

Thanks,

-- Ken



--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
