Html parser questions

Ken Krugler Thu, 24 Sep 2009 17:19:22 -0700

Hi all,

[Resending with an image instead of the HTML example - previous attempt was rejected by Apache.org as being spam...weird]

I'm doing a comparison of the Tika HtmlParser with the original Nutch HTML parsing code.

I've run into some issues, and wanted input before filing any Jira requests/bugs.


As an example of a test document:

1. The handler's startElement() never gets called with the <base> tag. I'm assuming this is because <base> isn't part of the SAFE_ELEMENTS set.

But without the base tag, you can't correctly resolve relative URLs in anchor tags.


Seems like <base> should be part of the SAFE_ELEMENTS set.

How was this set of tags derived?
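To illustrate why the <base> tag matters here, a minimal sketch using only java.net.URI from the JDK. The class name, method name, and URLs are hypothetical and chosen for illustration; this is not how Tika or Nutch resolve links, just the general rule that a <base href> replaces the document URL as the starting point for relative links.

```java
import java.net.URI;

public class BaseHrefExample {
    // Hypothetical helper: resolve an anchor's href against the document URL,
    // or against the <base href> value when the document declares one.
    static String resolve(String documentUrl, String baseHref, String href) {
        URI base = URI.create(documentUrl);
        if (baseHref != null) {
            // A <base href> may itself be relative to the document URL.
            base = base.resolve(baseHref);
        }
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        String doc = "http://example.com/dir/page.html";
        // Without <base>, the link resolves against the document's own URL.
        System.out.println(resolve(doc, null, "link1.html"));
        // With <base>, the same href points to a different location.
        System.out.println(resolve(doc, "http://other.example.org/root/", "link1.html"));
    }
}
```

If the handler never sees the <base> element, the second case cannot be told apart from the first, so every relative link ends up resolved against the wrong URL.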

2. The handler's characters() method gets called with the following text

Untitled
\n\n
link1
\n
link2
\n\n
\n
\n

The first six calls make sense to me.

The last two calls (with a single \n) happen just before endElement("body") is called, and this is unexpected.

From the offset in the buffer passed to characters(), these are the returns _after_ the </body> tag. If I put any number of returns in between the </body> and </html>, they all get passed to characters() before the endElement("body") call. This seems like a bug.
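For comparison, a small sketch of the event order I would expect, recorded with the JDK's own SAX parser on a well-formed XHTML string. It does not use Tika or reproduce the problem above; the class name, the Recorder handler, and the sample input are made up for illustration. The point is only that whitespace sitting between </body> and </html> would normally be reported after endElement("body").

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventOrderExample {
    // Hypothetical handler that records each SAX callback as a short string.
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Make newlines visible in the recorded output.
            events.add("chars:" + new String(ch, start, length).replace("\n", "\\n"));
        }
    }

    // Parse a well-formed document with the JDK SAX parser and return the events.
    static List<String> record(String xhtml) {
        try {
            Recorder recorder = new Recorder();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), recorder);
            return recorder.events;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // One return between </body> and </html>, as in the case described above.
        for (String event : record("<html><body>text</body>\n</html>")) {
            System.out.println(event);
        }
    }
}
```

Plugging the same kind of recording handler into the Tika HtmlParser should make it easy to compare the two orderings side by side.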


Has anybody else noticed this?

Thanks,

-- Ken



--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
