In the following HTML document, the <a> tag is nested inside the <h1> tag,
which is in turn nested inside the <p> tag:
-------------------
<!DOCTYPE html>
<html>
<body>
<p><h1><a href="http://www.google.com">GOOGLE!</a></h1></p>
</body>
</html>
-------------------
But when I parse it with the Tika 1.5 HtmlParser, the <a> and <h1> nodes
both come out as direct children of the <p> tag, i.e. as siblings rather
than with <a> nested inside <h1>.

The same error happens when I replace the <h1> tag with the other header tags
<h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
(I haven't experimented with other replacements.)
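
For reference, here is a minimal sketch of one way to print the nesting that a parser reports through SAX events. It uses only the JDK's own XML parser, so it shows the nesting I expect for a document shaped like the one above, not Tika's output. The class and method names (NestingOutline, OutlineHandler, outline) are made up for this illustration. My assumption is that the same handler could be passed to Tika's HtmlParser.parse(stream, handler, metadata, context) to compare the two results.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: prints one line per start tag,
// indented by how deeply the element is nested.
final class NestingOutline {

    /** Records each start tag on its own line, indented two spaces per level. */
    static class OutlineHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private int depth = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(qName).append('\n');
            depth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /** Parses well-formed markup with the JDK SAX parser and returns the outline. */
    static String outline(String markup) throws Exception {
        OutlineHandler handler = new OutlineHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html>\n<body>\n"
                + "<p><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></p>\n"
                + "</body>\n</html>";
        // Each level of indentation is one level of nesting.
        System.out.print(outline(doc));
    }
}
```

If <a> and <h1> show up at the same indentation under <p> when the handler is fed by the HTML parser instead, that is the behaviour I am describing.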

This seems to be a basic issue.
Any help would be deeply appreciated.

Cheers,
Devarajan
