In the following HTML document, the <a> is inside the <h1> tag which is inside the <p> tag:

-------------------
<!DOCTYPE html>
<html>
<body>
<div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
</body>
</html>
-------------------

But when I parse it with Tika 1.5 HtmlParser, it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.
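One way to see the nesting a parser actually reports is to attach a SAX handler that prints each start tag indented by its depth. The sketch below is only an illustration: the class and method names (NestingDump, OutlineHandler, outline) are made up for this example, and it drives the handler with the JDK's own XML parser so that it runs without Tika on the classpath. The same handler could, I assume, be passed to Tika's HtmlParser.parse(stream, handler, metadata, context) to compare the two event streams.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class NestingDump {
    /** Hypothetical helper: records each start tag, indented by its depth in the SAX event stream. */
    static class OutlineHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private int depth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(qName).append('\n');
            depth++; // children of this element are one level deeper
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--; // back to the parent's level
        }
    }

    /** Parses well-formed markup with the JDK SAX parser (not Tika) and returns the indented outline. */
    static String outline(String markup) throws Exception {
        OutlineHandler handler = new OutlineHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<!DOCTYPE html><html><body>"
                + "<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>"
                + "</body></html>";
        System.out.print(outline(doc));
    }
}
```

For the snippet above, the JDK parser reports html, body, div, h1 and a each nested one level deeper than the previous one. If the outline produced through Tika shows h1 and a at the same depth, that would confirm the flattening is in the events Tika emits rather than in the code that builds the tree from them.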
The same error happens when I replace the <h1> tag with other header tags (<h2> ... <h5>), and/or the <p> tag with a <div> or <span> tag. (I haven't experimented with other replacements.) This seems to be a basic issue. Any help would be deeply appreciated.

Cheers,
Devarajan
