Hi Devarajan,

Please see Chapter 5 of the Tika in Action book for more detail on this. The short answer is that the parsed XHTML representation of *any* upstream file does not necessarily correspond to the upstream (X)HTML representation of the file. The XHTML is an intermediate format that Tika uses to represent the parsed structure and content around the text. That is, you have the following scenarios:

PDF->XHTML->content handlers
XHTML->XHTML->content handlers
Word Doc->XHTML->content handlers
Image->XHTML->content handlers
.. etc

Note that the XHTML intermediate is the structured representation of the information around the text in the document (including its metadata). That XHTML is then passed, as SAX events, into org.xml.sax.ContentHandler implementations for stream-based processing downstream.

Cheers,
Chris

------------------------
Chris Mattmann
[email protected]

-----Original Message-----
From: Devaraja Swami <[email protected]>
Reply-To: <[email protected]>
Date: Monday, September 8, 2014 7:12 PM
To: <[email protected]>
Subject: HTML parsing error with <a> tag inside <h1> tag

>In the following HTML document, the <a> is inside the <h1> tag which is
>inside the <p> tag:
>-------------------
><!DOCTYPE html>
><html>
><body>
> <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
></body>
></html>
>-------------------
>But when I parse it with Tika 1.5 HtmlParser,
>it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.
>
>The same error happens when I replace the <h1> tag with other header tags
><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>[Haven't experimented with other replacements].
>
>This seems to be a basic issue.
>Any help would be deeply appreciated.
>
>Cheers,
>Devarajan
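
To make the "content handlers" end of the pipeline in the reply above concrete, here is a minimal sketch of a SAX handler that records the element nesting it is told about. It uses only the JDK's SAX classes, not Tika itself. The class name NestingRecorder, the helper pathsOf, and the sample XHTML string are hypothetical names chosen for illustration, and the JDK SAX parser merely stands in for whatever produces the events.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the path of every element reported to it.
// DefaultHandler implements org.xml.sax.ContentHandler with no-op methods.
final class NestingRecorder extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();   // currently open elements
    private final List<String> paths = new ArrayList<>();    // one entry per start tag

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
        paths.add("/" + String.join("/", open));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();
    }

    List<String> paths() {
        return paths;
    }

    // Stand-in for a parser: turns well-formed XHTML text into SAX events
    // and streams them into a fresh handler.
    static List<String> pathsOf(String xhtml) throws Exception {
        NestingRecorder handler = new NestingRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.paths();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><h1>"
                + "<a href=\"http://www.google.com\">GOOGLE!</a>"
                + "</h1></div></body></html>";
        // Print one path per element, in document order.
        pathsOf(xhtml).forEach(System.out::println);
    }
}
```

With Tika on the classpath, a handler like this could be passed as the ContentHandler argument of a parser's parse(InputStream, ContentHandler, Metadata, ParseContext) call. The paths it records would then describe the XHTML that Tika emits, which, as noted above, need not match the nesting in the input file.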
