Hi Devarajan,

Please see Chapter 5 of the Tika in Action book for more
detail on this. The short answer is that the XHTML Tika
produces when it parses *any* upstream file does not necessarily
correspond to the upstream (X)HTML markup of that file, even
when the file is itself HTML. The XHTML is an intermediate format
that Tika uses to represent the parsed structure around the text.
That is, every input type goes through the same pipeline:

PDF->XHTML->content handlers
XHTML->XHTML->content handlers
Word Doc->XHTML->content handlers
Image->XHTML->content handlers
...etc.

Note that the intermediate XHTML is the structured representation
of the information around the text in the document (including
its metadata). That XHTML is then passed, as SAX events, to an
org.xml.sax.ContentHandler for stream-based processing downstream.
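
To make the handler side concrete, here is a minimal sketch of a
ContentHandler that records the element nesting it is handed. It is
only an illustration: the class and method names (SaxOutline,
OutlineHandler, outline) are made up for this example, and the JDK's
own SAX parser stands in for Tika so the snippet runs without any
extra jars. With Tika you would pass the same kind of handler to
Parser.parse(stream, handler, metadata, context), and the nesting
reported could differ from what the JDK parser shows here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxOutline {

    /** Hypothetical handler: records each element and text run, indented by depth. */
    static class OutlineHandler extends DefaultHandler {
        final List<String> lines = new ArrayList<>();
        private int depth = 0;

        private String indent() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            return sb.toString();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            lines.add(indent() + qName);
            depth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text run in several calls; good enough for a sketch.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                lines.add(indent() + "\"" + text + "\"");
            }
        }
    }

    /** Streams the given well-formed XHTML through the handler and returns the outline. */
    static List<String> outline(String xhtml) {
        try {
            OutlineHandler handler = new OutlineHandler();
            // Assumption: the JDK SAX parser is a stand-in for a Tika parser here.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return handler.lines;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><div><h1>"
                + "<a href=\"http://www.google.com\">GOOGLE!</a>"
                + "</h1></div></body></html>";
        for (String line : outline(xhtml)) {
            System.out.println(line);
        }
    }
}
```

Running a handler like this against Tika's output is a quick way to
see exactly which structure the parser emits for a given document,
as opposed to the structure in the original markup.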

Cheers,
Chris

------------------------
Chris Mattmann
[email protected]




-----Original Message-----
From: Devaraja Swami <[email protected]>
Reply-To: <[email protected]>
Date: Monday, September 8, 2014 7:12 PM
To: <[email protected]>
Subject: HTML parsing error with <a> tag inside <h1> tag

>In the following HTML document, the <a> is inside the <h1> tag which is
>inside the <p> tag:
>-------------------
><!DOCTYPE html>
><html>
><body>
>       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
></body>
></html>
>-------------------
>But when I parse it with Tika 1.5 HtmlParser,
>it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.
>
>The same error happens when I replace the <h1> tag with other header tags
><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>[Haven't experimented with other replacements].
>
>This seems to be a basic issue.
>Any help would be deeply appreciated.
>
>Cheers,
>Devarajan
>
>

