Hi Chris,

Thanks for your reply.

To add more clarity to my original post: I expect the Tika 1.5 HtmlParser
to parse any HTML input source and pass the tags to the downstream
(user-supplied) SAX content handler in the same order in which they appear
in the HTML source.
This is not happening currently.

For my HTML source, the upstream Tika parser (HtmlParser), which I call
through the Tika API, sends the end tag [ endElement() ] of the enclosing
<h1> tag to my downstream content handler before it sends the start tag
[ startElement() ] of the enclosed <a> tag.
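
To make the ordering concrete, here is a minimal sketch of a handler that
records the element events in the order they arrive. It is only an
illustration under stated assumptions: the class names SaxEventOrder and
RecordingHandler are hypothetical, and the JDK's own SAX parser stands in for
Tika so that the snippet runs without extra dependencies. With Tika, the same
handler would presumably be passed as the ContentHandler argument of
HtmlParser.parse(stream, handler, metadata, context).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventOrder {
    /** Hypothetical helper: records element events in arrival order. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            events.add("start " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end " + qName);
        }
    }

    /** Parses well-formed markup with the JDK SAX parser (a stand-in for Tika). */
    static List<String> events(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

For correctly nested output, "start a" and "end a" should both appear between
"start h1" and "end h1". What I observe from HtmlParser instead is "end h1"
arriving before "start a".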

IMHO, this is a clear, and quite serious, upstream parsing error.

If possible, could you please shed some light on this, or explain how I can
work around it?
If necessary, I can open a JIRA issue for this.

Thanks,
Devarajan


On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann <[email protected]>
wrote:

> Hi Devarajan,
>
> Please see Chapter 5 of the Tika in Action book for more
> detail on this. The short answer is that the parsed XHTML
> representation of *any* upstream file does not necessarily
> correspond to the upstream (X)HTML representation of the
> file. The XHTML is an intermediate format that Tika uses
> to represent the parsed structure content around the text.
> That is, if you have the following scenario:
>
> PDF->XHTML->content handlers
> XHTML->XHTML->content handlers
> Word Doc->XHTML->content handlers
> Image->XHTML->content handlers
> ..
> etc
>
> Note that the XHTML intermediate is the structured representation
> of the information around the text in the document (including
> its metadata). That XHTML is then passed into org.xml.sax.ContentHandlers
> for stream-based processing downstream.
>
> Cheers,
> Chris
>
> ------------------------
> Chris Mattmann
> [email protected]
>
>
>
>
> -----Original Message-----
> From: Devaraja Swami <[email protected]>
> Reply-To: <[email protected]>
> Date: Monday, September 8, 2014 7:12 PM
> To: <[email protected]>
> Subject: HTML parsing error with <a> tag inside <h1> tag
>
> >In the following HTML document, the <a> is inside the <h1> tag which is
> >inside the <p> tag:
> >-------------------
> ><!DOCTYPE html>
> ><html>
> ><body>
> >       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
> ></body>
> ></html>
> >-------------------
> >But when I parse it with Tika 1.5 HtmlParser,
> >it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.
> >
> >The same error happens when I replace the <h1> tag with other header tags
> ><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
> >[Haven't experimented with other replacements].
> >
> >This seems to be a basic issue.
> >Any help would be deeply appreciated.
> >
> >Cheers,
> >Devarajan
> >
> >
>
>
>
