More trace data: This is the sequence of startElement and endElement calls
from the Tika 1.5 HtmlParser to my downstream content handler:
---------------------------------------------------------------------------------------------
STARTED TIKA PARSING

START ELEMENT <html> <http://www.w3.org/1999/xhtml> <html>
START ELEMENT <head> <http://www.w3.org/1999/xhtml> <head>
START ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
ELEMENT ATTRIBUTES <meta>:  <[content] ---> [ISO-8859-1], [name] ---> [Content-Encoding]>
END ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
START ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
ELEMENT ATTRIBUTES <meta>:  <[content] ---> [text/html; charset=ISO-8859-1], [name] ---> [Content-Type]>
END ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
START ELEMENT <title> <http://www.w3.org/1999/xhtml> <title>
END ELEMENT <title> <http://www.w3.org/1999/xhtml> <title>
END ELEMENT <head> <http://www.w3.org/1999/xhtml> <head>
START ELEMENT <body> <http://www.w3.org/1999/xhtml> <body>
START ELEMENT <address> <http://www.w3.org/1999/xhtml> <address>
START ELEMENT <cite> <http://www.w3.org/1999/xhtml> <cite>
START ELEMENT <h1> <http://www.w3.org/1999/xhtml> <h1>
END ELEMENT <h1> <http://www.w3.org/1999/xhtml> <h1>
START ELEMENT <a> <http://www.w3.org/1999/xhtml> <a>
ELEMENT ATTRIBUTES <a>:  <[href] ---> [http://www.google.com], [shape] ---> [rect]>
TEXT <GOOGLE!>
END ELEMENT <a> <http://www.w3.org/1999/xhtml> <a>
END ELEMENT <cite> <http://www.w3.org/1999/xhtml> <cite>
END ELEMENT <address> <http://www.w3.org/1999/xhtml> <address>
END ELEMENT <body> <http://www.w3.org/1999/xhtml> <body>
END ELEMENT <html> <http://www.w3.org/1999/xhtml> <html>

COMPLETED TIKA PARSING
---------------------------------------------------------------------------------------------

[I skipped traces for calls to characters(...) that pass along pure
whitespace.]
[It also looks like Tika (or TagSoup) is adding a <head> and two <meta> tags.]
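
For anyone who wants to produce a comparable trace, a handler along the
following lines should do it. This is a minimal sketch and not the exact
handler used for the trace above: the class name TraceHandler, the trace()
helper, and the output format are assumptions made for illustration. To keep
it self-contained it is driven by the JDK's own SAX parser on well-formed
input; with Tika, the same handler object would be passed to the parser's
parse(...) call instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical tracing handler: records each SAX event as one line of text.
class TraceHandler extends DefaultHandler {
    private final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        lines.add("START ELEMENT <" + qName + ">");
        for (int i = 0; i < atts.getLength(); i++) {
            lines.add("ATTRIBUTE [" + atts.getQName(i) + "] ---> [" + atts.getValue(i) + "]");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        lines.add("END ELEMENT <" + qName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        // Skip pure-whitespace chunks, as in the trace above.
        if (!text.trim().isEmpty()) {
            lines.add("TEXT <" + text + ">");
        }
    }

    List<String> lines() {
        return lines;
    }

    // Assumed helper: parses well-formed XML with the JDK SAX parser and
    // returns the recorded event lines in the order they were received.
    static List<String> trace(String xml) throws Exception {
        TraceHandler handler = new TraceHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines();
    }

    public static void main(String[] args) throws Exception {
        trace("<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>")
                .forEach(System.out::println);
    }
}
```

Comparing the position of the START/END lines for <h1> and <a> in the
recorded list is enough to see whether the nesting of the source was
preserved by whatever parser feeds the handler.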

Hope this makes the problem clearer.

Cheers,
Devarajan


On Mon, Sep 8, 2014 at 9:29 PM, Devaraja Swami <[email protected]>
wrote:

> Hi Chris,
>
> Thanks for your reply.
>
> To add more clarity to my original post, I expect that the Tika 1.5
> HtmlParser should parse any HTML input source and pass along the tags in
> the order appearing in the HTML source correctly to the downstream (user
> supplied) SAX content handler.
> This is not happening currently.
>
> For my HTML source, the Tika upstream parser (HtmlParser) that I call
> using the Tika API is sending the end tag [ endElement() ] of the enclosing
> <h1> tag to the (my) downstream content handler before it sends along the
> start tag [ startElement() ] of the enclosed <a> tag.
>
> IMHO, this is a clear, and quite serious, upstream parsing error.
>
> If possible, could you please shed some light on this, or explain how I
> can overcome this?
> If necessary, I can add a JIRA on this.
>
> Thanks,
> Devarajan
>
>
> On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann <[email protected]>
> wrote:
>
>> Hi Devarajan,
>>
>> Please see Chapter 5 of the Tika in Action book for more
>> detail on this. The short answer is that the parsed XHTML
>> representation of *any* upstream file does not necessarily
>> correspond to the upstream (X)HTML representation of the
>> file. The XHTML is an intermediate format that Tika uses
>> to represent the parsed structural content around the text.
>> That is, if you have the following scenario:
>>
>> PDF->XHTML->content handlers
>> XHTML->XHTML->content handlers
>> Word Doc->XHTML->content handlers
>> Image->XHTML->content handlers
>> ..
>> etc
>>
>> Note that XHTML intermediate is the structured representation
>> of the information around the text in the document (including
>> its metadata). That XHTML is then passed into org.xml.sax.ContentHandler
>> instances for stream-based processing downstream.
>>
>> Cheers,
>> Chris
>>
>> ------------------------
>> Chris Mattmann
>> [email protected]
>>
>>
>>
>>
>> -----Original Message-----
>> From: Devaraja Swami <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Monday, September 8, 2014 7:12 PM
>> To: <[email protected]>
>> Subject: HTML parsing error with <a> tag inside <h1> tag
>>
>> >In the following HTML document, the <a> is inside the <h1> tag which is
>> >inside the <p> tag:
>> >-------------------
>> ><!DOCTYPE html>
>> ><html>
>> ><body>
>> >       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
>> ></body>
>> ></html>
>> >-------------------
>> >But when I parse it with Tika 1.5 HtmlParser,
>> >it adds both the <a> and <h1> tag nodes as direct children of the <p>
>> >tag.
>> >
>> >The same error happens when I replace the <h1> tag with other header tags
>> ><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>> >[Haven't experimented with other replacements].
>> >
>> >This seems to be a basic issue.
>> >Any help would be deeply appreciated.
>> >
>> >Cheers,
>> >Devarajan
>> >
>> >
>>
>>
>>
>
