Hi Devarajan,

You are correct that the issue is with TagSoup, which assumes you can't have 
an anchor (<a>) element inside a heading (<h1>, <h2>, etc.) element.

And yes, TagSoup is missing support for HTML5 tags; I believe Markus Jelsma was 
trying to get that fixed, but I don't think he's had much luck with the TagSoup 
author.

Originally Tika did use NekoHTML, but that was replaced by TagSoup back in 
2009. See https://issues.apache.org/jira/browse/TIKA-310 for details.

I haven't looked at how hard it would be to make the HTML parser pluggable. 
Patches welcome :)

Though as a first cut, you could create your own parser that clones the current 
HTML support and does a hard-coded replacement, for testing purposes.

One final point - since Tika tries to guarantee XHTML 1.0-compliant output, you 
cannot assume that whatever you put into Tika will give you a corresponding DOM.
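
If you want to check the order of events a given parser actually emits, a 
small recording handler helps. Below is a minimal sketch that uses only the 
JDK's SAX parser as a stand-in for the HTML parser under test. The class name 
EventOrderRecorder and its record() method are made up for illustration and 
are not part of Tika or TagSoup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records SAX element events
// in the order the upstream parser delivers them.
final class EventOrderRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    // Parses well-formed XML with the JDK's default SAX parser (assumption:
    // a stand-in for the real HTML parser) and returns the recorded events.
    static List<String> record(String xml) {
        EventOrderRecorder handler = new EventOrderRecorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        String xml = "<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>";
        // Prints one event per line, in arrival order.
        record(xml).forEach(System.out::println);
    }
}
```

With a parser that preserves nesting, "start a" arrives before "end h1". The 
problem described in this thread would show up as "end h1" arriving first. 
The same handler could in principle be passed to Tika's HtmlParser, though 
that is untested here.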

-- Ken

> From: Devaraja Swami
> Sent: September 8, 2014 10:26:24pm PDT
> To: [email protected]
> Subject: Re: HTML parsing error with <a> tag inside <h1> tag
> 
> That indeed is my, IMHO quite reasonable, expectation!
> 
> In many content analysis applications, like mine, the resulting DOM structure 
> is the objective of parsing, not merely a dump of the textual content. 
> In these applications, the DOM structure is generated by a user-provided 
> content handler which accepts the stream of SAX content handler calls from 
> the Tika parser. As an example [though not my application of interest], this 
> is exactly how Nutch uses Tika, where the Nutch-provided content handler is 
> called DOMBuilder. 
> 
> [Since you were a contributor to Nutch before spinning off Tika, I am sure 
> you can understand its importance :-) ]
> 
> To shed further light on the problem, and to lighten your defensive concern, 
> I don't believe Tika source code is jumbling the order of the tags. 
> I think it is your upstream parser - TagSoup to be precise:
> 
> I just ran the same file directly through the latest TagSoup and the latest 
> NekoHTML.
> The former causes the order jumbling above, where the latter faithfully 
> forwards the incoming tag order.
> In fact, TagSoup has other problems, like lack of handling of HTML5 tags, for 
> which I had to develop workarounds using custom HTML schema class (similar to 
> the workaround you posted some time ago).
> 
> So my second question is, is it possible for you to alter Tika so that the 
> user can specify at runtime the present raw HTML parser (TagSoup or NekoHTML) 
> to the Tika HtmlParser, and bundle both options in the Tika dependencies? 
> Failing this, I have to create an internal hack of the Tika HtmlParser to use 
> NekoHTML instead of TagSoup. 
> 
> My concern that Tika should indeed guarantee the faithful forwarding of the 
> incoming order of tags and text [just like the contract for any SAX compliant 
> parser] still holds though...
> 
> Cheers,
> Devarajan
> 
> 
> 
> On Mon, Sep 8, 2014 at 10:01 PM, Mattmann, Chris A (3980) 
> <[email protected]> wrote:
> Thanks Devarajan.
> 
> I think your expectation below is not the way that Tika handles
> parsing. I don't believe Tika guarantees taking in an XHTML file and
> parsing it into Tika's intermediate XHTML structure the same way that
> the XHTML file came in (i.e., with the tags in the same order).
> 
> Is that your expectation?
> 
> Cheers,
> Chris
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Devaraja Swami <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, September 8, 2014 9:29 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: HTML parsing error with <a> tag inside <h1> tag
> 
> >Hi Chris,
> >
> >
> >Thanks for your reply.
> >
> >
> >To add more clarity to my original post, I expect that the Tika 1.5
> >HtmlParser should parse any HTML input source and pass along the tags in
> >the order appearing in the HTML source correctly to the downstream (user
> >supplied) SAX content handler.
> >
> >This is not happening currently.
> >
> >
> >
> >For my HTML source, the Tika upstream parser (HtmlParser) that I call
> >using the Tika API is sending the end tag [ endElement() ] of the
> >enclosing <h1> tag to the (my) downstream content handler before it sends
> >along the start tag [ startElement() ] of the enclosed <a> tag.
> >
> >
> >IMHO, this is a clear, and quite serious, upstream parsing error.
> >
> >
> >
> >If possible, could you please shed some light on this, or explain how I
> >can overcome this?
> >If necessary, I can add a JIRA on this.
> >
> >
> >
> >Thanks,
> >Devarajan
> >
> >
> >
> >
> >On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
> ><[email protected]> wrote:
> >
> >Hi Devarajan,
> >
> >Please see Chapter 5 of the Tika in Action book for more
> >detail on this. The short answer is that the parsed XHTML
> >representation of *any* upstream file does not necessarily
> >correspond to the upstream (X)HTML representation of the
> >file. The XHTML is an intermediate format that Tika uses
> >to represent the parsed structure content around the text.
> >That is, if you have the following scenario:
> >
> >PDF->XHTML->content handlers
> >XHTML->XHTML->content handlers
> >Word Doc->XHTML->content handlers
> >Image->XHTML->content handlers
> >..
> >etc
> >
> >Note that XHTML intermediate is the structured representation
> >of the information around the text in the document (including
> >its metadata). That XHTML is then passed into org.xml.sax.ContentHandlers
> >for stream-based processing downstream.
> >
> >Cheers,
> >Chris
> >
> >------------------------
> >Chris Mattmann
> >[email protected]
> >
> >
> >
> >
> >-----Original Message-----
> >From: Devaraja Swami <[email protected]>
> >Reply-To: <[email protected]>
> >Date: Monday, September 8, 2014 7:12 PM
> >To: <[email protected]>
> >Subject: HTML parsing error with <a> tag inside <h1> tag
> >
> >>In the following HTML document, the <a> is inside the <h1> tag which is
> >>inside the <p> tag:
> >>-------------------
> >><!DOCTYPE html>
> >><html>
> >><body>
> >>       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
> >></body>
> >></html>
> >>-------------------
> >>But when I parse it with Tika 1.5 HtmlParser,
> >>it adds both the <a> and <h1> tag nodes as direct children of the <p>
> >>tag.
> >>
> >>The same error happens when I replace the <h1> tag with other header tags
> >><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
> >>[Haven't experimented with other replacements].
> >>
> >>This seems to be a basic issue.
> >>Any help would be deeply appreciated.
> >>
> >>Cheers,
> >>Devarajan
> >>
> >>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> 
> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






