Hi Devarajan,

You are correct that the issue is with TagSoup, which assumes you can't have an anchor (<a>) element inside a header (<h1>, <h2>, etc.) element.
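If you want to confirm the ordering problem from your own content handler's point of view, something like the sketch below might help. It is only an illustration under some assumptions: it uses Python's built-in xml.sax (whose ContentHandler mirrors the Java SAX interface) rather than Tika or TagSoup, and the names EventRecorder, record_events and is_nested_inside are made up for this example. It records the start/end events a handler receives and checks whether <a> only ever starts while an <h1> is still open.

```python
import xml.sax
from collections import Counter


class EventRecorder(xml.sax.ContentHandler):
    """Records (kind, element name) for every start/end event, in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def record_events(markup: bytes):
    """Parse well-formed markup and return the recorded event list."""
    handler = EventRecorder()
    xml.sax.parseString(markup, handler)
    return handler.events


def is_nested_inside(events, inner, outer):
    """True if `inner` starts at least once, and only while `outer` is open."""
    open_count = Counter()
    seen_inner = False
    for kind, name in events:
        if kind == "start":
            if name == inner:
                if open_count[outer] == 0:
                    return False
                seen_inner = True
            open_count[name] += 1
        else:
            open_count[name] -= 1
    return seen_inner


# The markup from the original post, minus the DOCTYPE.
SAMPLE = (b'<html><body><div><h1><a href="http://www.google.com">GOOGLE!</a>'
          b'</h1></div></body></html>')

# Hand-written event order matching the behaviour described in this thread
# (end of <h1> delivered before start of <a>); NOT captured from a real
# TagSoup run.
JUMBLED = [("start", "div"), ("start", "h1"), ("end", "h1"),
           ("start", "a"), ("end", "a"), ("end", "div")]
```

The same idea should carry over to Java: record the events in the ContentHandler you hand to the parser, then run the nesting check over what it actually received.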
And yes, TagSoup is missing support for HTML5 tags; I believe Markus Jelsma was trying to get that fixed, but I don't think he's had much luck with the TagSoup author.

Originally Tika did use NekoHTML, but that was replaced by TagSoup back in 2009. See https://issues.apache.org/jira/browse/TIKA-310 for details.

I haven't looked at how hard it would be to make the HTML parser pluggable. Patches welcome :)

Though as a first cut, you could create your own parser that clones the current HTML support and does a hard-coded replacement of TagSoup with NekoHTML, for testing purposes.

One final point - since Tika tries to guarantee XHTML 1.0-compliant output, you cannot assume that whatever you put into Tika will give you a corresponding DOM.

-- Ken

> From: Devaraja Swami
> Sent: September 8, 2014 10:26:24pm PDT
> To: [email protected]
> Subject: Re: HTML parsing error with <a> tag inside <h1> tag
>
> That indeed is my, IMHO quite reasonable, expectation!
>
> In many content analysis applications, like mine, the resulting DOM structure
> is the objective of parsing, not merely a dump of the textual content.
> In these applications, the DOM structure is generated by a user-provided
> content handler which accepts the stream of SAX content handler calls from
> the Tika parser. As an example [though not my application of interest], this
> is exactly how Nutch uses Tika, where the Nutch-provided content handler is
> called DOMBuilder.
>
> [Since you were a contributor to Nutch before spinning off Tika, I am sure
> you can understand its importance :-) ]
>
> To shed further light on the problem, and to lighten your defensive concern,
> I don't believe Tika source code is jumbling the order of the tags.
> I think it is your upstream parser - TagSoup to be precise:
>
> I just ran the same file directly through the latest TagSoup and the latest
> NekoHTML.
> The former causes the order jumbling above, whereas the latter faithfully
> forwards the incoming tag order.
> In fact, TagSoup has other problems, like lack of handling of HTML5 tags, for
> which I had to develop workarounds using a custom HTML schema class (similar to
> the workaround you posted some time ago).
>
> So my second question is, is it possible for you to alter Tika so that the
> user can specify at runtime the present raw HTML parser (TagSoup or NekoHTML)
> to the Tika HtmlParser, and bundle both options in the Tika dependencies?
> Failing this, I have to create an internal hack of the Tika HtmlParser to use
> NekoHTML instead of TagSoup.
>
> My concern that Tika should indeed guarantee the faithful forwarding of the
> incoming order of tags and text [just like the contract for any SAX compliant
> parser] still holds though...
>
> Cheers,
> Devarajan
>
> On Mon, Sep 8, 2014 at 10:01 PM, Mattmann, Chris A (3980)
> <[email protected]> wrote:
> Thanks Devarajan.
>
> I think your expectation below is not the way that Tika handles
> parsing. I don't believe Tika guarantees taking in an XHTML file and
> parsing it into Tika's intermediate XHTML structure the same way that
> the XHTML file came in (i.e., with the tags in the same order).
>
> Is that your expectation?
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Devaraja Swami <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, September 8, 2014 9:29 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: HTML parsing error with <a> tag inside <h1> tag
>
> >Hi Chris,
> >
> >Thanks for your reply.
> >
> >To add more clarity to my original post, I expect that the Tika 1.5
> >HtmlParser should parse any HTML input source and pass along the tags in
> >the order appearing in the HTML source correctly to the downstream (user
> >supplied) SAX content handler.
> >
> >This is not happening currently.
> >
> >For my HTML source, the Tika upstream parser (HtmlParser) that I call
> >using the Tika API is sending the end tag [ endElement() ] of the
> >enclosing <h1> tag to the (my) downstream content handler before it sends
> >along the start tag [ startElement() ] of the enclosed <a> tag.
> >
> >IMHO, this is a clear, and quite serious, upstream parsing error.
> >
> >If possible, could you please shed some light on this, or explain how I
> >can overcome this?
> >If necessary, I can add a JIRA on this.
> >
> >Thanks,
> >Devarajan
> >
> >On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
> ><[email protected]> wrote:
> >
> >Hi Devarajan,
> >
> >Please see Chapter 5 of the Tika in Action book for more
> >detail on this.
> >The short answer is that the parsed XHTML
> >representation of *any* upstream file does not necessarily
> >correspond to the upstream (X)HTML representation of the
> >file. The XHTML is an intermediate format that Tika uses
> >to represent the parsed structure content around the text.
> >That is, if you have the following scenario:
> >
> >PDF->XHTML->content handlers
> >XHTML->XHTML->content handlers
> >Word Doc->XHTML->content handlers
> >Image->XHTML->content handlers
> >..
> >etc
> >
> >Note that XHTML intermediate is the structured representation
> >of the information around the text in the document (including
> >its metadata). That XHTML is then passed into org.xml.sax.ContentHandlers
> >for stream-based processing downstream.
> >
> >Cheers,
> >Chris
> >
> >------------------------
> >Chris Mattmann
> >[email protected]
> >
> >-----Original Message-----
> >From: Devaraja Swami <[email protected]>
> >Reply-To: <[email protected]>
> >Date: Monday, September 8, 2014 7:12 PM
> >To: <[email protected]>
> >Subject: HTML parsing error with <a> tag inside <h1> tag
> >
> >>In the following HTML document, the <a> is inside the <h1> tag which is
> >>inside the <p> tag:
> >>-------------------
> >><!DOCTYPE html>
> >><html>
> >><body>
> >> <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
> >></body>
> >></html>
> >>-------------------
> >>But when I parse it with Tika 1.5 HtmlParser,
> >>it adds both the <a> and <h1> tag nodes as direct children of the <p>
> >>tag.
> >>
> >>The same error happens when I replace the <h1> tag with other header tags
> >><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
> >>[Haven't experimented with other replacements].
> >>
> >>This seems to be a basic issue.
> >>Any help would be deeply appreciated.
> >>
> >>Cheers,
> >>Devarajan

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
