Devarajan,
Ken's answer provides some more detail, so please check that out.

Again, I am not sure you are understanding what I'm saying. You are
comparing Tika to a SAX compliant parser. Tika is much more than this.
This isn't me being "defensive" as you put it below, it's me trying to
share the philosophy behind Tika with you.

At the end of the day it seems you are interested in a SAX compliant
parser. You have a few options there:

1. Use TagSoup and/or NekoHTML and/or <<insert HTML parsing library here>>
directly if you need SAX compliant HTML parsing with your concerns about
preserving upstream DOM, etc.

2. Roll your own Parser and add it to Tika through the Java SPI, like the
rest of the Tika Parsers are defined. Declare that it supports the (X)HTML
MIME type.

Cheers,
Chris

-----Original Message-----
From: Devaraja Swami <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, September 8, 2014 10:26 PM
To: "[email protected]" <[email protected]>
Subject: Re: HTML parsing error with <a> tag inside <h1> tag

>That indeed is my, IMHO quite reasonable, expectation!
>
>In many content analysis applications, like mine, the resulting DOM
>structure is the objective of parsing, not merely a dump of the textual
>content.
>In these applications, the DOM structure is generated by a user-provided
>content handler which accepts the stream of SAX content handler calls
>from the Tika parser. As an example [though not my application of
>interest], this is exactly how Nutch uses Tika, where the Nutch-provided
>content handler is called DOMBuilder.
>
>[Since you were a contributor to Nutch before spinning off Tika, I am
>sure you can understand its importance :-) ]
>
>To shed further light on the problem, and to lighten your defensive
>concern, I don't believe Tika source code is jumbling the order of the
>tags.
>I think it is your upstream parser - TagSoup to be precise:
>
>I just ran the same file directly through the latest TagSoup and the
>latest NekoHTML.
>The former causes the order jumbling above, whereas the latter faithfully
>forwards the incoming tag order.
>In fact, TagSoup has other problems, like lack of handling of HTML5 tags,
>for which I had to develop workarounds using a custom HTML schema class
>(similar to the workaround you posted some time ago).
>
>So my second question is, is it possible for you to alter Tika so that
>the user can specify at runtime the raw HTML parser (TagSoup or
>NekoHTML) used by the Tika HtmlParser, and bundle both options in the Tika
>dependencies? Failing this, I have to create an internal hack of the Tika
>HtmlParser to use NekoHTML instead of TagSoup.
>
>My concern that Tika should indeed guarantee the faithful forwarding of
>the incoming order of tags and text [just like the contract for any SAX
>compliant parser] still holds though...
>
>Cheers,
>Devarajan
>
>On Mon, Sep 8, 2014 at 10:01 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Thanks Devarajan.
>
>I think your expectation below is not the way that Tika handles
>parsing. I don't believe Tika guarantees taking in an XHTML file and
>parsing it into Tika's intermediate XHTML structure the same way that
>the XHTML file came in (i.e., with the tags in the same order).
>
>Is that your expectation?
>
>Cheers,
>Chris
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: [email protected]
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>-----Original Message-----
>From: Devaraja Swami <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Monday, September 8, 2014 9:29 PM
>To: "[email protected]" <[email protected]>
>Subject: Re: HTML parsing error with <a> tag inside <h1> tag
>
>>Hi Chris,
>>
>>Thanks for your reply.
>>
>>To add more clarity to my original post, I expect that the Tika 1.5
>>HtmlParser should parse any HTML input source and pass along the tags in
>>the order appearing in the HTML source correctly to the downstream (user
>>supplied) SAX content handler.
>>
>>This is not happening currently.
>>
>>For my HTML source, the Tika upstream parser (HtmlParser) that I call
>>using the Tika API is sending the end tag [ endElement() ] of the
>>enclosing <h1> tag to the (my) downstream content handler before it sends
>>along the start tag [ startElement() ] of the enclosed <a> tag.
>>
>>IMHO, this is a clear, and quite serious, upstream parsing error.
>>
>>If possible, could you please shed some light on this, or explain how I
>>can overcome this?
>>If necessary, I can add a JIRA on this.
>>
>>Thanks,
>>Devarajan
>>
>>On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
>><[email protected]> wrote:
>>
>>Hi Devarajan,
>>
>>Please see Chapter 5 of the Tika in Action book for more
>>detail on this.
>>The short answer is that the parsed XHTML
>>representation of *any* upstream file does not necessarily
>>correspond to the upstream (X)HTML representation of the
>>file. The XHTML is an intermediate format that Tika uses
>>to represent the parsed structural content around the text.
>>That is, if you have the following scenario:
>>
>>PDF->XHTML->content handlers
>>XHTML->XHTML->content handlers
>>Word Doc->XHTML->content handlers
>>Image->XHTML->content handlers
>>..
>>etc
>>
>>Note that the XHTML intermediate is the structured representation
>>of the information around the text in the document (including
>>its metadata). That XHTML is then passed into org.xml.sax.ContentHandler
>>instances for stream-based processing downstream.
>>
>>Cheers,
>>Chris
>>
>>------------------------
>>Chris Mattmann
>>[email protected]
>>
>>-----Original Message-----
>>From: Devaraja Swami <[email protected]>
>>Reply-To: <[email protected]>
>>Date: Monday, September 8, 2014 7:12 PM
>>To: <[email protected]>
>>Subject: HTML parsing error with <a> tag inside <h1> tag
>>
>>>In the following HTML document, the <a> is inside the <h1> tag which is
>>>inside the <p> tag:
>>>-------------------
>>><!DOCTYPE html>
>>><html>
>>><body>
>>> <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
>>></body>
>>></html>
>>>-------------------
>>>But when I parse it with Tika 1.5 HtmlParser,
>>>it adds both the <a> and <h1> tag nodes as direct children of the <p>
>>>tag.
>>>
>>>The same error happens when I replace the <h1> tag with other header
>>>tags <h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>>>[Haven't experimented with other replacements].
>>>
>>>This seems to be a basic issue.
>>>Any help would be deeply appreciated.
>>>
>>>Cheers,
>>>Devarajan
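
For anyone following the thread: the ordering expectation discussed above (the startElement() of the enclosed <a> arriving before the endElement() of the enclosing <h1>) can be checked by recording the events a content handler receives. What follows is only a minimal sketch under stated assumptions. It uses the JDK's built-in SAX parser rather than Tika, TagSoup, or NekoHTML, so it only accepts well-formed input, and the class name SaxEventOrder and its record method are hypothetical names made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Tika, TagSoup, or NekoHTML.
final class SaxEventOrder {

    // Parses well-formed XML with the JDK SAX parser and returns the
    // element events in the order the handler received them.
    static List<String> record(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // Same nesting as the sample in the original post, minus the DOCTYPE.
        String sample = "<html><body><div><h1>"
                + "<a href=\"http://www.google.com\">GOOGLE!</a>"
                + "</h1></div></body></html>";
        for (String event : record(sample)) {
            System.out.println(event);
        }
    }
}
```

The anonymous handler is an ordinary org.xml.sax.ContentHandler, so the same recording approach could presumably be pointed at any parser that emits SAX events in order to compare the recorded sequence against the nesting in the source document.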
