Thanks Devarajan. I don't think your expectation below matches the way Tika handles parsing. As far as I know, Tika does not guarantee that an XHTML file is parsed into Tika's intermediate XHTML structure exactly as it came in (i.e., with the tags in the same order).
Is that your expectation?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Devaraja Swami <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, September 8, 2014 9:29 PM
To: "[email protected]" <[email protected]>
Subject: Re: HTML parsing error with <a> tag inside <h1> tag

>Hi Chris,
>
>Thanks for your reply.
>
>To add more clarity to my original post, I expect that the Tika 1.5
>HtmlParser should parse any HTML input source and pass along the tags in
>the order appearing in the HTML source correctly to the downstream (user
>supplied) SAX content handler.
>
>This is not happening currently.
>
>For my HTML source, the Tika upstream parser (HtmlParser) that I call
>using the Tika API is sending the end tag [ endElement() ] of the
>enclosing <h1> tag to the (my) downstream content handler before it sends
>along the start tag [ startElement() ] of the enclosed <a> tag.
>
>IMHO, this is a clear, and quite serious, upstream parsing error.
>
>If possible, could you please shed some light on this, or explain how I
>can overcome this?
>If necessary, I can add a JIRA on this.
>
>Thanks,
>Devarajan
>
>On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
><[email protected]> wrote:
>
>Hi Devarajan,
>
>Please see Chapter 5 of the Tika in Action book for more
>detail on this. The short answer is that the parsed XHTML
>representation of *any* upstream file does not necessarily
>correspond to the upstream (X)HTML representation of the
>file. The XHTML is an intermediate format that Tika uses
>to represent the parsed structure content around the text.
>That is, if you have the following scenario:
>
>PDF->XHTML->content handlers
>XHTML->XHTML->content handlers
>Word Doc->XHTML->content handlers
>Image->XHTML->content handlers
>..
>etc
>
>Note that XHTML intermediate is the structured representation
>of the information around the text in the document (including
>its metadata). That XHTML is then passed into org.xml.sax.ContentHandler
>implementations for stream-based processing downstream.
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>[email protected]
>
>-----Original Message-----
>From: Devaraja Swami <[email protected]>
>Reply-To: <[email protected]>
>Date: Monday, September 8, 2014 7:12 PM
>To: <[email protected]>
>Subject: HTML parsing error with <a> tag inside <h1> tag
>
>>In the following HTML document, the <a> is inside the <h1> tag which is
>>inside the <p> tag:
>>-------------------
>><!DOCTYPE html>
>><html>
>><body>
>>    <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
>></body>
>></html>
>>-------------------
>>But when I parse it with Tika 1.5 HtmlParser,
>>it adds both the <a> and <h1> tag nodes as direct children of the <p>
>>tag.
>>
>>The same error happens when I replace the <h1> tag with other header tags
>><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>>[Haven't experimented with other replacements].
>>
>>This seems to be a basic issue.
>>Any help would be deeply appreciated.
>>
>>Cheers,
>>Devarajan
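
For anyone trying to reproduce the event order discussed in this thread, here is a minimal sketch of a downstream SAX content handler that records each startElement()/endElement() call in the order it arrives. Assumptions, stated loudly: it feeds the handler from the JDK's built-in SAX parser on a well-formed fragment, as a stand-in for Tika's HtmlParser, so it only shows the well-nested baseline order and says nothing about what Tika 1.5 actually emits. The names SaxEventOrder, RecordingHandler and record are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: records SAX element events in arrival order.
class SaxEventOrder {

    /** Downstream handler that logs the order of start and end element events. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }
    }

    /**
     * Parses a well-formed XHTML fragment with the JDK SAX parser
     * (an assumption standing in for Tika's HtmlParser) and returns the event log.
     */
    static List<String> record(String xhtml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // The fragment from the original post.
        String xhtml = "<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>";
        for (String event : record(xhtml)) {
            System.out.println(event);
        }
    }
}
```

Since RecordingHandler is an ordinary org.xml.sax.ContentHandler, the same instance could presumably be passed to Tika's HtmlParser.parse(...) in place of the JDK parser, and the two event logs compared to see exactly where the end of <h1> and the start of <a> fall.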
