More trace data:

This is the sequence of startElement and endElement calls from the Tika 1.5
HtmlParser to my downstream content handler:
---------------------------------------------------------------------------------------------
STARTED TIKA PARSING
START ELEMENT <html> <http://www.w3.org/1999/xhtml> <html>
START ELEMENT <head> <http://www.w3.org/1999/xhtml> <head>
START ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
ELEMENT ATTRIBUTES <meta>: <[content] ---> [ISO-8859-1], [name] ---> [Content-Encoding]>
END ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
START ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
ELEMENT ATTRIBUTES <meta>: <[content] ---> [text/html; charset=ISO-8859-1], [name] ---> [Content-Type]>
END ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
START ELEMENT <title> <http://www.w3.org/1999/xhtml> <title>
END ELEMENT <title> <http://www.w3.org/1999/xhtml> <title>
END ELEMENT <head> <http://www.w3.org/1999/xhtml> <head>
START ELEMENT <body> <http://www.w3.org/1999/xhtml> <body>
START ELEMENT <address> <http://www.w3.org/1999/xhtml> <address>
START ELEMENT <cite> <http://www.w3.org/1999/xhtml> <cite>
START ELEMENT <h1> <http://www.w3.org/1999/xhtml> <h1>
END ELEMENT <h1> <http://www.w3.org/1999/xhtml> <h1>
START ELEMENT <a> <http://www.w3.org/1999/xhtml> <a>
ELEMENT ATTRIBUTES <a>: <[href] ---> [http://www.google.com], [shape] ---> [rect]>
TEXT <GOOGLE!>
END ELEMENT <a> <http://www.w3.org/1999/xhtml> <a>
END ELEMENT <cite> <http://www.w3.org/1999/xhtml> <cite>
END ELEMENT <address> <http://www.w3.org/1999/xhtml> <address>
END ELEMENT <body> <http://www.w3.org/1999/xhtml> <body>
END ELEMENT <html> <http://www.w3.org/1999/xhtml> <html>
COMPLETED TIKA PARSING
---------------------------------------------------------------------------------------------
[I skipped traces for calls to characters(...) which are passing along pure
whitespace.]
[Also looks like Tika (or TagSoup) is adding a <head> and two <meta> tags.]

Hope this makes the problem clearer.

Cheers,
Devarajan

On Mon, Sep 8, 2014 at 9:29 PM, Devaraja Swami <[email protected]> wrote:

> Hi Chris,
>
> Thanks for your reply.
>
> To add more clarity to my original post, I expect that the Tika 1.5
> HtmlParser should parse any HTML input source and pass along the tags,
> in the order in which they appear in the HTML source, to the downstream
> (user supplied) SAX content handler.
> This is not happening currently.
>
> For my HTML source, the Tika upstream parser (HtmlParser) that I call
> using the Tika API is sending the end tag [ endElement() ] of the enclosing
> <h1> tag to my downstream content handler before it sends along the
> start tag [ startElement() ] of the enclosed <a> tag.
>
> IMHO, this is a clear, and quite serious, upstream parsing error.
>
> If possible, could you please shed some light on this, or explain how I
> can overcome this?
> If necessary, I can open a JIRA issue on this.
>
> Thanks,
> Devarajan
>
>
> On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann <[email protected]>
> wrote:
>
>> Hi Devarajan,
>>
>> Please see Chapter 5 of the Tika in Action book for more
>> detail on this. The short answer is that the parsed XHTML
>> representation of *any* upstream file does not necessarily
>> correspond to the upstream (X)HTML representation of the
>> file. The XHTML is an intermediate format that Tika uses
>> to represent the parsed structure content around the text.
>> That is, if you have the following scenario:
>>
>> PDF->XHTML->content handlers
>> XHTML->XHTML->content handlers
>> Word Doc->XHTML->content handlers
>> Image->XHTML->content handlers
>> ..
>> etc
>>
>> Note that XHTML intermediate is the structured representation
>> of the information around the text in the document (including
>> its metadata). That XHTML is then passed into org.xml.sax.ContentHandler
>> instances for stream-based processing downstream.
>>
>> Cheers,
>> Chris
>>
>> ------------------------
>> Chris Mattmann
>> [email protected]
>>
>>
>> -----Original Message-----
>> From: Devaraja Swami <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Monday, September 8, 2014 7:12 PM
>> To: <[email protected]>
>> Subject: HTML parsing error with <a> tag inside <h1> tag
>>
>> >In the following HTML document, the <a> is inside the <h1> tag which is
>> >inside the <p> tag:
>> >-------------------
>> ><!DOCTYPE html>
>> ><html>
>> ><body>
>> >  <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
>> ></body>
>> ></html>
>> >-------------------
>> >But when I parse it with Tika 1.5 HtmlParser,
>> >it adds both the <a> and <h1> tag nodes as direct children of the <p>
>> >tag.
>> >
>> >The same error happens when I replace the <h1> tag with other header tags
>> ><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>> >[Haven't experimented with other replacements].
>> >
>> >This seems to be a basic issue.
>> >Any help would be deeply appreciated.
>> >
>> >Cheers,
>> >Devarajan
>> >
>>
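
For anyone who wants to check element ordering the way the trace at the top of
this thread does, here is a minimal sketch of a tracing SAX handler. It is not
the handler used above: the names NestingTrace, TraceHandler and trace() are
made up for illustration, and it is driven by the JDK's built-in XML parser on
a well-formed XHTML string rather than by Tika, so it only shows what a
correctly nested event stream looks like (START a arrives before END h1).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: records SAX events with indentation.
class NestingTrace {

    /** Collects start/end/text events, indented by current nesting depth. */
    static class TraceHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        private int depth = 0;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            events.add(indent() + "START " + localName);
            depth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
            events.add(indent() + "END " + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Whitespace-only chunks are skipped, as in the trace above.
            // Note: a parser may deliver one text node in several chunks.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add(indent() + "TEXT " + text);
            }
        }

        private String indent() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            return sb.toString();
        }
    }

    /** Parses well-formed XML/XHTML with the JDK parser and returns the events. */
    static List<String> trace(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();
        TraceHandler handler = new TraceHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>"
                + "</body></html>";
        for (String event : trace(xhtml)) {
            System.out.println(event);
        }
    }
}
```

Since TraceHandler is an ordinary org.xml.sax.ContentHandler, it should also be
possible to pass it as the handler argument of Tika's
parse(InputStream, ContentHandler, Metadata, ParseContext) and compare the two
event streams side by side; I have not run that against Tika 1.5, so treat it
as a suggestion rather than a verified recipe.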
