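Hello - Tika does some HTML mapping under the hood, which is why tags and attributes get dropped, but it is configurable. Tell Tika to use the IdentityHtmlMapper, which keeps all elements and attributes instead of discarding them. I no longer remember which parameter you need; check TikaParser.java, where it is read somewhere near the bottom.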
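
As a rough sketch of what the override in conf/nutch-site.xml might look like: the property name tika.htmlmapper.classname is from memory and should be treated as an assumption, so verify it against TikaParser.java or nutch-default.xml before relying on it. The mapper class itself, org.apache.tika.parser.html.IdentityHtmlMapper, ships with Tika.

```xml
<!-- conf/nutch-site.xml: sketch only. The property name is an assumption
     recalled from memory; confirm it in TikaParser.java. -->
<property>
  <name>tika.htmlmapper.classname</name>
  <!-- IdentityHtmlMapper keeps every element and attribute rather than
       mapping them down to Tika's default "safe" subset. -->
  <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
</property>
```

With a setting like this, the div tags should survive into the DOM that your parse filter sees.
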
Markus

-----Original message-----
> From: Felix von Zadow <[email protected]>
> Sent: Friday 30th September 2016 14:04
> To: [email protected]
> Subject: Tika removes tags which I'd prefer to keep.
>
> Hi!
>
> I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML
> parsing. I searched the archive but couldn't find anything.
>
> I would like to use parse-tika for parsing HTML and later indexing to Solr.
> While parsing, tika seems to remove quite a number of HTML tags and
> attributes. While this does not really affect the text content that is later
> indexed, it prevents me from using a parse filter to extract certain
> information based on the existence of certain div-tags. I'm by the way
> crawling a set of pages that I have control over.
>
> So my question is: is there a configuration option (or some other way) to
> control how the tika parser will transform the document?
>
> Thanks a bunch!
> Felix

