Markus, thanks for the pointer. So basically all I need to do is uncomment line 117 in TikaParser.java [1] (Nutch 2.3.1), and then I'm done?
[1] https://github.com/apache/nutch/blob/branch-2.3.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L117

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Friday, 30 September 2016 14:15
> To: [email protected]
> Subject: RE: Tika removes tags which I'd prefer to keep.
>
> Hello - Tika does some HTML mapping under the hood, but it is configurable.
> Tell Tika to use the IdentityMapper. I am not sure anymore which param you
> need; check out TikaParser.java, it is somewhere near the bottom.
>
> Markus
>
> > -----Original message-----
> > From: Felix von Zadow <[email protected]>
> > Sent: Friday 30th September 2016 14:04
> > To: [email protected]
> > Subject: Tika removes tags which I'd prefer to keep.
> >
> > Hi!
> >
> > I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML
> > parsing. I searched the archive but couldn't find anything.
> >
> > I would like to use parse-tika for parsing HTML and later indexing to Solr.
> > While parsing, Tika seems to remove quite a number of HTML tags and
> > attributes. While this does not really affect the text content that is
> > later indexed, it prevents me from using a parse filter to extract certain
> > information based on the existence of certain div tags. By the way, I'm
> > crawling a set of pages that I have control over.
> >
> > So my question is: is there a configuration option (or some other way) to
> > control how the Tika parser will transform the document?
> >
> > Thanks a bunch!
> > Felix

