Felix,

I've looked it up. Use the tika.htmlmapper.classname parameter to set IdentityHtmlMapper and you should be fine.

Markus
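
For reference, a minimal sketch of what that setting might look like in conf/nutch-site.xml. It rests on two assumptions: that the TikaParser in your Nutch version actually reads tika.htmlmapper.classname (worth checking in the source, especially on 2.3.1), and that the value should be the fully qualified Tika class name, org.apache.tika.parser.html.IdentityHtmlMapper.

```xml
<!-- Sketch only: assumes parse-tika in your Nutch version reads this property. -->
<property>
  <name>tika.htmlmapper.classname</name>
  <!-- Assumed fully qualified name of Tika's identity mapper, which keeps tags and attributes as they are. -->
  <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
</property>
```

If the property is picked up, a parse filter should then see the div elements and attributes that the default mapping drops.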
-----Original message-----
> From: Felix von Zadow <[email protected]>
> Sent: Friday 30th September 2016 14:43
> To: [email protected]
> Subject: AW: Tika removes tags which I'd prefer to keep.
>
> Markus, thanks for the pointer. So basically all I need to do is uncomment
> line 117 in TikaParser.java [1] (Nutch 2.3.1) and I'm done!?
>
> [1]
> https://github.com/apache/nutch/blob/branch-2.3.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L117
>
> > -----Original message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Friday, 30 September 2016 14:15
> > To: [email protected]
> > Subject: RE: Tika removes tags which I'd prefer to keep.
> >
> > Hello - Tika does some HTML mapping under the hood, but it is configurable.
> > Tell Tika to use the IdentityMapper. I am not sure anymore which param you
> > need, check out TikaParser.java, it is somewhere near the bottom.
> >
> > Markus
> >
> > > -----Original message-----
> > > From: Felix von Zadow <[email protected]>
> > > Sent: Friday 30th September 2016 14:04
> > > To: [email protected]
> > > Subject: Tika removes tags which I'd prefer to keep.
> > >
> > > Hi!
> > >
> > > I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML
> > > parsing. I searched the archive but couldn't find anything.
> > >
> > > I would like to use parse-tika for parsing HTML and later indexing to
> > > Solr. While parsing, Tika seems to remove quite a number of HTML tags and
> > > attributes. While this does not really affect the text content that is
> > > later indexed, it prevents me from using a parse filter to extract certain
> > > information based on the existence of certain div-tags. I'm by the way
> > > crawling a set of pages that I have control over.
> > >
> > > So my question is: is there a configuration option (or some other way) to
> > > control how the Tika parser will transform the document?
> > >
> > > Thanks a bunch!
> > > Felix

