Thanks loads Markus! I must have scrolled past that part of nutch-default.xml a hundred times and somehow even managed to overlook it when searching for "tika".
Not only does Tika no longer chop up my HTML, I now also know about Tika's HtmlMapper and might implement my own if necessary. Thanks again!

Felix

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Friday, 30 September 2016 14:49
> To: [email protected]
> Subject: RE: Tika removes tags which I'd prefer to keep.
>
> Felix, I've looked it up. Use the tika.htmlmapper.classname parameter to set
> IdentityHTMLMapper and you should be fine.
> Markus
>
> > -----Original message-----
> > From: Felix von Zadow <[email protected]>
> > Sent: Friday 30th September 2016 14:43
> > To: [email protected]
> > Subject: RE: Tika removes tags which I'd prefer to keep.
> >
> > Markus, thanks for the pointer. So basically all I need to do is uncomment
> > line 117 in TikaParser.java [1] (Nutch 2.3.1) and I'm done!?
> >
> > [1] https://github.com/apache/nutch/blob/branch-2.3.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L117
> >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Friday, 30 September 2016 14:15
> > > To: [email protected]
> > > Subject: RE: Tika removes tags which I'd prefer to keep.
> > >
> > > Hello - Tika does some HTML mapping under the hood, but it is
> > > configurable. Tell Tika to use the IdentityMapper. I am not sure
> > > anymore which param you need, check out TikaParser.java, it is
> > > somewhere near the bottom.
> > >
> > > Markus
> > >
> > > > -----Original message-----
> > > > From: Felix von Zadow <[email protected]>
> > > > Sent: Friday 30th September 2016 14:04
> > > > To: [email protected]
> > > > Subject: Tika removes tags which I'd prefer to keep.
> > > >
> > > > Hi!
> > > >
> > > > I'm fairly new to Nutch and I'm having a problem with parse-tika
> > > > for HTML parsing. I searched the archive but couldn't find anything.
> > > >
> > > > I would like to use parse-tika for parsing HTML and later indexing to
> > > > Solr. While parsing, tika seems to remove quite a number of HTML tags
> > > > and attributes. While this does not really affect the text content
> > > > that is later indexed, it prevents me from using a parse filter to
> > > > extract certain information based on the existence of certain
> > > > div-tags. I'm by the way crawling a set of pages that I have control
> > > > over.
> > > >
> > > > So my question is: is there a configuration option (or some other
> > > > way) to control how the tika parser will transform the document?
> > > >
> > > > Thanks a bunch!
> > > > Felix
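
For anyone landing on this thread later, the setting Markus describes might look roughly like this. It is a sketch, not something posted in the thread: the property name tika.htmlmapper.classname comes from Markus's mail, while the fully qualified class name (Tika's identity mapper, spelled IdentityHtmlMapper in the Tika sources as far as is known) and the choice of conf/nutch-site.xml as the place to override nutch-default.xml are assumptions to verify against your Nutch and Tika versions.

```xml
<!-- Sketch for conf/nutch-site.xml (assumed location for local overrides). -->
<property>
  <!-- Property name as given in the thread. -->
  <name>tika.htmlmapper.classname</name>
  <!-- Assumed fully qualified name of Tika's identity mapper,
       which passes HTML elements and attributes through unchanged. -->
  <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
</property>
```

With a mapper that keeps elements such as div in the parsed DOM, a parse filter should be able to see them; a custom HtmlMapper class name could presumably be supplied here instead if finer control is needed.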

