Markus, thanks for the pointer. So basically all I need to do is uncomment line
117 in TikaParser.java [1] (Nutch 2.3.1), and I'm done?
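
For the archive, here is a minimal sketch of what I understand the suggestion to
mean, run directly against Tika outside of Nutch. Tika's class is called
IdentityHtmlMapper, which I assume is what "IdentityMapper" refers to. The sketch
assumes the Tika HTML parser is on the classpath; the class name
IdentityMapperSketch and the sample HTML are made up for illustration, and I
have not verified how this corresponds to the code at line 117.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical demo class, not part of Nutch or Tika.
public class IdentityMapperSketch {

    // Parses the HTML with Tika's HtmlParser and returns the XHTML it emits.
    // With keepAllTags=true the identity mapper is registered in the context,
    // so elements and attributes are passed through instead of being filtered.
    static String parse(String html, boolean keepAllTags) throws Exception {
        ParseContext context = new ParseContext();
        if (keepAllTags) {
            context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
        }
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page with a div that a parse filter might look for.
        String html = "<html><body><div class=\"teaser\">Hello</div></body></html>";
        System.out.println("Default mapper:\n" + parse(html, false));
        System.out.println("Identity mapper:\n" + parse(html, true));
    }
}
```

If I read the Tika docs correctly, the second output should still contain the
div and its class attribute, which is what a parse filter would need.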

[1] https://github.com/apache/nutch/blob/branch-2.3.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L117

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Friday, 30 September 2016 14:15
> To: [email protected]
> Subject: RE: Tika removes tags which I'd prefer to keep.
> 
> Hello - Tika does some HTML mapping under the hood, but it is configurable.
> Tell Tika to use the IdentityMapper. I am not sure anymore which param you
> need, check out TikaParser.java, it is somewhere near the bottom.
> 
> Markus
> 
> 
> 
> -----Original message-----
> > From: Felix von Zadow <[email protected]>
> > Sent: Friday 30th September 2016 14:04
> > To: [email protected]
> > Subject: Tika removes tags which I'd prefer to keep.
> >
> >
> > Hi!
> >
> > I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML
> > parsing. I searched the archive but couldn't find anything.
> >
> > I would like to use parse-tika for parsing HTML and later indexing to Solr.
> > While parsing, Tika seems to remove quite a number of HTML tags and
> > attributes. While this does not really affect the text content that is later
> > indexed, it prevents me from using a parse filter to extract certain
> > information based on the existence of certain div tags. By the way, I'm
> > crawling a set of pages that I have control over.
> >
> > So my question is: is there a configuration option (or some other way) to
> > control how the Tika parser will transform the document?
> >
> > Thanks a bunch!
> > Felix
> >
