RE: Tika removes tags which I'd prefer to keep.

Markus Jelsma Fri, 30 Sep 2016 05:49:36 -0700

Felix, I've looked it up. Use the tika.htmlmapper.classname parameter to set 
IdentityHtmlMapper and you should be fine.
Markus
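
For reference, a minimal sketch of how that parameter might be set in
conf/nutch-site.xml. The fully qualified class name is an assumption based on
Tika's org.apache.tika.parser.html.IdentityHtmlMapper, and the property only
has an effect if the parse-tika plugin in your Nutch version actually reads
tika.htmlmapper.classname, so it is worth checking TikaParser.java first.

```xml
<!-- Sketch only: assumes parse-tika reads this property in your Nutch version -->
<property>
  <name>tika.htmlmapper.classname</name>
  <!-- Assumed fully qualified name of Tika's identity mapper -->
  <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
  <description>HtmlMapper implementation that parse-tika hands to Tika.
  The identity mapper passes elements and attributes through unchanged
  instead of reducing them to Tika's default safe subset.</description>
</property>
```

With a setting like this in place, a parse filter should see the div elements
and attributes as they appear in the fetched page.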


 
 
-----Original message-----
> From:Felix von Zadow <[email protected]>
> Sent: Friday 30th September 2016 14:43
> To: [email protected]
> Subject: RE: Tika removes tags which I'd prefer to keep.
> 
> 
> Markus, thanks for the pointer. So basically all I need to do is uncomment 
> line 117 in TikaParser.java [1] (Nutch 2.3.1) and I'm done!?
> 
> [1] https://github.com/apache/nutch/blob/branch-2.3.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L117
> 
> > -----Original message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Friday, 30 September 2016 14:15
> > To: [email protected]
> > Subject: RE: Tika removes tags which I'd prefer to keep.
> > 
> > Hello - Tika does some HTML mapping under the hood, but it is configurable.
> > Tell Tika to use the IdentityHtmlMapper. I am no longer sure which parameter
> > you need; check TikaParser.java, it is somewhere near the bottom.
> > 
> > Markus
> > 
> > 
> > 
> > -----Original message-----
> > > From:Felix von Zadow <[email protected]>
> > > Sent: Friday 30th September 2016 14:04
> > > To: [email protected]
> > > Subject: Tika removes tags which I'd prefer to keep.
> > >
> > >
> > > Hi!
> > >
> > > I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML
> > > parsing. I searched the archive but couldn't find anything.
> > >
> > > I would like to use parse-tika for parsing HTML and later indexing to
> > > Solr. While parsing, Tika seems to remove quite a number of HTML tags and
> > > attributes. While this does not really affect the text content that is
> > > later indexed, it prevents me from using a parse filter to extract certain
> > > information based on the existence of certain div tags. By the way, I'm
> > > crawling a set of pages that I have control over.
> > >
> > > So my question is: is there a configuration option (or some other way) to
> > > control how the Tika parser will transform the document?
> > >
> > > Thanks a bunch!
> > > Felix
> > >
>
