Re: Tika removes tags which I'd prefer to keep.

Felix von Zadow Fri, 30 Sep 2016 06:11:44 -0700

Thanks loads Markus!
I must have scrolled past that part of nutch-default.xml a hundred times and 
somehow even managed to overlook it when searching for "tika".


Not only does Tika no longer chop up my HTML, I now also know about Tika's 
HtmlMapper and might implement my own if necessary.

Thanks again!
Felix
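
Note for readers of the archive: a minimal sketch of the setting discussed in
this thread. It assumes the override goes into conf/nutch-site.xml (the usual
place for overriding values from nutch-default.xml) and that the mapper meant
is Tika's org.apache.tika.parser.html.IdentityHtmlMapper, written as
IdentityHTMLMapper in the quoted reply below. The property name is taken from
that reply; check the description of the property in your own
nutch-default.xml before relying on it.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<property>
  <name>tika.htmlmapper.classname</name>
  <!-- Assumption: fully qualified name of Tika's identity mapper,
       which passes HTML elements and attributes through unchanged. -->
  <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
</property>
```

A custom mapper, as mentioned above, would presumably be configured the same
way by putting its own fully qualified class name in the value.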

> -----Original message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Friday, 30 September 2016 14:49
> To: [email protected]
> Subject: RE: Tika removes tags which I'd prefer to keep.
> 
> Felix, I've looked it up. Use the tika.htmlmapper.classname parameter to set
> IdentityHTMLMapper and you should be fine.
> Markus
> 
> 
> 
> -----Original message-----
> > From:Felix von Zadow <[email protected]>
> > Sent: Friday 30th September 2016 14:43
> > To: [email protected]
> > Subject: Re: Tika removes tags which I'd prefer to keep.
> >
> >
> > Markus, thanks for the pointer. So basically all I need to do is uncomment
> > line 117 in TikaParser.java [1] (Nutch 2.3.1) and I'm done!?
> >
> > [1]
> > https://github.com/apache/nutch/blob/branch-2.3.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L117
> >
> > > -----Original message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Friday, 30 September 2016 14:15
> > > To: [email protected]
> > > Subject: RE: Tika removes tags which I'd prefer to keep.
> > >
> > > Hello - Tika does some HTML mapping under the hood, but it is
> > > configurable. Tell Tika to use the IdentityMapper. I am not sure
> > > anymore which param you need, check out TikaParser.java, it is somewhere
> > > near the bottom.
> > >
> > > Markus
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Felix von Zadow <[email protected]>
> > > > Sent: Friday 30th September 2016 14:04
> > > > To: [email protected]
> > > > Subject: Tika removes tags which I'd prefer to keep.
> > > >
> > > >
> > > > Hi!
> > > >
> > > > I'm fairly new to Nutch and I'm having a problem with parse-tika
> > > > for HTML parsing. I searched the archive but couldn't find anything.
> > > >
> > > > I would like to use parse-tika for parsing HTML and later indexing to
> > > > Solr. While parsing, tika seems to remove quite a number of HTML tags
> > > > and attributes. While this does not really affect the text content
> > > > that is later indexed, it prevents me from using a parse filter to
> > > > extract certain information based on the existence of certain
> > > > div-tags. I'm by the way crawling a set of pages that I have control
> > > > over.
> > > >
> > > > So my question is: is there a configuration option (or some other
> > > > way) to control how the tika parser will transform the document?
> > > >
> > > > Thanks a bunch!
> > > > Felix
> > > >
> >
