Hello, Tika does some HTML mapping under the hood, but this is configurable: 
tell Tika to use the IdentityMapper. I no longer remember which parameter you 
need; check TikaParser.java, it is read somewhere near the bottom.
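
As a rough sketch of what that override might look like in conf/nutch-site.xml: 
the property name below is an assumption recalled from the Nutch 1.x 
nutch-default.xml and is not confirmed in this thread, so verify it against 
TikaParser.java before relying on it. The class name refers to Tika's 
org.apache.tika.parser.html.IdentityHtmlMapper, which passes elements and 
attributes through instead of mapping them to Tika's reduced "safe" set.

```xml
<!-- conf/nutch-site.xml: assumed override, check the property name in TikaParser.java -->
<property>
  <!-- Assumption: name of the property that TikaParser reads to pick the HtmlMapper -->
  <name>tika.htmlmapper.classname</name>
  <!-- Tika's identity mapper: keeps tags such as div and their attributes -->
  <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
</property>
```

If the property is picked up, a parse filter running after parse-tika should 
see the div elements in the DOM it receives; re-parsing a single page is a 
quick way to check.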

Markus

 
 
-----Original message-----
> From:Felix von Zadow <[email protected]>
> Sent: Friday 30th September 2016 14:04
> To: [email protected]
> Subject: Tika removes tags which I'd prefer to keep.
> 
> 
> Hi!
> 
> I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML 
> parsing. I searched the archive but couldn't find anything.
> 
> I would like to use parse-tika for parsing HTML and later indexing to Solr. 
> While parsing, tika seems to remove quite a number of HTML tags and 
> attributes. While this does not really affect the text content that is later 
> indexed, it prevents me from using a parse filter to extract certain 
> information based on the existence of certain div-tags. I'm by the way 
> crawling a set of pages that I have control over.
> 
> So my question is: is there a configuration option (or some other way) to 
> control how the tika parser will transform the document?
> 
> Thanks a bunch!
> Felix
> 
