Tika removes tags which I'd prefer to keep.

Felix von Zadow Fri, 30 Sep 2016 05:05:11 -0700

Hi!

I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML 
parsing. I searched the archive but couldn't find anything.


I would like to use parse-tika to parse HTML and then index the result into Solr. 
During parsing, Tika seems to remove quite a number of HTML tags and attributes. 
This does not really affect the text content that is later indexed, but it 
prevents me from using a parse filter to extract certain information based on 
the presence of certain div tags. By the way, I'm crawling a set of pages that 
I have control over.
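
To make the use case concrete, here is a minimal sketch of the kind of check 
my parse filter would run on the parsed document. It uses only the plain JDK 
DOM API and is not actual Nutch or Tika code; the class name, the method names, 
and the "teaser" marker class are made up for illustration, and the second 
sample document is just a hand-written stand-in for a DOM in which the div is 
no longer present.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only (not part of Nutch or Tika).
class DivMarkerCheck {

    // Returns true if the subtree rooted at `node` contains a <div>
    // whose class attribute lists `className` as one of its tokens.
    static boolean hasDivWithClass(Node node, String className) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName())) {
                // getAttribute returns "" when the attribute is absent.
                for (String c : el.getAttribute("class").trim().split("\\s+")) {
                    if (c.equals(className)) {
                        return true;
                    }
                }
            }
        }
        // Recurse into all children (also covers document and fragment nodes).
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (hasDivWithClass(children.item(i), className)) {
                return true;
            }
        }
        return false;
    }

    // Parses a well-formed XHTML snippet into a DOM tree for the demo.
    static Node parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // What I would like the filter to see: the marker div is present.
        Node full = parse(
            "<html><body><div class=\"teaser main\">Hello</div></body></html>");
        // Stand-in for a DOM without the div: only the text survives.
        Node stripped = parse("<html><body><p>Hello</p></body></html>");

        System.out.println(hasDivWithClass(full, "teaser"));     // true
        System.out.println(hasDivWithClass(stripped, "teaser")); // false
    }
}
```

The text "Hello" is available in both cases, which is why indexing is 
unaffected, but a check like this can only succeed if the div and its class 
attribute are still in the DOM that the filter receives.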

So my question is: is there a configuration option (or some other way) to 
control which tags and attributes the Tika parser keeps when it transforms 
the document?

Thanks a bunch!
Felix
