Hi! I'm fairly new to Nutch and I'm having a problem with parse-tika for HTML parsing. I searched the archive but couldn't find anything relevant.
I would like to use parse-tika to parse HTML and later index it to Solr. While parsing, Tika seems to remove quite a number of HTML tags and attributes. This does not really affect the text content that is later indexed, but it prevents me from using a parse filter to extract certain information based on the presence of particular div tags. By the way, I'm crawling a set of pages that I have control over.

So my question is: is there a configuration option (or some other way) to control how the Tika parser transforms the document?

Thanks a bunch!
Felix

