Also, you still need that Tika mapper setting. If not specified, Tika will remap tags to something more generic, making parser plugins that process tags unusable. Same goes for indexing HTML.
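For reference, a minimal sketch of where that setting would go. It assumes local overrides live in conf/nutch-site.xml (the usual file for overriding nutch-default.xml); the property name and value are the ones quoted further down in this thread, and the comments are only illustrative:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <!-- Keep the original HTML elements instead of Tika's default remapping,
       so parse filters that inspect tags still see them -->
  <property>
    <name>tika.htmlmapper.classname</name>
    <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
  </property>
</configuration>
```

Note that this only influences the DOM handed to the parse filters. Getting the raw HTML into the index is a separate step (the -addBinaryContent option mentioned below).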

Markus

-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Monday 8th May 2017 21:55
> To: [email protected]
> Subject: RE: Prevent parsers from stripping html tags
>
> Hello,
>
> To check what is going to be indexed, use the indexchecker command, it gives
> a more precise view on what is going to be sent. If you want to index the
> HTML, use the -addBinaryContent on the index command. It adds a field (forgot
> its name) with the contents of the Content directory of the segment
>
> Check the ticket [1] for more information.
>
> Regards,
> Markus
>
> [1] https://issues.apache.org/jira/browse/NUTCH-1785
>
> -----Original message-----
> > From:Matt Rutherford <[email protected]>
> > Sent: Monday 8th May 2017 21:33
> > To: [email protected]
> > Subject: RE: Prevent parsers from stripping html tags
> >
> > Yes, I realised that once I replied my apologies!
> >
> > If I use nutch's parsechecker I can see ParseText still only extracts just
> > text. I assume this is what gets indexed by the subsequent index operation.
> >
> > I'd like to index the raw html file and not just the text. I had assumed
> > this would need to be done at the parse stage but I feel you may be about
> > to prove me wrong!
> >
> > Matt
> >
> >
> > On 8 May 2017 8:17 p.m., "Markus Jelsma" <[email protected]> wrote:
> >
> > You mention you're indexing, but HTML is never indexed by default. Is that
> > what you are looking for? The steps i mentioned only involve parsing.
> >
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From:Matt Rutherford <[email protected]>
> > > Sent: Monday 8th May 2017 20:31
> > > To: [email protected]
> > > Subject: RE: Prevent parsers from stripping html tags
> > >
> > > I uncommented this and the parse-tika plugin in plugin.includes but it
> > > still removed tags when indexing.
> > >
> > > On 8 May 2017 6:57 p.m., "Markus Jelsma" <[email protected]> wrote:
> > >
> > > > Hi - you need an identity mapper for Tika if i remember correctly:
> > > >
> > > > <property>
> > > >   <name>tika.htmlmapper.classname</name>
> > > >   <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
> > > >   <description>Classname of Tika HTMLMapper to use. Influences the
> > > >   elements included in the DOM and hence
> > > >   the behavior of the HTMLParseFilters.
> > > >   </description>
> > > > </property>
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:Matt Rutherford <[email protected]>
> > > > > Sent: Monday 8th May 2017 19:45
> > > > > To: [email protected]
> > > > > Subject: Prevent parsers from stripping html tags
> > > > >
> > > > > I would like to maintain the html tags during the parsing stage so they
> > > > > also get indexed. How can I accomplish this?
> > > > >
> > > > > I tried removing the parser plugins (html and tika in my case) but it seems
> > > > > you need at least one and enabling either of these strips the markup from
> > > > > the docs.

