Yes, I realised that once I'd replied, my apologies! If I use Nutch's parsechecker I can see that ParseText still contains just the text, with the tags stripped. I assume this is what gets indexed by the subsequent index operation.
I'd like to index the raw html file and not just the text. I had assumed this would need to be done at the parse stage but I feel you may be about to prove me wrong!

Matt

On 8 May 2017 8:17 p.m., "Markus Jelsma" <[email protected]> wrote:

You mention you're indexing, but HTML is never indexed by default. Is that what you are looking for? The steps i mentioned only involve parsing.

Markus

-----Original message-----
> From:Matt Rutherford <[email protected]>
> Sent: Monday 8th May 2017 20:31
> To: [email protected]
> Subject: RE: Prevent parsers from stripping html tags
>
> I uncommented this and the parse-tika plugin in plugin.includes but it
> still removed tags when indexing.
>
> On 8 May 2017 6:57 p.m., "Markus Jelsma" <[email protected]> wrote:
>
> > Hi - you need an identity mapper for Tika if i remember correctly:
> >
> > <property>
> >   <name>tika.htmlmapper.classname</name>
> >   <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
> >   <description>Classname of Tika HTMLMapper to use. Influences the
> >   elements included in the DOM and hence
> >   the behavior of the HTMLParseFilters.
> >   </description>
> > </property>
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From:Matt Rutherford <[email protected]>
> > > Sent: Monday 8th May 2017 19:45
> > > To: [email protected]
> > > Subject: Prevent parsers from stripping html tags
> > >
> > > I would like to maintain the html tags during the parsing stage so they
> > > also get indexed. How can I accomplish this?
> > >
> > > I tried removing the parser plugins (html and tika in my case) but it seems
> > > you need at least one and enabling either of these strips the markup from
> > > the docs.
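
For anyone following the thread, here is a minimal sketch of how the two settings discussed above might sit together in a site configuration override (assumed here to be conf/nutch-site.xml). Only the tika.htmlmapper.classname property is quoted from Markus's message; the plugin.includes value is an illustrative assumption and should be adapted to the plugins actually in use. As Markus notes, this only influences parsing (the DOM seen by the HTMLParseFilters); it does not by itself put raw HTML into the index.

```xml
<?xml version="1.0"?>
<configuration>

  <!-- ASSUMPTION: illustrative plugin list, not taken from this thread.
       The relevant part is that parse-tika is enabled. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Quoted from the thread: keep all HTML elements in the DOM that
       Tika hands to the HTMLParseFilters. -->
  <property>
    <name>tika.htmlmapper.classname</name>
    <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
  </property>

</configuration>
```

With both properties in place, re-running parsechecker would be expected to show the same ParseText as before, since the extracted text is still plain text; the mapper changes which elements parse filters can see, not what the default indexing step stores.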

