Hello,

To check what is going to be indexed, use the indexchecker command; it gives a more precise view of what is going to be sent. If you want to index the HTML, use the -addBinaryContent option on the index command. It adds a field (I forgot its name) with the contents of the content directory of the segment.
Check the ticket [1] for more information.

Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-1785

-----Original message-----
> From:Matt Rutherford <[email protected]>
> Sent: Monday 8th May 2017 21:33
> To: [email protected]
> Subject: RE: Prevent parsers from stripping html tags
>
> Yes, I realised that once I replied my apologies!
>
> If I use nutch's parsechecker I can see ParseText still only extracts just
> text. I assume this is what gets indexed by the subsequent index operation.
>
> I'd like to index the raw html file and not just the text. I had assumed
> this would need to be done at the parse stage but I feel you may be about
> to prove me wrong!
>
> Matt
>
>
> On 8 May 2017 8:17 p.m., "Markus Jelsma" <[email protected]> wrote:
>
> You mention you're indexing, but HTML is never indexed by default. Is that
> what you are looking for? The steps i mentioned only involve parsing.
>
> Markus
>
>
> -----Original message-----
> > From:Matt Rutherford <[email protected]>
> > Sent: Monday 8th May 2017 20:31
> > To: [email protected]
> > Subject: RE: Prevent parsers from stripping html tags
> >
> > I uncommented this and the parse-tika plugin in plugin.includes but it
> > still removed tags when indexing.
> >
> > On 8 May 2017 6:57 p.m., "Markus Jelsma" <[email protected]> wrote:
> >
> > > Hi - you need an identity mapper for Tika if i remember correctly:
> > >
> > > <property>
> > >   <name>tika.htmlmapper.classname</name>
> > >   <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
> > >   <description>Classname of Tika HTMLMapper to use. Influences the
> > >   elements included in the DOM and hence
> > >   the behavior of the HTMLParseFilters.
> > >   </description>
> > > </property>
> > >
> > > Regards,
> > > Markus
> > >
> > >
> > > -----Original message-----
> > > > From:Matt Rutherford <[email protected]>
> > > > Sent: Monday 8th May 2017 19:45
> > > > To: [email protected]
> > > > Subject: Prevent parsers from stripping html tags
> > > >
> > > > I would like to maintain the html tags during the parsing stage so they
> > > > also get indexed. How can I accomplish this?
> > > >
> > > > I tried removing the parser plugins (html and tika in my case) but it seems
> > > > you need at least one and enabling either of these strips the markup from
> > > > the docs.
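
For anyone following the thread, the two commands named in the top reply (indexchecker and index with -addBinaryContent) might be invoked roughly as below. This is only a sketch under assumptions: the URL, the crawldb, linkdb and segment paths, and the dry-run wrapper `run` are placeholders invented for illustration and do not come from the thread, and the exact argument layout of the index command may differ between Nutch 1.x releases, so check the usage output of your own version first.

```shell
#!/bin/sh
# Sketch only: every path and the URL below are placeholders (assumptions),
# not values taken from the thread.
NUTCH="${NUTCH:-bin/nutch}"
URL="${URL:-http://example.com/}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
LINKDB="${LINKDB:-crawl/linkdb}"
SEGMENT="${SEGMENT:-crawl/segments/SEGMENT_NAME}"

# Hypothetical helper: run Nutch if the launcher is present and executable,
# otherwise only print the command line that would have been run.
run() {
  if [ -x "$NUTCH" ]; then
    "$NUTCH" "$@"
  else
    echo "would run: $NUTCH $*"
  fi
}

# Show the fields that would be sent to the index for a single URL.
run indexchecker "$URL"

# Index one segment and include the raw fetched content as an extra field.
run index "$CRAWLDB" -linkdb "$LINKDB" "$SEGMENT" -addBinaryContent
```

Running the script from a directory without a Nutch launcher only echoes the two command lines, which makes it easy to review them before pointing NUTCH at a real installation.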

