Hello,

To check what is going to be indexed, use the indexchecker command; it gives a
more precise view of what is going to be sent. If you want to index the HTML,
use the -addBinaryContent option on the index command. It adds a field (I forgot
its name) with the contents of the Content directory of the segment.
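
As a rough sketch of those two steps, assuming a Nutch 1.x install run from its
runtime directory: the URL, crawldb path and segment name below are made-up
placeholders, so substitute your own. The snippet only prints the two commands
so they can be reviewed before running them by hand.

```shell
# Hypothetical values for illustration only; adjust to your own crawl layout.
NUTCH="bin/nutch"
URL="http://example.com/"
CRAWLDB="crawl/crawldb"
SEGMENT="crawl/segments/20170508213300"

# Step 1: preview the fields that would be sent to the index for one URL.
CHECK_CMD="$NUTCH indexchecker $URL"

# Step 2: index the segment and include the raw fetched content as an extra field.
INDEX_CMD="$NUTCH index $CRAWLDB $SEGMENT -addBinaryContent"

# Print the commands instead of executing them (a dry run).
echo "$CHECK_CMD"
echo "$INDEX_CMD"
```

Your indexing backend still needs a field in its schema to receive the extra
content; see the ticket for the details.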

Check the ticket [1] for more information.

Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-1785
 
-----Original message-----
> From:Matt Rutherford <[email protected]>
> Sent: Monday 8th May 2017 21:33
> To: [email protected]
> Subject: RE: Prevent parsers from stripping html tags
> 
> Yes, I realised that once I replied, my apologies!
> 
> If I use Nutch's parsechecker I can see ParseText still contains only plain
> text. I assume this is what gets indexed by the subsequent index operation.
> 
> I'd like to index the raw html file and not just the text. I had assumed
> this would need to be done at the parse stage but I feel you may be about
> to prove me wrong!
> 
> Matt
> 
> 
> On 8 May 2017 8:17 p.m., "Markus Jelsma" <[email protected]> wrote:
> 
> You mention you're indexing, but HTML is never indexed by default. Is that
> what you are looking for? The steps I mentioned only involve parsing.
> 
> Markus
> 
> 
> 
> -----Original message-----
> > From:Matt Rutherford <[email protected]>
> > Sent: Monday 8th May 2017 20:31
> > To: [email protected]
> > Subject: RE: Prevent parsers from stripping html tags
> >
> > I uncommented this and the parse-tika plugin in plugin.includes but it
> > still removed tags when indexing.
> >
> > On 8 May 2017 6:57 p.m., "Markus Jelsma" <[email protected]> wrote:
> >
> > > Hi - you need an identity mapper for Tika, if I remember correctly:
> > >
> > > <property>
> > >   <name>tika.htmlmapper.classname</name>
> > >   <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
> > >   <description>Classname of Tika HTMLMapper to use. Influences the
> > >   elements included in the DOM and hence the behavior of the
> > >   HTMLParseFilters.
> > >   </description>
> > > </property>
> > >
> > > Regards,
> > > Markus
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Matt Rutherford <[email protected]>
> > > > Sent: Monday 8th May 2017 19:45
> > > > To: [email protected]
> > > > Subject: Prevent parsers from stripping html tags
> > > >
> > > > I would like to maintain the html tags during the parsing stage so they
> > > > also get indexed. How can I accomplish this?
> > > >
> > > > I tried removing the parser plugins (html and tika in my case) but it seems
> > > > you need at least one and enabling either of these strips the markup from
> > > > the docs.
> > > >
> > >
> >
> 
