RE: Prevent parsers from stripping html tags

Matt Rutherford Mon, 08 May 2017 12:33:29 -0700

Yes, I realised that once I had replied, my apologies!

If I use Nutch's parsechecker I can see that ParseText still contains only
plain text. I assume this is what gets indexed by the subsequent index operation.


I'd like to index the raw HTML file and not just the text. I had assumed
this would need to be done at the parse stage but I feel you may be about
to prove me wrong!

Matt


On 8 May 2017 8:17 p.m., "Markus Jelsma" <[email protected]> wrote:

You mention you're indexing, but HTML is never indexed by default. Is that
what you are looking for? The steps I mentioned only involve parsing.

Markus



-----Original message-----
> From:Matt Rutherford <[email protected]>
> Sent: Monday 8th May 2017 20:31
> To: [email protected]
> Subject: RE: Prevent parsers from stripping html tags
>
> I uncommented this and the parse-tika plugin in plugin.includes but it
> still removed tags when indexing.
>
> On 8 May 2017 6:57 p.m., "Markus Jelsma" <[email protected]> wrote:
>
> > Hi - you need an identity mapper for Tika if I remember correctly:
> >
> > <property>
> >   <name>tika.htmlmapper.classname</name>
> >   <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
> >   <description>Classname of Tika HTMLMapper to use. Influences the
> >   elements included in the DOM and hence the behavior of the
> >   HTMLParseFilters.
> >   </description>
> > </property>
> >
> > Regards,
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From:Matt Rutherford <[email protected]>
> > > Sent: Monday 8th May 2017 19:45
> > > To: [email protected]
> > > Subject: Prevent parsers from stripping html tags
> > >
> > > I would like to maintain the html tags during the parsing stage so they
> > > also get indexed. How can I accomplish this?
> > >
> > > I tried removing the parser plugins (html and tika in my case) but it
> > > seems you need at least one and enabling either of these strips the
> > > markup from the docs.
> > >
> >
>
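
For reference, the settings discussed in this thread could be collected in
conf/nutch-site.xml roughly as follows. This is a minimal sketch, not a
confirmed fix: the tika.htmlmapper.classname property is the one Markus
quotes above, while the plugin.includes value is an assumed, illustrative
plugin list in which only the presence of parse-tika matters. As noted in the
thread, this only changes which elements end up in the DOM during parsing; it
does not by itself cause HTML to be indexed.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- From the thread: keep all HTML elements in the DOM that Tika builds. -->
  <property>
    <name>tika.htmlmapper.classname</name>
    <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
  </property>

  <!-- Assumption: example plugin list. Adjust to your own setup; the only
       part taken from the thread is that parse-tika must be enabled. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Properties set in nutch-site.xml override the defaults shipped in
nutch-default.xml, so nothing needs to be uncommented or changed in the
default file itself.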
