RE: Prevent parsers from stripping html tags

Markus Jelsma Mon, 08 May 2017 12:18:06 -0700

You mention you're indexing, but HTML is never indexed by default. Is that what 
you are looking for? The steps I mentioned only involve parsing.


Markus

 
 
-----Original message-----
> From:Matt Rutherford <[email protected]>
> Sent: Monday 8th May 2017 20:31
> To: [email protected]
> Subject: RE: Prevent parsers from stripping html tags
> 
> I uncommented this and the parse-tika plugin in plugin.includes but it
> still removed tags when indexing.
> 
> On 8 May 2017 6:57 p.m., "Markus Jelsma" <[email protected]> wrote:
> 
> > Hi - you need an identity mapper for Tika if I remember correctly:
> >
> > <property>
> >   <name>tika.htmlmapper.classname</name>
> >   <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
> >   <description>Classname of Tika HTMLMapper to use. Influences the elements
> >   included in the DOM and hence the behavior of the HTMLParseFilters.
> >   </description>
> > </property>
> >
> > Regards,
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From:Matt Rutherford <[email protected]>
> > > Sent: Monday 8th May 2017 19:45
> > > To: [email protected]
> > > Subject: Prevent parsers from stripping html tags
> > >
> > > I would like to maintain the html tags during the parsing stage so they
> > > also get indexed. How can I accomplish this?
> > >
> > > I tried removing the parser plugins (html and tika in my case) but it
> > > seems you need at least one and enabling either of these strips the
> > > markup from the docs.
> > >
> >
>
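
For reference, the two settings discussed in this thread could sit together in
conf/nutch-site.xml roughly as sketched below. This is only a sketch under
assumptions: the plugin.includes value is an illustrative plugin list, not
taken from the thread, and should be replaced with whatever your installation
already uses, with parse-tika added. Only the tika.htmlmapper.classname
property comes from the message above.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides. The plugin list is an assumption. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Illustrative value only: keep your own list and make sure parse-tika matches. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <!-- From the thread: keep all HTML elements in the DOM that Tika produces. -->
    <name>tika.htmlmapper.classname</name>
    <value>org.apache.tika.parser.html.IdentityHtmlMapper</value>
  </property>
</configuration>
```

As the property description says, the mapper influences which elements end up
in the DOM seen by the HTMLParseFilters. It does not by itself put markup into
the indexed content, which matches the point made in the latest reply. Which
parser actually handles text/html when both parse-html and parse-tika are
enabled is decided elsewhere (conf/parse-plugins.xml in a stock setup, as far
as I know), so that mapping may also need checking.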
