There was a plugin a while ago that let you specify which tags should be
indexed or excluded from indexing. If I'm not mistaken, it was described
here:

http://www.longconnections.com/blog/2015/6/3/using-apache-nutchsolr-to-build-a-search-engine-with-auto-complete-feature
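
For reference, the Boilerpipe route Markus suggests in the quoted message
below would look roughly like this in conf/nutch-site.xml. This is only a
sketch: the property names are as I remember them from nutch-default.xml in
Nutch 1.x, so please check them against your own copy. It also assumes that
parse-tika is listed in plugin.includes and is the parser that actually
handles text/html on your setup.

```xml
<!-- conf/nutch-site.xml: sketch only; verify names and allowed values
     against the nutch-default.xml shipped with your Nutch version -->
<configuration>
  <!-- Ask parse-tika to run Boilerpipe text extraction (default is "none") -->
  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <!-- Boilerpipe algorithm to apply; ArticleExtractor is one documented choice -->
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```

Boilerpipe decides heuristically what counts as boilerplate, so it will not
necessarily drop exactly the header and footer; re-parse a few sample pages
and inspect the extracted text before reindexing.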

Good luck, and please let me know what you come up with. Thank you!

On Fri, Nov 16, 2018 at 10:04 AM <hany.n...@hsbc.com> wrote:

> Has anyone faced this requirement before?
>
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __________________________________________________________________
>
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __________________________________________________________________
> Protect our environment - please only print this if you have to!
>
>
> -----Original Message-----
> From: Hany NASR
> Sent: Thursday, November 15, 2018 4:18 PM
> To: user@nutch.apache.org
> Subject: RE: Block certain parts of HTML code from being indexed
>
> Hello Markus,
>
> What if I want to remove a specific component or page section?
>
> Kind regards,
> Hany Shehata
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, November 14, 2018 4:11 PM
> To: user@nutch.apache.org
> Subject: RE: Block certain parts of HTML code from being indexed
>
> Hello Hany,
>
> Using parse-tika as your HTML parser, you can enable Boilerpipe (see
> nutch-default).
>
> Regards,
> Markus
>
>
>
> -----Original message-----
> > From:hany.n...@hsbc.com <hany.n...@hsbc.com>
> > Sent: Wednesday 14th November 2018 15:53
> > To: user@nutch.apache.org
> > Subject: Block certain parts of HTML code from being indexed
> >
> > Hello All,
> >
> > I am using Nutch 1.15 and am wondering whether there is a feature for
> > blocking certain parts of HTML code (header and footer) from being indexed.
> >
> > Kind regards,
> > Hany Shehata
> >
> >
> >
> > -----------------------------------------
> > SAVE PAPER - THINK BEFORE YOU PRINT!
> >
> > This E-mail is confidential.
> >
> > It may also be legally privileged. If you are not the addressee you
> > may not copy, forward, disclose or use any part of it. If you have
> > received this message in error, please delete it and all copies from
> > your system and notify the sender immediately by return E-mail.
> >
> > Internet communications cannot be guaranteed to be timely secure, error
> > or virus-free.
> > The sender does not accept liability for any errors or omissions.
> >
>
>
