Re: Block certain parts of HTML code from being indexed

2018-11-16 Thread Semyon Semyonov
Hi Hany,
 
There is another (dirty) solution: you can modify the content during parsing if 
you don't need it at all. It is probably not how you should do it, but you can 
if you really want to.

For example, you can modify or delete Node values in
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java,
in the method
private boolean getTextHelper(StringBuffer sb, Node node,
    boolean abortOnNestedAnchors, int anchorDepth)
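
The idea of dropping nodes while the text is being collected can be sketched in isolation. This is not the Nutch code and not a patch for getTextHelper: the class name, the SKIPPED set and the methods below are hypothetical, and the demo parses well-formed XHTML with the JDK's own XML parser instead of the parser Nutch uses. It only shows the recursion skipping a whole subtree when it meets an unwanted element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class SkipNodesSketch {
    // Hypothetical list of elements whose text should never be collected.
    static final Set<String> SKIPPED = Set.of("header", "footer", "nav");

    // Depth-first walk that appends text nodes to sb, but returns early
    // (dropping the whole subtree) when it reaches a skipped element.
    static void collectText(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && SKIPPED.contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(sb, child);
        }
    }

    // Parses a well-formed XHTML string and returns the text that survives.
    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder sb = new StringBuilder();
            collectText(sb, doc.getDocumentElement());
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><header>Site menu</header>"
                + "<p>Main article text.</p>"
                + "<footer>Copyright</footer></body></html>";
        System.out.println(extract(page)); // prints "Main article text."
    }
}
```

In Nutch itself the equivalent check would have to live inside the existing traversal, so it means maintaining a patched parse-html plugin across upgrades, which is why this counts as the dirty route.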


Semyon


Re: Block certain parts of HTML code from being indexed

2018-11-16 Thread Jorge Betancourt
Hi Hany,

As BlackIce said, there is an open issue at
https://issues.apache.org/jira/browse/NUTCH-585, specifically the
(blacklist_whitelist_plugin) patch. I'm not sure the patch can still be
applied directly to master (probably not), but it should provide a good
general idea of how to write a custom plugin for removing specific HTML
nodes from the crawl.
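
To give a feel for what such a plugin has to do at its core, here is a minimal sketch of pruning DOM nodes by CSS class. It is not taken from the NUTCH-585 patch; the class name, method names and the sample class values are all hypothetical, and the wiring into a Nutch plugin (extension point, plugin.xml, configuration of which classes to drop) is left out. The demo parses well-formed XHTML with the JDK's XML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodePruneSketch {
    // Removes every element under root that carries cssClass in its
    // class attribute. Returns the number of removed subtrees.
    static int removeByClass(Node root, String cssClass) {
        List<Node> doomed = new ArrayList<>();
        collect(root, cssClass, doomed);
        int removed = 0;
        // Remove after the walk so the tree is not modified while iterating.
        for (Node n : doomed) {
            Node parent = n.getParentNode();
            if (parent != null) {
                parent.removeChild(n);
                removed++;
            }
        }
        return removed;
    }

    private static void collect(Node node, String cssClass, List<Node> out) {
        if (node instanceof Element) {
            String classes = ((Element) node).getAttribute("class");
            for (String c : classes.trim().split("\\s+")) {
                if (c.equals(cssClass)) {
                    out.add(node);
                    return; // descendants are removed together with it
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, cssClass, out);
        }
    }

    // Parses well-formed XHTML, prunes the given classes, returns the text.
    static String textAfterPruning(String xhtml, String... cssClasses) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)));
            for (String cssClass : cssClasses) {
                removeByClass(doc, cssClass);
            }
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class=\"site-header\">Menu</div>"
                + "<div class=\"content main\">Article</div>"
                + "<div class=\"site-footer\">Legal</div>"
                + "</body></html>";
        // prints "Article"
        System.out.println(textAfterPruning(page, "site-header", "site-footer"));
    }
}
```

Whether pruning at this stage affects the indexed text depends on where in the parse chain the plugin runs relative to text extraction, so treat this as the node-removal logic only.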

Hope it helps,
Jorge

On Fri, Nov 16, 2018 at 10:30 AM BlackIce  wrote:

> There was a plugin a while ago that allowed you to specify different tags
> to be indexed or excluded from indexing. If I'm not mistaken, it was
> this:
>
>
> http://www.longconnections.com/blog/2015/6/3/using-apache-nutchsolr-to-build-a-search-engine-with-auto-complete-feature
>
> Good luck, and please let me know what you come up with. Thank you!

RE: Block certain parts of HTML code from being indexed

2018-11-16 Thread hany . nasr
Has anyone faced this requirement before?

Kind regards, 
Hany Shehata
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!



RE: Block certain parts of HTML code from being indexed

2018-11-15 Thread hany . nasr
Hello Markus,

What if I want to remove a specific component or page section?

Kind regards, 
Hany Shehata



RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Markus Jelsma
Hello Hany,

Using parse-tika as your HTML parser, you can enable Boilerpipe (see 
nutch-default).
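
As a rough illustration of that setting, an override in conf/nutch-site.xml might look like the fragment below. The two property names are given as they appear in nutch-default.xml to the best of my recollection for the 1.x line; please check them, and the accepted algorithm values, against the nutch-default.xml shipped with your release. The fragment also assumes parse-tika is already the parser handling text/html (plugin.includes and parse-plugins.xml), which is not shown here.

```xml
<!-- conf/nutch-site.xml: overrides of defaults from nutch-default.xml -->
<configuration>
  <property>
    <name>tika.extractor</name>
    <!-- assumed values: "none" (default) or "boilerpipe" -->
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <!-- one of the Boilerpipe extractors; ArticleExtractor is assumed here -->
    <value>ArticleExtractor</value>
  </property>
</configuration>
```

Boilerpipe decides heuristically what counts as boilerplate, so it will not let you name a specific header or footer element to drop.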

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com 
> Sent: Wednesday 14th November 2018 15:53
> To: user@nutch.apache.org
> Subject: Block certain parts of HTML code from being indexed
> 
> Hello All,
> 
> I am using Nutch 1.15, and wondering if there is a feature for blocking 
> certain parts of HTML code from being indexed (header & footer).
> 
> Kind regards,
> Hany Shehata


RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Yossi Tamari
Hi Hany,

The Tika parser supports Boilerpipe for header and footer removal, but I don't 
know how well it works.
You can test it online at https://boilerpipe-web.appspot.com/

