Hi Jorge,

I did what you suggested below, and it worked!  Thanks so much for
the help!

Jackie

-----Original Message-----
From: Jorge Luis Betancourt González [mailto:[email protected]] 
Sent: Thursday, March 26, 2015 3:01 PM
To: [email protected]
Subject: Re: [MASSMAIL]RE: Ignore navigation during index

The patch you mention should work nicely as long as you can provide the
tags that you want excluded, so if it is an internal intranet or a set of sites
that don't change much, this should work. The Boilerpipe technique suggested by
Markus is a more general solution, as it uses a library that applies some clever
techniques to distinguish what is actually content from what is "noise" in the
webpage. The choice is yours!

As for applying the patches, you should check out the source code for the
version you're using and then apply the patch in the root of the checked-out
code. This command should do the trick (the patch file attached to the issue
should be downloaded first):

patch -p0 < ~/Downloads/NUTCH-1928v5.patch

Afterwards you just need to compile a new binary from the patched source 
following the instructions in the README file.
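
If you want to be a bit more careful, here is a minimal sketch (my own
illustration, not something taken from the issue itself): it does a dry run of
the patch first and only applies it if the dry run succeeds. The function name
apply_patch_checked is made up for this example, and the patch file path must
be absolute because the function changes directory before reading it.

```shell
# Hypothetical helper: apply a patch only if a dry run succeeds.
#   $1 = root directory of the checked-out source
#   $2 = absolute path to the downloaded patch file
apply_patch_checked() {
    src_root="$1"
    patch_file="$2"
    # --dry-run reports what would happen without modifying any file
    if (cd "$src_root" && patch -p0 --dry-run < "$patch_file") >/dev/null; then
        # The dry run was clean, so apply the patch for real
        (cd "$src_root" && patch -p0 < "$patch_file")
    else
        echo "patch does not apply cleanly; check that the source matches your version" >&2
        return 1
    fi
}

# Example call (both paths are placeholders for your own locations):
#   apply_patch_checked "$HOME/nutch-src" "$HOME/Downloads/NUTCH-1928v5.patch"
```

If the dry run fails, nothing in the checkout is touched, which usually means
the source version does not match the one the patch was written against.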

Regards,

----- Original Message -----
From: "Jacquelyn F. Richardson" <[email protected]>
To: [email protected]
Sent: Thursday, March 26, 2015 11:57:41 AM
Subject: [MASSMAIL]RE: Ignore navigation during index

Hi Markus,

Thanks for the reply.  While waiting I found this:
https://issues.apache.org/jira/browse/NUTCH-585

Are you familiar with this patch?  How does this compare with your suggestion?

There are three attachments on the page.  Which is the correct patch?

I have never applied a patch to nutch before.  Could you point me in the right 
direction as far as instructions for applying a patch to my environment?

Jackie

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Thursday, March 26, 2015 11:33 AM
To: [email protected]
Subject: RE: Ignore navigation during index

Hello - check out NUTCH-961. It adds support for Boilerpipe to Nutch's Tika
parser. It's crude but works reasonably well.
https://issues.apache.org/jira/browse/NUTCH-961

Markus
 
 
-----Original message-----
> From:Richardson, Jacquelyn F. <[email protected]>
> Sent: Thursday 26th March 2015 16:20
> To: [email protected]
> Subject: Ignore navigation during index
> 
> Hi,
> 
> Is there a way to tell nutch to ignore the navigation or footer parts of an 
> html page during the crawl process?  Specifically I do not want the 
> information in the navigation or footer to be indexed.  My environment is 
> Windows 7 with Cygwin, Java 1.7, nutch 1.9 (binary not source) and solr 4.7.
> 
> Any assistance will be greatly appreciated.
> 
> Thanks,
> Jackie
> 
> 
