Hi Jorge, I was able to do what you suggested below, and it worked! Thanks so much for the help!
Jackie

-----Original Message-----
From: Jorge Luis Betancourt González [mailto:[email protected]]
Sent: Thursday, March 26, 2015 3:01 PM
To: [email protected]
Subject: Re: [MASSMAIL]RE: Ignore navigation during index

The patch that you mention should work nicely as long as you can provide the tags that you want to be excluded, so if it is an internal Intranet or a set of sites that don't change a lot, this should work. The Boilerpipe technique suggested by Markus is a more general solution, as it uses a library that applies some clever techniques to distinguish what is actually content from what is "noise" in the webpage. The choice is yours!

As for applying the patches, you should check out the source code for the version you're using and then apply the patch in the root of the checked-out code. This command should do the trick (the patch file attached to the issue should be downloaded first):

    patch -p0 < ~/Downloads/NUTCH-1928v5.patch

Afterwards you just need to compile a new binary from the patched source, following the instructions in the README file.

Regards,

----- Original Message -----
From: "Jacquelyn F. Richardson" <[email protected]>
To: [email protected]
Sent: Thursday, March 26, 2015 11:57:41 AM
Subject: [MASSMAIL]RE: Ignore navigation during index

Hi Markus,

Thanks for the reply. While waiting I found this:
https://issues.apache.org/jira/browse/NUTCH-585

Are you familiar with this patch? How does it compare with your suggestion? There are three attachments on the page. Which is the correct patch? I have never applied a patch to nutch before. Could you point me in the right direction as far as instructions for applying a patch to my environment?

Jackie

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, March 26, 2015 11:33 AM
To: [email protected]
Subject: RE: Ignore navigation during index

Hello - check out NUTCH-961. It adds support for Boilerpipe to Nutch's Tika parser. It's crude but works reasonably.

https://issues.apache.org/jira/browse/NUTCH-961

Markus

-----Original message-----
> From:Richardson, Jacquelyn F. <[email protected]>
> Sent: Thursday 26th March 2015 16:20
> To: [email protected]
> Subject: Ignore navigation during index
>
> Hi,
>
> Is there a way to tell nutch to ignore the navigation or footer parts of an
> html page during the crawl process? Specifically I do not want the
> information in the navigation or footer to be indexed. My environment is
> Windows 7 with Cygwin, Java 1.7, nutch 1.9 (binary not source) and solr 4.7.
>
> Any assistance will be greatly appreciated.
>
> Thanks,
> Jackie
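
For anyone following the patch steps described in this thread, here is a minimal sketch of applying a `-p0` patch from the root of a source tree, with a dry run first so that a patch that does not match leaves the files untouched. It assumes the `patch` utility is installed (on Cygwin it is a separate package). The helper name `apply_patch_checked` and the toy file and patch are hypothetical stand-ins for the real Nutch checkout and the downloaded patch file; nothing here is part of Nutch itself.

```shell
#!/bin/sh
# Sketch only: apply_patch_checked is a hypothetical helper, not a Nutch tool.
set -eu

apply_patch_checked() {
    src_root=$1    # root of the source checkout
    patch_file=$2  # absolute path to the downloaded patch file
    (
        cd "$src_root"
        # Dry run first: exits non-zero without modifying anything
        # if the hunks do not apply to this version of the source.
        patch -p0 --dry-run < "$patch_file" > /dev/null
        # The real run, same command as quoted in the thread.
        patch -p0 < "$patch_file"
    )
}

# Toy demonstration on a throwaway tree that stands in for the checkout.
demo=$(mktemp -d)
mkdir -p "$demo/checkout/src"
printf 'old line\n' > "$demo/checkout/src/hello.txt"

# A hand-written unified diff whose paths are relative to the checkout
# root, which is what -p0 expects.
cat > "$demo/toy.patch" <<'EOF'
--- src/hello.txt
+++ src/hello.txt
@@ -1 +1 @@
-old line
+new line
EOF

apply_patch_checked "$demo/checkout" "$demo/toy.patch"
cat "$demo/checkout/src/hello.txt"
```

With a real checkout the call would look like `apply_patch_checked /path/to/nutch-source ~/Downloads/NUTCH-1928v5.patch`, where the first path is a placeholder for wherever the source was checked out. Building the patched source afterwards is a separate step, as described in the README.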

