Hi,

the answer depends on the use case:

1. Remove navigation from any page while crawling a lot of sites: see
NUTCH-961. The "boilerpipe" heuristics often work impressively well, but
sometimes they fail.

2. For a couple of sites you control or know well: implement a parse
filter plugin
(http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html).
The "filter" method should then return a ParseResult with the ParseText
replaced. To cleanse the ParseText you have to construct the plain text
anew from the DOM while skipping certain navigation elements (selected by
element name, class name, or id).
See also:
http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
http://lucene.472066.n3.nabble.com/how-to-adjust-content-tt3584160.html
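
A rough sketch of the DOM walk described in option 2, in case it helps. It is
not a complete Nutch plugin: it uses only JDK classes, and the class name, the
helper methods, and the skip lists are all made up for illustration. In a real
HtmlParseFilter the root node would be the DocumentFragment handed to
"filter", and the extracted string would go into the new ParseText. Element
names are compared in lower case on the assumption that the HTML parser may
deliver them in upper case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NavSkippingTextExtractor {

    // Hypothetical skip lists; adjust them to the markup of the sites you crawl.
    static final Set<String> SKIP_TAGS = new HashSet<>(Arrays.asList("nav", "script", "style"));
    static final Set<String> SKIP_CLASSES = new HashSet<>(Arrays.asList("menu", "breadcrumb"));
    static final Set<String> SKIP_IDS = new HashSet<>(Arrays.asList("navigation", "footer"));

    /** True if the element matches a skip list by element name, id, or class name. */
    public static boolean isSkipped(Element el) {
        if (SKIP_TAGS.contains(el.getNodeName().toLowerCase(Locale.ROOT))) return true;
        if (SKIP_IDS.contains(el.getAttribute("id"))) return true;
        // The class attribute may hold several whitespace-separated names.
        for (String cls : el.getAttribute("class").trim().split("\\s+")) {
            if (SKIP_CLASSES.contains(cls)) return true;
        }
        return false;
    }

    /** Depth-first walk that appends text nodes and prunes skipped subtrees. */
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(text);
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped((Element) node)) return;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    /** Builds the plain text of the subtree below root, without navigation. */
    public static String extractText(Node root) {
        StringBuilder out = new StringBuilder();
        collectText(root, out);
        return out.toString();
    }

    /** Demo helper only: parses well-formed XHTML with the JDK XML parser. */
    public static String extractFromXhtml(String xhtml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return extractText(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><div id=\"navigation\"><a href=\"/\">Home</a></div>"
            + "<ul class=\"menu top\"><li>Products</li></ul>"
            + "<p>Actual page content.</p></body></html>";
        System.out.println(extractFromXhtml(html)); // prints: Actual page content.
    }
}
```

Skipped subtrees only drop out of the extracted text. The outlinks are
collected separately during parsing, so the navigation can still be followed
for crawling.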

Cheers,
Sebastian


On 07/17/2013 09:35 PM, dogrdon wrote:
> admittedly this is a cross-post from stackoverflow, but I don't know if there
> are a whole lot of Nutch folks over there.
> 
> My question is about crawling HTML navigation menus, but not indexing the
> text for those links in Solr.
> 
> While I have seen some older discussions from several years ago about making
> this an option in later development, I am not really finding anything
> via searching that gives a good indication of how one might exclude site
> navigation menu content from the content that Nutch indexes to Solr during a
> crawl.
> 
> That is, I am seeing the navigation menu text in all content that is getting
> indexed and this damages search because then all content will have the same
> text in it. Obviously I want to keep using the site navigation for crawling,
> but I don't want it indexed. Is there a best practice for accomplishing this
> with Nutch? Like a way to wrap the navigation in some kind of tag, for
> example?
> 
> I am new to Nutch (obviously) so I don't know the best place that this would
> be accomplished.
> 
> thanks very much.
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-how-to-crawl-but-not-index-the-site-navigation-w-Solr-tp4078702.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
