Hi, the answer depends on the use case:

1. Remove navigation from every page while crawling a lot of sites: see
   NUTCH-961. The "boilerpipe" heuristic is often impressive, but it
   sometimes fails.

2. For a couple of sites that you control or know well: implement a parse
   filter plugin
   (http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html).
   The method "filter" should then return a ParseResult in which the
   ParseText has been replaced. To clean the ParseText you have to rebuild
   the plain text from the DOM while skipping the navigation elements
   (selected by element name, class name, or id).

See also:
http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
http://lucene.472066.n3.nabble.com/how-to-adjust-content-tt3584160.html

Cheers,
Sebastian

On 07/17/2013 09:35 PM, dogrdon wrote:
> admittedly this is a cross-post from stackoverflow, but I don't know if there
> are a whole lot of Nutch folks over there.
>
> My question is about crawling HTML navigation menus, but not indexing the
> text for those links in Solr.
>
> I have seen some older discussions from several years ago about making
> this an option in later development, but I am not really finding anything
> via searching that gives a good indication of how one might exclude site
> navigation menu content from the content that Nutch indexes to Solr during a
> crawl.
>
> That is, I am seeing the navigation menu text in all content that is getting
> indexed and this damages search because then all content will have the same
> text in it. Obviously I want to keep using the site navigation for crawling,
> but I don't want it indexed. Is there a best practice for accomplishing this
> with Nutch? Like a way to wrap the navigation in some kind of tag, for
> example?
>
> I am new to Nutch (obviously) so I don't know the best place that this would
> be accomplished.
>
> thanks very much.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-how-to-crawl-but-not-index-the-site-navigation-w-Solr-tp4078702.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
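
For option 2 in the reply, the "construct the plain text from DOM anew" step might look roughly like the sketch below. It uses only JDK classes so it can be run on its own. The class name NavSkippingText, the helper names, and the skip lists (tags, class names, ids) are all made up for illustration and are not part of Nutch. In a real plugin the DOM would be the DocumentFragment passed to HtmlParseFilter.filter, and the resulting string would go into the new ParseText; that wiring is not shown here.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NavSkippingText {
    // Hypothetical skip rules: adjust them to the markup of the crawled sites.
    static final Set<String> SKIP_TAGS = new HashSet<>(Arrays.asList("nav", "script", "style"));
    static final Set<String> SKIP_CLASSES = new HashSet<>(Arrays.asList("menu", "navigation"));
    static final Set<String> SKIP_IDS = new HashSet<>(Arrays.asList("navbar", "footer"));

    // True if the element matches a rule by element name, id, or any class token.
    static boolean isSkipped(Element e) {
        // Lower-cased because HTML parsers may report upper-case element names.
        if (SKIP_TAGS.contains(e.getNodeName().toLowerCase())) return true;
        if (SKIP_IDS.contains(e.getAttribute("id"))) return true;
        for (String c : e.getAttribute("class").trim().split("\\s+")) {
            if (SKIP_CLASSES.contains(c)) return true;
        }
        return false;
    }

    // Depth-first walk that appends text nodes and prunes skipped subtrees.
    static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped((Element) node)) return;
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Simplification: each text node is trimmed and joined with one space.
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(t);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    // Plain text of the subtree under root, without the skipped elements.
    public static String textOf(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString();
    }

    // Demo helper only: parses well-formed XHTML with the JDK parser.
    public static String textOfXhtml(String xhtml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return textOf(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"navbar\"><a href=\"/\">Home</a></div>"
                + "<ul class=\"menu main\"><li>Products</li></ul>"
                + "<p>Actual <b>article</b> text.</p></body></html>";
        System.out.println(textOfXhtml(page)); // prints: Actual article text.
    }
}
```

Pruning whole subtrees keeps the rules simple: once a container such as a menu list matches, nothing beneath it reaches the text. Links inside skipped elements are untouched by this, since the sketch only rebuilds the text, so the navigation can still be followed for crawling.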

