RE: Nutch how to crawl but not index the site navigation (w/ Solr)

Markus Jelsma Wed, 17 Jul 2013 14:11:06 -0700

Yes! Boilerpipe is the best open source alternative and has a working patch for 
Nutch! There are also some other open source extraction toolkits but they have 
not been ported to Tika or do not directly work with SAX ContentHandlers 
(usable in Tika) so they would require some work there plus integration in 
Nutch.


The problem with Boilerpipe is that it has different extractors, so you must use 
ArticleExtractor for article pages but CanolaExtractor for pages with many blocks.
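
For the second option in the quoted message below (rebuilding the plain text from the DOM while skipping navigation), here is a rough sketch of the tree walk only. It uses nothing but the JDK. The class name, the skip rules and the XHTML helper are made up for illustration. In a real Nutch plugin the DOM would presumably be the one handed to the HtmlParseFilter, and putting the resulting text back into the ParseResult is not shown here.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from a DOM tree
// while skipping subtrees that look like site navigation.
final class NavSkippingText {
    // Assumed skip rules; adjust them to the markup of the sites you crawl.
    static final Set<String> SKIP_TAGS = new HashSet<>(Arrays.asList("nav", "script", "style"));
    static final Set<String> SKIP_CLASSES = new HashSet<>(Arrays.asList("menu", "breadcrumb"));
    static final Set<String> SKIP_IDS = new HashSet<>(Arrays.asList("navigation", "footer"));

    // True if the element matches a skip rule by tag name, id, or any class name.
    static boolean isSkipped(Element e) {
        if (SKIP_TAGS.contains(e.getTagName().toLowerCase(Locale.ROOT))) return true;
        if (SKIP_IDS.contains(e.getAttribute("id"))) return true;
        for (String cls : e.getAttribute("class").trim().split("\\s+")) {
            if (SKIP_CLASSES.contains(cls)) return true;
        }
        return false;
    }

    // Depth-first walk: append text nodes, drop whole subtrees of skipped elements.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped((Element) node)) return;
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    static String extract(Node root) {
        StringBuilder out = new StringBuilder();
        collect(root, out);
        return out.toString();
    }

    // Demo helper for well-formed XHTML only; real pages need an HTML parser.
    static String extractFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return extract(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

Calling NavSkippingText.extractFromXhtml(...) on a page keeps the heading and body text and drops anything under an element matched by the skip rules. The links themselves stay in the DOM, so this only concerns the text that would be indexed.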
 
-----Original message-----
> From:Sebastian Nagel <[email protected]>
> Sent: Wednesday 17th July 2013 22:37
> To: [email protected]
> Subject: Re: Nutch how to crawl but not index the site navigation (w/ Solr)
> 
> Hi,
> 
> the answer depends on the use case:
> 
> 1. remove navigation from any page while crawling a lot of sites: see
> NUTCH-961. The "boilerpipe" heuristics are often impressive but sometimes
> they fail.
> 
> 2. for a couple of sites that you control or know well:
> implement a parse filter plugin
> (http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html).
> The "filter" method should then return a ParseResult
> with the ParseText replaced. To cleanse the ParseText you have to construct the
> plain text from the DOM anew while skipping certain navigation tags (by element,
> class name, or id).
> See also:
> http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
> http://lucene.472066.n3.nabble.com/how-to-adjust-content-tt3584160.html
> 
> Cheers,
> Sebastian
> 
> 
> On 07/17/2013 09:35 PM, dogrdon wrote:
> > admittedly this is a cross-post from stackoverflow, but I don't know if there
> > are a whole lot of Nutch folks over there.
> > 
> > My question is about crawling HTML navigation menus, but not indexing the
> > text for those links in Solr.
> > 
> > I have seen some older discussions from several years ago about making
> > this an option in later development, but I am not really finding anything
> > via searching that gives a good indication of how one might exclude site
> > navigation menu content from the content that Nutch indexes to Solr during a
> > crawl.
> > 
> > That is, I am seeing the navigation menu text in all content that is getting
> > indexed and this damages search because then all content will have the same
> > text in it. Obviously I want to keep using the site navigation for crawling,
> > but I don't want it indexed. Is there a best practice for accomplishing this
> > with Nutch? Like a way to wrap the navigation in some kind of tag, for
> > example?
> > 
> > I am new to Nutch (obviously) so I don't know the best place that this would
> > be accomplished.
> > 
> > thanks very much.
> > 
> > 
> > 
> > 
> > --
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Nutch-how-to-crawl-but-not-index-the-site-navigation-w-Solr-tp4078702.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> > 
> 
>
