Re: Nutch how to crawl but not index the site navigation (w/ Solr)

Joe Zhang Fri, 19 Jul 2013 10:23:26 -0700

This is good to know, Markus. This presents some challenges:

- In wide-spectrum crawling, it is hard to know the page structure ahead of
time.
- Even if we do, how do we specify something conditional in nutch-site.xml?



On Wed, Jul 17, 2013 at 2:10 PM, Markus Jelsma
<[email protected]> wrote:

> Yes! Boilerpipe is the best open source alternative and has a working
> patch for Nutch! There are also some other open source extraction toolkits
> but they have not been ported to Tika or do not directly work with SAX
> ContentHandlers (usable in Tika) so they would require some work there plus
> integration in Nutch.
>
> The problem with Boilerpipe is that it has different extractors, so you
> must use ArticleExtractor for article pages but CanolaExtractor for pages
> with many blocks.
>
> -----Original message-----
> > From:Sebastian Nagel <[email protected]>
> > Sent: Wednesday 17th July 2013 22:37
> > To: [email protected]
> > Subject: Re: Nutch how to crawl but not index the site navigation (w/ Solr)
> >
> > Hi,
> >
> > the answer depends on the use case:
> >
> > 1. remove navigation for any page while crawling a lot of sites: see
> > NUTCH-961. The "boilerpipe" heuristic is often exciting but sometimes
> > it fails.
> >
> > 2. for a couple of sites you control or know well:
> > implement a parse filter plugin
> > (http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html).
> > The "filter" method should then return a ParseResult with replaced
> > ParseText. To cleanse ParseText you have to construct the plain text
> > from the DOM anew while skipping certain navigation tags (by element,
> > class name, or id).
> > See also:
> > http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
> > http://lucene.472066.n3.nabble.com/how-to-adjust-content-tt3584160.html
> >
> > Cheers,
> > Sebastian
> >
> >
> > On 07/17/2013 09:35 PM, dogrdon wrote:
> > > admittedly this is a cross-post from stackoverflow, but I don't know
> > > if there are a whole lot of Nutch folks over there.
> > >
> > > My question is about crawling HTML navigation menus, but not indexing
> > > the text for those links in Solr.
> > >
> > > I have seen some older discussions from several years ago about making
> > > this an option in later development, but I am not really finding
> > > anything via searching that gives a good indication of how one might
> > > exclude site navigation menu content from the content that Nutch
> > > indexes to Solr during a crawl.
> > >
> > > That is, I am seeing the navigation menu text in all content that is
> > > getting indexed, and this damages search because then all content will
> > > have the same text in it. Obviously I want to keep using the site
> > > navigation for crawling, but I don't want it indexed. Is there a best
> > > practice for accomplishing this with Nutch? Like a way to wrap the
> > > navigation in some kind of tag, for example?
> > >
> > > I am new to Nutch (obviously) so I don't know the best place that this
> > > would be accomplished.
> > >
> > > thanks very much.
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Nutch-how-to-crawl-but-not-index-the-site-navigation-w-Solr-tp4078702.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> >
> >
>
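
For reference, the second option Sebastian describes (rebuilding the plain text from the DOM while skipping navigation elements by tag, class name, or id) could be sketched roughly as below. This is only an illustration under stated assumptions: the class name NavStripper, the skip lists, and the sample markup are all hypothetical, and it uses the JDK's own XML parser on well-formed XHTML rather than the Nutch plugin API. In a real HtmlParseFilter plugin the same traversal would presumably run over the DOM that Nutch hands to the filter, and the result would replace the ParseText.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NavStripper {
    // Hypothetical skip rules; a real site would need its own lists.
    static final Set<String> SKIP_TAGS = new HashSet<>(Arrays.asList("nav", "header", "footer"));
    static final Set<String> SKIP_CLASSES = new HashSet<>(Arrays.asList("menu", "breadcrumb"));
    static final Set<String> SKIP_IDS = new HashSet<>(Arrays.asList("sidebar"));

    // True if the element matches a skip rule by tag name, id, or class name.
    static boolean isNavigation(Element e) {
        if (SKIP_TAGS.contains(e.getTagName().toLowerCase())) return true;
        if (SKIP_IDS.contains(e.getAttribute("id"))) return true;
        for (String c : e.getAttribute("class").split("\\s+")) {
            if (SKIP_CLASSES.contains(c)) return true;
        }
        return false;
    }

    // Depth-first walk that appends text nodes and prunes navigation subtrees.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(t);
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE && isNavigation((Element) node)) {
            return; // skip the whole subtree, including its link text
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parses well-formed XHTML and returns its text without navigation blocks.
    static String extractText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><nav><a href=\"/\">Home</a></nav>"
                + "<div id=\"sidebar\">Links</div>"
                + "<ul class=\"top menu\"><li>About</li></ul>"
                + "<p>Actual article text.</p></body></html>";
        System.out.println(extractText(page)); // prints "Actual article text."
    }
}
```

Note that this only changes the text that would be indexed. The links inside the skipped elements are untouched in the DOM, so outlink extraction for crawling is a separate step and is not affected by the pruning here. It also does not address the wide-crawl case, where the page structure is not known ahead of time.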
