This is good to know, Markus. It presents some challenges:

- In wide-spectrum crawling, it is hard to know the page structure ahead of time.
- Even if we do, how do we specify something conditional in nutch-site.xml?

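For the narrower case Sebastian describes below (a few sites whose markup is known), here is a minimal sketch of the "construct the plain text from DOM anew while skipping navigation" step. It is not a Nutch plugin: the class name NavStripper and the skip lists are hypothetical, and it parses a well-formed XHTML string with the JDK parser only so that it runs standalone. In a real parse filter the DOM would presumably be the DocumentFragment that Nutch passes to HtmlParseFilter.filter, and the resulting text would replace the ParseText.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NavStripper {
    // Hypothetical skip rules; these would have to be tuned per site.
    static final Set<String> SKIP_TAGS = new HashSet<>(Arrays.asList("nav", "script", "style"));
    static final Set<String> SKIP_CLASSES = new HashSet<>(Arrays.asList("menu", "breadcrumb"));
    static final Set<String> SKIP_IDS = new HashSet<>(Arrays.asList("sidebar", "footer"));

    // True if the element matches a skip rule by tag name, id, or any of its class names.
    static boolean isSkipped(Element e) {
        if (SKIP_TAGS.contains(e.getTagName().toLowerCase(Locale.ROOT))) return true;
        if (SKIP_IDS.contains(e.getAttribute("id"))) return true;
        for (String c : e.getAttribute("class").trim().split("\\s+")) {
            if (SKIP_CLASSES.contains(c)) return true;
        }
        return false;
    }

    // Depth-first walk that appends text nodes and prunes skipped subtrees.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped((Element) node)) return;
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(t);
            }
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    static String extractText(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString();
    }

    // Stand-in for the DOM Nutch would supply; only handles well-formed XHTML.
    static Node parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"sidebar\"><a href=\"/a\">Home</a></div>"
                + "<ul class=\"menu top\"><li>Products</li></ul>"
                + "<p>Actual article text.</p>"
                + "</body></html>";
        System.out.println(extractText(parse(page))); // prints "Actual article text."
    }
}
```

Note that only the text is pruned here. The links inside the skipped subtrees are untouched, so outlink extraction for crawling could still run over the full DOM.
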
On Wed, Jul 17, 2013 at 2:10 PM, Markus Jelsma <[email protected]> wrote:

> Yes! Boilerpipe is the best open source alternative and has a working
> patch for Nutch! There are also some other open source extraction toolkits,
> but they have not been ported to Tika or do not directly work with SAX
> ContentHandlers (usable in Tika), so they would require some work there plus
> integration in Nutch.
>
> The problem with Boilerpipe is that it has different extractors, so you
> must use ArticleExtractor for article pages but Canola for pages with many
> blocks.
>
> -----Original message-----
> > From: Sebastian Nagel <[email protected]>
> > Sent: Wednesday 17th July 2013 22:37
> > To: [email protected]
> > Subject: Re: Nutch how to crawl but not index the site navigation (w/ Solr)
> >
> > Hi,
> >
> > the answer depends on the use case:
> >
> > 1. remove navigation for any page while crawling a lot of sites: see
> > NUTCH-961. The "boilerpipe" heuristics are often exciting but sometimes
> > they fail.
> >
> > 2. for a couple of sites you have control over or know well:
> > implement a parse filter plugin
> > (http://nutch.apache.org/apidocs-1.7/org/apache/nutch/parse/HtmlParseFilter.html).
> > Method "filter" then should return a ParseResult with replaced ParseText.
> > To cleanse ParseText you have to construct the plain text from DOM anew
> > while skipping certain navigation tags (by element, class name, or id).
> > See also:
> > http://lucene.472066.n3.nabble.com/Can-Nutch-index-parse-targeted-sections-of-a-web-page-td1785541.html
> > http://lucene.472066.n3.nabble.com/how-to-adjust-content-tt3584160.html
> >
> > Cheers,
> > Sebastian
> >
> > On 07/17/2013 09:35 PM, dogrdon wrote:
> > > admittedly this is a cross-post from stackoverflow, but I don't know if
> > > there are a whole lot of Nutch folks over there.
> > >
> > > My question is about crawling HTML navigation menus, but not indexing the
> > > text for those links in Solr.
> > >
> > > While I have seen some older discussions from several years ago about
> > > making this an option in later development, I am not really finding
> > > anything via searching that gives a good indication of how one might
> > > exclude site navigation menu content from the content that Nutch indexes
> > > to Solr during a crawl.
> > >
> > > That is, I am seeing the navigation menu text in all content that is
> > > getting indexed, and this damages search because then all content will
> > > have the same text in it. Obviously I want to keep using the site
> > > navigation for crawling, but I don't want it indexed. Is there a best
> > > practice for accomplishing this with Nutch? Like a way to wrap the
> > > navigation in some kind of tag, for example?
> > >
> > > I am new to Nutch (obviously) so I don't know the best place where this
> > > would be accomplished.
> > >
> > > thanks very much.
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Nutch-how-to-crawl-but-not-index-the-site-navigation-w-Solr-tp4078702.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.

