It's a use case for a fetch filter: https://issues.apache.org/jira/browse/NUTCH-828
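
To make the heuristic from the quoted message concrete, here is a rough sketch in plain Java. It deliberately does not use any Nutch API: the class LanguageFocusedFrontier, the Page type and the language code "xx" are all made up for illustration, and a real implementation would live in a Nutch filter plugin instead. It only shows the decision itself: an outlink is kept when at least one page in the target language points to it.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, Nutch-independent sketch of the language-focused heuristic.
public class LanguageFocusedFrontier {

    // A fetched and parsed page: URL, detected language code, and outlinks.
    static final class Page {
        final String url;
        final String lang;
        final List<String> outlinks;

        Page(String url, String lang, List<String> outlinks) {
            this.url = url;
            this.lang = lang;
            this.outlinks = outlinks;
        }
    }

    // Collects the outlinks of pages written in targetLang, in first-seen order.
    // Outlinks that appear only on pages in other languages are dropped.
    static Set<String> outlinksToFollow(List<Page> fetched, String targetLang) {
        Set<String> follow = new LinkedHashSet<>();
        for (Page page : fetched) {
            if (targetLang.equals(page.lang)) {
                follow.addAll(page.outlinks);
            }
        }
        return follow;
    }

    public static void main(String[] args) {
        List<Page> fetched = Arrays.asList(
            new Page("a", "xx", Arrays.asList("x", "y")),  // target-language page
            new Page("b", "en", Arrays.asList("y", "z"))); // other language
        // "y" is kept because page "a" points to it; "z" is only linked from "b".
        System.out.println(outlinksToFollow(fetched, "xx")); // prints [x, y]
    }
}
```

Note that the language can only be checked after a page has been fetched and parsed, so the decision applies to the page's outlinks, not to the page itself.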
-----Original message-----
> From: Alexander Aristov <[email protected]>
> Sent: Sun 01-Jul-2012 20:43
> To: [email protected]; [email protected]
> Subject: Re: Language-focused crawling
>
> Hi
>
> First of all, you understand that in order to detect a page's language the
> page must be crawled and at least sent to the parser. As you noted, the
> language-identifier filter adds a lang field and that's it.
>
> You will need to modify or write your own filter that would discard
> unwanted languages (return null).
>
> Scoring filters are something different and not suitable for the purpose.
>
> As for indexing pages referenced by desired pages, the solution might
> be to add a flag to the outlink metadata, which would then be used to pass
> the page through your filter.
>
> None of this is really difficult if you have the necessary programming
> skills and a strong desire. :)
>
>
> Best Regards
> Alexander Aristov
>
>
> On 1 July 2012 17:00, Safdar Kureishy <[email protected]> wrote:
>
> > Hi,
> >
> > I would like to do a focused web crawl using Nutch, for all pages of a
> > specific language - let's say "lang". However, the default
> > language-identifier plugin from Nutch does not support this language.
> >
> > The heuristic I'd like to use is that I want all pages pointed to by pages
> > containing "lang" content to be crawled, but pages that are pointed to by
> > non-"lang" pages should not be crawled (unless at least one "lang" page
> > points to it). It appears that I would need to create a ScoringFilter for
> > this, and exploit the distributeScoreToOutlinks() and updateDbScore()
> > methods of the filter. However, before I embark on that journey, I thought
> > I'd ask if there is already a solution to this problem of a language
> > focused crawl in any Nutch plugin library somewhere, that supports an
> > extensive list of languages?
> >
> > Thanks,
> > Safdar

