Hi,

First of all, note that in order to detect a page's language, the page must be crawled and at least passed to the parser. As you noted, the language-identifier filter adds the lang field, and that is all it does.

You will need to modify it or write your own filter that discards unwanted languages (returns null). Scoring filters are something different and are not suitable for this purpose.

As for indexing pages referenced by pages in the desired language, one solution might be to add a flag to the outlink metadata, which would then be used to pass the page through your filter.

None of this is really difficult if you have the necessary programming skills and a strong desire. :)

Best Regards
Alexander Aristov

On 1 July 2012 17:00, Safdar Kureishy <[email protected]> wrote:

> Hi,
>
> I would like to do a focused web crawl using Nutch, for all pages of a
> specific language - let's say "lang". However, the default
> language-identifier plugin from Nutch does not support this language.
>
> The heuristic I'd like to use is that I want all pages pointed to by pages
> containing "lang" content to be crawled, but pages that are pointed to by
> non-"lang" pages should not be crawled (unless at least one "lang" page
> points to it). It appears that I would need to create a ScoringFilter for
> this, and exploit the distributeScoreToOutlinks() and updateDbScore()
> methods of the filter. However, before I embark on that journey, I thought
> I'd ask if there is already a solution to this problem of a language
> focused crawl in any Nutch plugin library somewhere, that supports an
> extensive list of languages?
>
> Thanks,
> Safdar
>
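
P.S. To make the filter-plus-flag idea from the reply a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use the Nutch API: `Page`, `filter`, `outlinkMeta` and the `fromWantedLang` metadata key are all hypothetical names made up for illustration. In a real plugin the same decision would have to live in whichever Nutch filter you modify or write, and the flag would have to be carried through Nutch's own outlink metadata, so treat this only as a model of the logic.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of the idea; none of these types are Nutch classes. */
class LangFocusSketch {

    /** Hypothetical metadata key for the flag carried on outlinks. */
    static final String FROM_WANTED_LANG = "fromWantedLang";

    /** Hypothetical stand-in for a parsed page. */
    static final class Page {
        final String url;
        final String lang; // what a language identifier would add; may be null
        final Map<String, String> meta = new HashMap<>();

        Page(String url, String lang) {
            this.url = url;
            this.lang = lang;
        }
    }

    /**
     * Returns the page if it should be kept, or null to discard it.
     * A page is kept when it is in the wanted language, or when it was
     * flagged as being linked from a page in the wanted language.
     */
    static Page filter(Page page, String wantedLang) {
        boolean inWantedLang = wantedLang.equals(page.lang);
        boolean linkedFromWanted = "true".equals(page.meta.get(FROM_WANTED_LANG));
        return (inWantedLang || linkedFromWanted) ? page : null;
    }

    /**
     * Builds the metadata to attach to every outlink of the source page.
     * The flag is set only when the source page is in the wanted language.
     */
    static Map<String, String> outlinkMeta(Page source, String wantedLang) {
        Map<String, String> meta = new HashMap<>();
        if (wantedLang.equals(source.lang)) {
            meta.put(FROM_WANTED_LANG, "true");
        }
        return meta;
    }
}
```

In this model a page in another language survives only if some page in the wanted language linked to it, which matches the heuristic in the quoted question. Whether the flag should also propagate further (to the outlinks of a flagged page) is a design choice; the sketch does not propagate it.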

