A very basic facility seems to be missing in Nutch. If I have a list of 2000 URLs in the Nutch DB and want to ignore external links, I have to build a regex filter covering each of the thousands of domains I want to crawl. There is no parameter to crawl only those domains and ignore external links.

For now, is there another solution? Has anybody worked on this?

We did something similar, though not exactly the same.

We've got a list of "favored domains", and we use this to boost link scores in the FetchListTool before sorting and selecting the topN. So you could easily apply the same approach to strip out any URLs that aren't in your domain set.
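As a rough illustration of that stripping step, here is a minimal sketch of a host check against a domain set. It is not our actual code and not part of Nutch: the class name DomainSetFilter and its methods are hypothetical, and it assumes that a URL belongs to a domain if its host equals the domain or is a subdomain of it.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not a Nutch class: keeps only URLs whose host
// falls inside a known set of domains.
class DomainSetFilter {
    private final Set<String> domains;

    DomainSetFilter(Set<String> domains) {
        // Normalize once so lookups are case-insensitive.
        this.domains = domains.stream()
                .map(d -> d.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    // True if the URL's host is one of the domains or a subdomain of one.
    boolean accepts(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: treat as outside the set
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Walk up the labels: www.a.example.com, a.example.com, example.com, com.
        while (true) {
            if (domains.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1);
        }
    }

    // Drops every URL that is not in the domain set, preserving order.
    List<String> filter(List<String> urls) {
        return urls.stream().filter(this::accepts).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        DomainSetFilter f = new DomainSetFilter(Set.of("example.com"));
        System.out.println(f.filter(List.of(
                "http://www.example.com/a", "http://other.org/b")));
    }
}
```

A set lookup per host label keeps the cost independent of how many domains are in the list, which is the main advantage over one large regex with thousands of alternatives.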

Another approach that I haven't tried would be to set the external link weight (db.score.link.external) to 0. So any new page added by a link that's "leaving" a domain effectively gets a score of 0. Two problems I can think of are (a) if you have a link between pages from two of your target domains, this might cause problems, and (b) without mods to FetchListTool you still might wind up fetching a page with a score of 0.
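To make problem (b) concrete, the kind of mod I have in mind is sketched below. This is only an illustration under assumptions: TopNSelector and Candidate are hypothetical names, not the real FetchListTool code, and it assumes each candidate page carries a single float score. The idea is simply to drop zero-score entries before sorting and taking the topN, so that a short list can't be padded out with pages that were only reached via external links.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the real FetchListTool: drop zero-score pages
// before sorting by score and selecting the top N.
class TopNSelector {

    // Minimal stand-in for a fetch list entry (assumed shape).
    static final class Candidate {
        final String url;
        final float score;

        Candidate(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    static List<Candidate> selectTopN(List<Candidate> candidates, int topN) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            // A score of 0 marks a page reached only through external links.
            if (c.score > 0f) {
                kept.add(c);
            }
        }
        // Highest score first.
        kept.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        int n = Math.min(Math.max(topN, 0), kept.size());
        return new ArrayList<>(kept.subList(0, n));
    }

    public static void main(String[] args) {
        List<Candidate> in = List.of(
                new Candidate("http://a.example/", 0.5f),
                new Candidate("http://external.example/", 0f),
                new Candidate("http://b.example/", 1.0f));
        for (Candidate c : selectTopN(in, 10)) {
            System.out.println(c.url + " " + c.score);
        }
    }
}
```

Note that this does nothing for problem (a): a link between two of your target domains would still be scored as external and filtered out here.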

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
