A very basic facility seems to be missing in Nutch. If I have a list
of 2000 URLs in the Nutch DB and want to ignore external links, I have
to build a regex filter covering every one of the domains I want to
crawl. There is no parameter that says "crawl only the domains already
in the DB and ignore external links".
For now, is there another solution? Has anybody worked on this?

We did something similar, though not exactly the same.
We've got a list of "favored domains", and we use this to boost link
scores in the FetchListTool before sorting and selecting the topN. So
you could easily apply the same approach to strip out any URLs that
aren't in your domain set.
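
To illustrate the "strip out any URLs that aren't in your domain set" idea, here is a minimal, standalone sketch. It is not Nutch code and not our actual FetchListTool change: the class name DomainAllowList and its methods are made up for the example, and it assumes exact host matching (so www.example.com and example.com count as different hosts).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: keeps only URLs whose host
// appears in the set of hosts taken from the seed list.
final class DomainAllowList {
    private final Set<String> allowedHosts = new HashSet<>();

    // Build the allowed-host set from the seed URLs (e.g. the 2000 injected URLs).
    DomainAllowList(List<String> seedUrls) {
        for (String url : seedUrls) {
            String host = hostOf(url);
            if (host != null) {
                allowedHosts.add(host);
            }
        }
    }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // True only when the URL's host is one of the seed hosts.
    boolean accepts(String url) {
        String host = hostOf(url);
        return host != null && allowedHosts.contains(host);
    }

    // Drops every candidate whose host is outside the allowed set.
    List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String url : candidates) {
            if (accepts(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

A set lookup per URL avoids maintaining one huge regex; the same check could sit wherever candidate URLs are selected for fetching.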
Another approach, which I haven't tried, would be to set the external
link weight (db.score.link.external) to 0, so that any new page added
by a link that's "leaving" a domain effectively gets a score of 0. Two
problems I can think of: (a) a link between pages from two of your
target domains would also count as external, so this might cause
problems, and (b) without mods to FetchListTool you still might wind
up fetching a page with a score of 0.
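
Problem (b) could in principle be handled by dropping zero-score entries before the topN cut. The sketch below only shows that selection step in isolation; TopNSelector and Candidate are hypothetical names, not FetchListTool internals, and it assumes each candidate already carries its link score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical selection step, not Nutch code: discard candidates whose
// score is 0 (or less), then keep the topN highest-scoring ones.
final class TopNSelector {
    static final class Candidate {
        final String url;
        final float score;

        Candidate(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    static List<Candidate> selectTopN(List<Candidate> candidates, int topN) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            // A score of 0 is assumed to mean "reached only via an external link".
            if (c.score > 0f) {
                kept.add(c);
            }
        }
        // Highest score first.
        kept.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        int limit = Math.max(0, Math.min(topN, kept.size()));
        return new ArrayList<>(kept.subList(0, limit));
    }
}
```

This does nothing for problem (a): a page on one target domain that is linked only from another target domain would still be dropped here.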
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200