Re: Ignore external links from crawled domains

Ken Krugler Mon, 08 Aug 2005 09:07:37 -0700

A very basic facility seems to be missing in Nutch. If I have a list of 2000 URLs in the Nutch DB and want to ignore external links, I have to build a regex filter with thousands of different domains I want to crawl. There is no parameter to crawl only the listed domains and ignore external links.
Is there currently another solution? Has anybody worked on this?


We did something similar, though not exactly the same.

We've got a list of "favored domains", and we use this to boost link scores in the FetchListTool before sorting and selecting the topN. So you could easily apply the same approach to strip out any URLs that aren't in your domain set.
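
To make the "strip out any URLs that aren't in your domain set" idea concrete, here is a minimal sketch in plain Java. It is not the FetchListTool code and uses no Nutch classes; DomainSetFilter, hostOf, isAllowed and keepInternal are hypothetical names for illustration. It assumes the candidate URLs are available as strings before the topN selection and that subdomains of a listed domain should count as internal.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: keeps only URLs whose host
// belongs to a set of allowed domains.
class DomainSetFilter {

    // Returns the lowercased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // True if the host equals an allowed domain or is a subdomain of one.
    // Labels are stripped from the left, so "notexample.com" does not
    // match "example.com".
    static boolean isAllowed(String url, Set<String> allowedDomains) {
        String host = hostOf(url);
        while (host != null) {
            if (allowedDomains.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            host = dot < 0 ? null : host.substring(dot + 1);
        }
        return false;
    }

    // Drops every candidate that is not in the domain set, preserving order.
    static List<String> keepInternal(List<String> candidates, Set<String> allowedDomains) {
        List<String> kept = new ArrayList<>();
        for (String url : candidates) {
            if (isAllowed(url, allowedDomains)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

A hash-set lookup per host label stays cheap even with thousands of domains, which is the main advantage over one very large regex.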

Another approach that I haven't tried would be to set the external link weight (db.score.link.external) to 0. So any new page added by a link that's "leaving" a domain effectively gets a score of 0. Two problems I can think of are (a) if you have a link between pages from two of your target domains, this might cause problems, and (b) without mods to FetchListTool you still might wind up fetching a page with a score of 0.
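
As an untested sketch of that second approach, the override might look like the property below. Only the property name comes from the text above; the file it goes in (assumed here to be your site-specific override file, e.g. nutch-site.xml) and the surrounding root element depend on your Nutch version.

```xml
<!-- Assumption: placed inside the root element of the site override file. -->
<property>
  <name>db.score.link.external</name>
  <value>0.0</value>
</property>
```

Both caveats above still apply with this setting.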


-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
