When crawling multiple websites, how can I make sure that Nutch's generate step adds URLs from multiple hosts to the fetch list, rather than URLs from a single host?
For example, let's say we have 3 websites with the following numbers of discovered pages:

a.com 100
b.com 100
c.com 100

When I generate with topN 30, I'd like the 30 URLs to be spread proportionally across the sites, so that it takes:

10 from a.com
10 from b.com
10 from c.com

rather than all 30 from a.com. The latter happens when the pages from a.com have better scores.

The harm is that if the fetch list contains only a.com pages, the crawl has lower throughput: because there is a delay between successive requests to the same host, it takes longer to fetch 30 pages from a.com alone than 10 pages from each of the three hosts.

Regards,
Diaa
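
P.S. To make the selection I have in mind concrete, here is a rough sketch. It is plain Python, not Nutch code: the function name select_top_n, the max_per_host parameter, and the toy scores are all made up for illustration, and I am not claiming this is how the Nutch generator works internally.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_top_n(scored_urls, top_n, max_per_host):
    """Pick up to top_n URLs by descending score, at most max_per_host per host.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    per_host = defaultdict(int)
    selected = []
    # Visit candidates from the highest score to the lowest.
    for url, score in sorted(scored_urls, key=lambda pair: pair[1], reverse=True):
        host = urlparse(url).netloc
        if per_host[host] >= max_per_host:
            continue  # this host already has its share of the fetch list
        selected.append(url)
        per_host[host] += 1
        if len(selected) == top_n:
            break
    return selected


# Toy data mirroring the example: 100 pages per host, a.com scoring highest.
pages = [
    (f"http://{host}/page{i}", base + i / 1000.0)
    for host, base in (("a.com", 3.0), ("b.com", 2.0), ("c.com", 1.0))
    for i in range(100)
]

picked = select_top_n(pages, top_n=30, max_per_host=10)
```

Without the per-host cap, the same data would give 30 URLs from a.com only, which is the behaviour I would like to avoid.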

