When crawling multiple websites, how can I make Nutch generate a fetch
list that includes URLs from multiple hosts rather than from a single host?

For example:
Let's say we have 3 websites with the following numbers of discovered pages:
a.com 100
b.com 100
c.com 100

When I generate with topN 30, I'd like these 30 URLs to be spread
proportionally across the hosts, so that it would take:
10 from a.com
10 from b.com
10 from c.com

rather than taking all 30 from just a.com.
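
To make the selection I have in mind concrete, here is a rough sketch in
Python. It is not Nutch code; the function name pick_round_robin and the
(host, url, score) tuples are made up for illustration. It takes turns
across hosts and picks the best-scored page of each host first:

```python
from collections import Counter, defaultdict
from itertools import zip_longest


def pick_round_robin(candidates, top_n):
    """Pick up to top_n URLs, taking turns across hosts.

    candidates is a list of (host, url, score) tuples (hypothetical shape,
    not a Nutch data structure). Within a host, higher scores go first.
    """
    by_host = defaultdict(list)
    for host, url, score in candidates:
        by_host[host].append((score, url))

    # One queue per host, best score first; hosts in alphabetical order.
    queues = [sorted(pages, reverse=True) for _, pages in sorted(by_host.items())]

    picked = []
    # Each round takes at most one page from every host that still has pages.
    for round_items in zip_longest(*queues):
        for item in round_items:
            if item is not None and len(picked) < top_n:
                picked.append(item[1])
    return picked


# 100 discovered pages per host; a.com pages have the better score.
candidates = [
    (host, f"http://{host}/page{i}", score)
    for host, score in [("a.com", 1.0), ("b.com", 0.5), ("c.com", 0.5)]
    for i in range(100)
]
fetch_list = pick_round_robin(candidates, 30)
counts = Counter(url.split("/")[2] for url in fetch_list)
print(counts)  # each of the three hosts appears 10 times
```

A purely score-ordered selection on the same input would return 30 pages
from a.com, which is the behaviour I am trying to avoid.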

This happens when the pages from a.com have a better score.
The harm is that if the fetch list contains only a.com pages, the crawl
has lower throughput: because there is a delay between successive requests
to the same host, it takes longer to retrieve 30 pages from a.com alone
than to retrieve 10 from each of the three hosts.
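
As a back-of-the-envelope illustration of the throughput difference, here
is a small Python sketch. The 5 second per-host delay is only an assumed
value, the helper estimated_fetch_seconds is hypothetical, and the estimate
ignores download time and assumes different hosts are fetched in parallel:

```python
def estimated_fetch_seconds(pages_per_host, delay_seconds):
    """Rough fetch time when hosts are fetched in parallel.

    Only the per-host politeness delay is counted, so the busiest host
    determines the total time. This is a simplification, not a model of
    how the Nutch fetcher actually schedules requests.
    """
    return max(pages_per_host.values(), default=0) * delay_seconds


DELAY = 5.0  # assumed delay between requests to the same host, in seconds

single_host = estimated_fetch_seconds({"a.com": 30}, DELAY)
three_hosts = estimated_fetch_seconds({"a.com": 10, "b.com": 10, "c.com": 10}, DELAY)

print(single_host)  # 150.0
print(three_hosts)  # 50.0
```

Under these assumptions the same 30 pages take roughly three times as long
when they all come from one host.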

Regards,
Diaa
