Hi Diaa,

Good question, but at the moment that is not possible. When you use the topN parameter, Nutch works from a list ordered by score, so the highest-scoring URLs are selected regardless of host. If you want to take the same number of pages from each host, you can use a different webpage table for each host. Or, if you are willing to develop this feature for Nutch, I can help you.
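
To illustrate the difference, here is a minimal Python sketch. It is not Nutch code: the function `select_top_n`, the `max_per_host` parameter, and the sample scores are all made up for the example. It compares picking purely by score with picking by score under a per-host cap, using the three hosts from your example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_top_n(candidates, top_n, max_per_host=None):
    """Pick up to top_n URLs in descending score order.

    candidates: iterable of (url, score) pairs.
    max_per_host: hypothetical per-host cap; None means no cap.
    """
    per_host = defaultdict(int)
    selected = []
    for url, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if len(selected) >= top_n:
            break
        host = urlparse(url).netloc
        if max_per_host is not None and per_host[host] >= max_per_host:
            continue  # this host already has its share, skip the URL
        per_host[host] += 1
        selected.append(url)
    return selected


def hosts_of(urls):
    """Count how many of the given URLs belong to each host."""
    counts = defaultdict(int)
    for url in urls:
        counts[urlparse(url).netloc] += 1
    return dict(counts)


# Made-up data: 100 pages per host, with a.com scored highest.
pages = []
for host, base in (("a.com", 3.0), ("b.com", 2.0), ("c.com", 1.0)):
    for i in range(100):
        pages.append((f"http://{host}/page{i}", base - i * 0.001))

by_score = select_top_n(pages, 30)                  # score only
capped = select_top_n(pages, 30, max_per_host=10)   # score plus per-host cap
```

With this data, `hosts_of(by_score)` contains only a.com, while `hosts_of(capped)` has 10 URLs each from a.com, b.com and c.com. The cap is applied while walking the score-ordered list, so within each host the best-scored pages are still taken first.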

Talat

2014-05-11 18:00 GMT+03:00 Diaa Abdallah <[email protected]>:

> When crawling multiple websites how would I enforce that Nutch generates
> and adds to the fetch list multiple hosts rather than the same host.
>
> For example:
> Let's say we have 3 websites with the following discovered pages:
> a.com 100
> b.com 100
> c.com 100
>
> When I generate topN 30 I'd like to make sure that these 30 are
> proportional from each page so that it would take:
> 10 from a.com
> 10 from b.com
> 10 from c.com
>
> Rather than take 30 from just a.com
>
> This happens when the webpages from a.com have a better score.
> The harm here lies in that if only a.com generates the pages the crawl
> would have less throughput since it takes longer to retrieve 30 from just a
> rather than 10 from each, since there is a delay for each time the host is
> crawled.
>
> Regards,
> Diaa

-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

