Hi,

There are two parameters that exist exactly for this purpose: generate.max.count and generate.count.mode. Look at nutch-default.xml for a description of each. They are available in both versions of Nutch and allow you to set a maximum number of URLs from the same host/domain/IP in a fetchlist.
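
For the a.com / b.com / c.com example quoted below, a minimal sketch of the override could look like this. It assumes the properties go inside the <configuration> element of conf/nutch-site.xml, and the value 10 is only illustrative (topN 30 spread over 3 hosts); check the descriptions in nutch-default.xml for the values your version accepts.

```xml
<!-- conf/nutch-site.xml fragment (place inside <configuration>).
     Values are illustrative assumptions, not recommendations. -->
<property>
  <name>generate.max.count</name>
  <!-- at most 10 URLs per host (or domain) in a single fetchlist -->
  <value>10</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- how URLs are grouped when counting against generate.max.count -->
  <value>host</value>
</property>
```

With a cap like this, a generate with topN 30 should no longer be able to fill the whole fetchlist from a.com alone, even if its pages score higher.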

Julien

On 11 May 2014 23:22, Talat Uyarer <[email protected]> wrote:

> Hi Diaa,
>
> Good question, but right now that is impossible. When you use the topN
> parameter, Nutch pays attention to the list ordered by score. If you want
> to take the same number for each host, you can use a different webpage
> table. Or, if you are willing to develop this feature for Nutch, I can
> help you.
>
> Talat
>
> 2014-05-11 18:00 GMT+03:00 Diaa Abdallah <[email protected]>:
> > When crawling multiple websites, how would I enforce that Nutch generates
> > and adds to the fetch list multiple hosts rather than the same host?
> >
> > For example:
> > Let's say we have 3 websites with the following discovered pages:
> > a.com 100
> > b.com 100
> > c.com 100
> >
> > When I generate topN 30 I'd like to make sure that these 30 are
> > taken proportionally from each site, so that it would take:
> > 10 from a.com
> > 10 from b.com
> > 10 from c.com
> >
> > Rather than take 30 from just a.com
> >
> > This happens when the webpages from a.com have a better score.
> > The harm here is that if only a.com contributes pages, the crawl
> > would have less throughput, since it takes longer to retrieve 30 from
> > just a.com than 10 from each, because there is a delay each time the
> > host is crawled.
> >
> > Regards,
> > Diaa
>
> --
> Talat UYARER
> Websitesi: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

