Hi Talat,

I think it's worth doing, since it would boost crawl throughput significantly. I will see what I can do and will open a Jira issue once I have something.
Thanks,
Diaa

On Mon, May 12, 2014 at 12:22 AM, Talat Uyarer <[email protected]> wrote:

> Hi Diaa,
>
> Good question, but right now that is not possible. When you use the topN
> parameter, Nutch works from a list that is ordered by score. If you want to
> take the same number from each host, you can use a different webpage table.
> Or, if you are willing to develop this feature for Nutch, I can help you.
>
> Talat
>
> 2014-05-11 18:00 GMT+03:00 Diaa Abdallah <[email protected]>:
> > When crawling multiple websites, how would I enforce that Nutch generates
> > and adds to the fetch list multiple hosts rather than the same host?
> >
> > For example:
> > Let's say we have 3 websites with the following discovered pages:
> > a.com 100
> > b.com 100
> > c.com 100
> >
> > When I generate with topN 30, I'd like to make sure that these 30 are
> > taken proportionally from each host, so that it would take:
> > 10 from a.com
> > 10 from b.com
> > 10 from c.com
> >
> > Rather than take 30 from just a.com.
> >
> > This happens when the webpages from a.com have a better score.
> > The harm is that if only a.com contributes pages, the crawl has lower
> > throughput: it takes longer to retrieve 30 from just a.com than 10 from
> > each host, since there is a delay each time the same host is crawled.
> >
> > Regards,
> > Diaa
>
> --
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
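
For reference, the per-host selection described in the quoted message could be sketched roughly as below. This is only an illustration of the idea in plain Python, not Nutch code: the function name `balanced_top_n` and the `(url, score)` input format are assumptions made up for this example. It groups URLs by host, sorts each host's URLs by score, and then takes one URL per host in turn until topN is reached.

```python
from collections import defaultdict
from urllib.parse import urlparse


def balanced_top_n(scored_urls, top_n):
    """Pick up to top_n URLs, cycling over hosts so no single host dominates.

    scored_urls is an iterable of (url, score) pairs (a hypothetical input
    format chosen for this sketch).
    """
    by_host = defaultdict(list)
    for url, score in scored_urls:
        by_host[urlparse(url).netloc].append((score, url))

    # One queue per host, best-scoring URLs first; hosts visited in name order.
    queues = [sorted(items, reverse=True) for _, items in sorted(by_host.items())]

    selected = []
    rank = 0
    # Round-robin: take the rank-th best URL from every host, then rank + 1, ...
    while len(selected) < top_n and any(rank < len(q) for q in queues):
        for q in queues:
            if rank < len(q) and len(selected) < top_n:
                selected.append(q[rank][1])
        rank += 1
    return selected
```

With the a.com / b.com / c.com example from the thread (100 pages each, a.com scoring highest), asking for 30 URLs this way yields 10 per host instead of 30 from a.com. Hosts with fewer pages than their share simply run out, and the remaining slots go to the other hosts.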

