You could set generate.max.count to limit the number of URLs allowed per host or domain in each fetchlist. The default is -1 (unlimited).
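As a rough sketch, an override of generate.max.count (its default definition is quoted just below) could go inside the <configuration> element of conf/nutch-site.xml. The cap of 100 is an arbitrary illustration, not a recommendation. I am also assuming that the counting mode mentioned in the description is set through generate.count.mode, which takes "host" or "domain".

```xml
<!-- conf/nutch-site.xml (sketch; the values are assumptions, tune them for your crawl) -->
<property>
  <name>generate.max.count</name>
  <!-- at most 100 URLs per host in each fetchlist; 100 is an arbitrary example -->
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- count per "host" or per "domain" -->
  <value>host</value>
</property>
```

Because the limit applies per fetchlist, each generate/fetch round should then take at most 100 URLs from any one host, so a site with many outlinks would be less likely to crowd out the others.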
<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

On Tue, Apr 29, 2014 at 11:14 AM, S.L <[email protected]> wrote:

> Hi All,
>
> I am crawling multiple big websites for which I have the homepage as the
> URL in the seed file. The problem I am facing is that one of the websites
> is getting crawled at a faster pace than the rest of the websites and as a
> result the indexed data contains a disproportionate number of entries for
> this one website.
>
> I suspect that this is happening because this website in question has
> homepage with the maximum number of outlinks.
>
> My question is how can I control the behaviour of Nutch so as to crawl
> every host/domain in a balanced way.
>
> I am using Nutch 1.7
>
> Thanks.
>

--
Don't Grow Old, Grow Up... :-)

