Thanks, Feng, but that is not what we want. Do you mean there is no mechanism by which we can set a per-host limit on the URLs fetched at each level and put the rest in a queue, so that all hosts are equally represented while the index is being built up?

On Wed, Apr 30, 2014 at 1:26 AM, feng lu <[email protected]> wrote:

> yes, that's right.
>
> On Tue, Apr 29, 2014 at 10:53 PM, S.L <[email protected]> wrote:
>
> > Thanks,will this skip any URLs at each level/fetch if a particular host
> > has more than the value we set it to ?
> >
> > On Tue, Apr 29, 2014 at 10:48 AM, feng lu <[email protected]> wrote:
> >
> > > Maybe you can set this property to limit the count of allowed URLs per
> > > host / domain. default is -1.
> > >
> > > <property>
> > >   <name>generate.max.count</name>
> > >   <value>-1</value>
> > >   <description>The maximum number of urls in a single
> > >   fetchlist. -1 if unlimited. The urls are counted according
> > >   to the value of the parameter generator.count.mode.
> > >   </description>
> > > </property>
> > >
> > > On Tue, Apr 29, 2014 at 11:14 AM, S.L <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am crawling multiple big websites for which I have the homepage as
> > > > the URL in the seed file. The problem I am facing is that one of the
> > > > websites is getting crawled at a faster pace than the rest of the
> > > > websites and as a result the indexed data contains a disproportionate
> > > > number of entries for this one website.
> > > >
> > > > I suspect that this is happening because this website in question has
> > > > homepage with the maximum number of outlinks.
> > > >
> > > > My questions is how can I control the behaviour of Nutch so as to
> > > > crawl every host/domain in a balanced way.
> > > >
> > > > I am using Nutch 1.7
> > > >
> > > > Thanks.
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
>
> --
> Don't Grow Old, Grow Up... :-)
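
For reference, a minimal sketch of how the cap quoted above might be set in conf/nutch-site.xml. The value 100 is an arbitrary illustration, not a recommendation, and generate.count.mode is set to host on the assumption that per-host balancing is the goal. As far as I understand the Generator, URLs beyond the cap are only left out of the current fetchlist; they stay in the CrawlDb and remain eligible in later generate rounds, which may already be close to the queue-like behaviour asked about. Worth verifying on a small crawl before relying on it.

```xml
<!-- Sketch only: overrides for conf/nutch-site.xml -->
<property>
  <name>generate.max.count</name>
  <!-- Assumed example value: at most 100 URLs per host in each fetchlist -->
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- Count URLs per host; "domain" is the other documented option -->
  <value>host</value>
</property>
```

With settings like these, each generate/fetch round would take at most 100 URLs from any single host, so a host with many outlinks should no longer dominate a round.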

