You could set this property to limit the number of URLs allowed per host
or domain in each fetch list. The default is -1 (unlimited).

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generate.count.mode.
  </description>
</property>
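
For example, a minimal sketch of an override you could put in
conf/nutch-site.xml. The cap of 100 is only an illustrative value that I
picked, not a recommendation, so tune it for your crawl. The second
property, generate.count.mode, selects whether the cap is counted per
host or per domain.

<!-- Assumption: 100 is an arbitrary example cap, not a tested setting. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>

<!-- Count the cap per host; use "domain" to count per domain instead. -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>

With a cap like this, each generated fetch list should contain at most
that many URLs from any single host, so the site with the most outlinks
cannot fill a whole segment on its own.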



On Tue, Apr 29, 2014 at 11:14 AM, S.L <[email protected]> wrote:

> Hi All,
>
> I am crawling multiple big websites for which I have the homepage as the
> URL in the seed file. The problem I am facing is that one of the websites
> is getting crawled at a faster pace than the rest of the websites and as a
> result the indexed data contains a disproportionate number of entries for
> this one website.
>
> I suspect that this is happening because the website in question has the
> homepage with the largest number of outlinks.
>
> My question is: how can I control the behaviour of Nutch so that it crawls
> every host/domain in a balanced way?
>
> I am using Nutch 1.7
>
> Thanks.
>



-- 
Don't Grow Old, Grow Up... :-)
