Hi Ken,

Our crawler is allowed to hit those hosts frequently at night, so we won't
get penalized ;-)

Could you imagine running Nutch in this case with about 400 threads, 1
thread per host, and a delay of 1.0?

I tried it that way but experienced some really long idle times... My idea
was one thread per host, which would mean that adding another host requires
adding an additional thread.
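For reference, the one-thread-per-host setup I have in mind would look something like this (a sketch, using the same nutch 0.9 property names as the config quoted below):

```xml
<!-- Hypothetical nutch-site.xml overrides: 400 fetch threads total,
     capped at 1 concurrent thread per host, 1.0 s delay between
     requests to the same host. -->
<property>
        <name>fetcher.threads.fetch</name>
        <value>400</value>
</property>

<property>
        <name>fetcher.threads.per.host</name>
        <value>1</value>
</property>

<property>
        <name>fetcher.server.delay</name>
        <value>1.0</value>
</property>
```

Back of the envelope: with ~600 pages per host and a 1.0 s delay between requests, each host takes roughly 10 minutes, and with all 400 hosts fetched in parallel the whole crawl should, in theory, finish in about that time - assuming the idle-time problem can be solved.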

Regards

Hannes

On Thu, Nov 18, 2010 at 3:36 PM, Ken Krugler <[email protected]> wrote:

> If you're hitting each host with 45 threads, you better be on really good
> terms with those webmasters :)
>
> With 90 total threads, that means as few as 2 hosts are active at any time,
> yes?
>
> -- Ken
>
>
>
> On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:
>
>  Hi,
>> I'm using nutch 0.9 to crawl about 400 hosts with an average of 600 pages
>> each. That makes a volume of 240,000 fetched pages - I want to get all of
>> them.
>>
>> Can anyone advise me on the right threads/delay/per-host configuration in
>> this environment?
>>
>> My current conf:
>>
>> <property>
>>       <name>fetcher.server.delay</name>
>>       <value>1.0</value>
>> </property>
>>
>> <property>
>>       <name>fetcher.threads.fetch</name>
>>       <value>90</value>
>> </property>
>>
>> <property>
>>       <name>fetcher.threads.per.host</name>
>>       <value>45</value>
>> </property>
>>
>> <property>
>>     <name>fetcher.threads.per.host.by.ip</name>
>>     <value>false</value>
>> </property>
>>
>> The total runtime is about 5 hours.
>>
>> How can performance be improved? (I still have enough CPU and bandwidth.)
>>
>> Note: This runs on a single machine, distribution to other machines is not
>> planned.
>>
>> Thanks and Regards
>>
>> Hannes
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
