Thank you for sharing your experiences!

In my case the web servers are pretty stable and we are allowed to perform
intensive crawling, which makes it easy to increase the threads per host.

IMHO the fetch process isn't really the bottleneck. It is the step between
fetches, when the crawldb is merged and updated.

We are running on 16-core hardware. During the fetch process, CPU usage is
around 1000%, but between fetches it is always around 90-100% on a single
core.
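
For reference, a minimal nutch-site.xml sketch of such a per-host override
(the value here is illustrative, not our exact setting):

<property>
       <name>fetcher.threads.per.host</name>
       <!-- illustrative value: raise only for hosts that tolerate intensive crawling -->
       <value>45</value>
</property>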

On Sat, Nov 20, 2010 at 11:33 AM, Ye T Thet <[email protected]> wrote:

> Hannes,
>
> I guess it would depend on the situation:
> - your server specs (where the crawler is running) and
> - the host specs
>
> Anyway, I have been crawling around 50 hosts. I tweaked a few settings to
> get it right for my situation.
>
> Currently I am using 500 threads, with 10 threads per host.
>
> In my opinion, the total number of crawler threads does not matter much,
> because the crawler does not take much memory or CPU. As long as your
> server's network bandwidth can handle it, it should be fine.
>
> In my case, the number of threads per host matters, because some of my
> servers cannot handle that much bandwidth.
>
> Not sure if it helps, but I had to adjust fetcher.server.delay,
> fetcher.server.min.delay and fetcher.max.crawl.delay, because my hosts
> sometimes cannot handle that many threads.
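>
> For reference, a minimal sketch of those overrides in nutch-site.xml (the
> values here are illustrative, not my exact settings):
>
> <property>
>        <name>fetcher.server.delay</name>
>        <!-- illustrative: seconds to wait between requests to the same host -->
>        <value>2.0</value>
> </property>
>
> <property>
>        <name>fetcher.server.min.delay</name>
>        <!-- illustrative: minimum delay applied when fetcher.threads.per.host is greater than 1 -->
>        <value>0.5</value>
> </property>
>
> <property>
>        <name>fetcher.max.crawl.delay</name>
>        <!-- illustrative: pages whose robots.txt Crawl-Delay exceeds this many seconds are skipped -->
>        <value>30</value>
> </property>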
>
>
> Warm Regards,
>
> Y.T. Thet
>
> On Thu, Nov 18, 2010 at 11:06 PM, Hannes Carl Meyer <[email protected]> wrote:
>
>> Hi Ken,
>>
>> our crawler is allowed to hit those hosts frequently at night, so we are
>> not getting a penalty ;-)
>>
>> Could you imagine running Nutch in this case with about 400 threads, 1
>> thread per host and a delay of 1.0?
>>
>> I tried it that way but experienced some really long idle times... My idea
>> was one thread per host, which would mean that adding another host requires
>> adding an additional thread.
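>>
>> In nutch-site.xml terms, the setup I have in mind would look roughly like
>> this (a sketch of the values above, not a tested configuration):
>>
>> <property>
>>        <name>fetcher.threads.fetch</name>
>>        <value>400</value>
>> </property>
>>
>> <property>
>>        <name>fetcher.threads.per.host</name>
>>        <value>1</value>
>> </property>
>>
>> <property>
>>        <name>fetcher.server.delay</name>
>>        <value>1.0</value>
>> </property>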
>>
>> Regards
>>
>> Hannes
>>
>> On Thu, Nov 18, 2010 at 3:36 PM, Ken Krugler <[email protected]> wrote:
>>
>> > If you're hitting each host with 45 threads, you'd better be on really
>> > good terms with those webmasters :)
>> >
>> > With 90 total threads, that means as few as 2 hosts are active at any
>> > time, yes?
>> >
>> > -- Ken
>> >
>> >
>> >
>> > On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600
>> >> pages each. That makes a volume of 240,000 fetched pages - I want to
>> >> get all of them.
>> >>
>> >> Can anyone give me advice on the right threads/delay/per-host
>> >> configuration in this environment?
>> >>
>> >> My current conf:
>> >>
>> >> <property>
>> >>       <name>fetcher.server.delay</name>
>> >>       <value>1.0</value>
>> >> </property>
>> >>
>> >> <property>
>> >>       <name>fetcher.threads.fetch</name>
>> >>       <value>90</value>
>> >> </property>
>> >>
>> >> <property>
>> >>       <name>fetcher.threads.per.host</name>
>> >>       <value>45</value>
>> >> </property>
>> >>
>> >> <property>
>> >>       <name>fetcher.threads.per.host.by.ip</name>
>> >>       <value>false</value>
>> >> </property>
>> >>
>> >> The total runtime is about 5 hours.
>> >>
>> >> How can performance be improved? (I still have enough CPU and bandwidth.)
>> >>
>> >> Note: this runs on a single machine; distributing to other machines is
>> >> not planned.
>> >>
>> >> Thanks and Regards
>> >>
>> >> Hannes
>> >>
>> >
>> > --------------------------
>> > Ken Krugler
>> > +1 530-210-6378
>> > http://bixolabs.com
>> > e l a s t i c   w e b   m i n i n g
