Hi Hannes,

I'm curious as to whether you got this configuration running, any issues you ran into, and what performance you saw.

Thanks,

-- Ken


On Nov 20, 2010, at 10:52am, Hannes Carl Meyer wrote:

Ken, thanks, I guess that's a good hint!

I'm using the simple org.apache.nutch.crawl.Crawl entry point to perform the
crawl - I guess the configuration of the MapReduce job is then pretty minimal.

@Andrzej could you give me a hint where to configure the number of reduce
tasks in nutch 0.9? (running on a single machine)
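
For reference, in Hadoop 0.x-era setups the reducer count is typically controlled by the mapred.reduce.tasks property, e.g. in conf/hadoop-site.xml (a sketch; the value shown is just an example, not a recommendation):

```xml
<!-- conf/hadoop-site.xml: default number of reduce tasks per job
     (example value; tune to the number of cores available) -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```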

Regards,

Hannes

On Sat, Nov 20, 2010 at 7:06 PM, Ken Krugler <[email protected]> wrote:


On Nov 20, 2010, at 7:51am, Hannes Carl Meyer wrote:

Thank you for sharing your experiences!

In my case the web servers are pretty stable and we are allowed to perform intensive crawling, which makes it easy to increase the threads per host.

IMHO the fetch process isn't really the bottleneck. It is the step between
fetches, when the crawldb is merged and updated.

We are using 16-core hardware. During the fetch process CPU usage is around 1000%, but between fetches it is always around 90-100% on a
single core.


In regular Hadoop map-reduce jobs you get this situation if the job has been configured to use a single reducer, and thus only one core is active.

Though it would surprise me if the crawldb update job was configured this way, as I don't see a reason why the crawldb has to be a single file in
HDFS.

Andrzej and others would know best, of course.

-- Ken




On Sat, Nov 20, 2010 at 11:33 AM, Ye T Thet <[email protected]> wrote:

Hannes,

I guess it would depend on your situation:
- your server specs (where the crawler is running) and
- the hosts' specs

Anyway, I have been crawling around 50 hosts. I tweaked a few settings to get it
right for my situation.

Currently I am using 500 threads, and 10 threads per host.

In my opinion, the total number of crawler threads does not matter much,
because the crawler does not take many resources (memory and CPU). As long as your
server's network bandwidth can handle it, it should be fine.

In my case, the number of threads per host matters, because some of my servers
cannot handle that much bandwidth.

Not sure if it helps, but I had to adjust fetcher.server.delay,
fetcher.server.min.delay and fetcher.max.crawl.delay, because my hosts
sometimes cannot handle that many threads.
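
For anyone tuning the same knobs, these are set in nutch-site.xml. A sketch with example values only (the right numbers depend entirely on what your hosts tolerate):

```xml
<!-- nutch-site.xml: per-host politeness settings (example values only) -->
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value><!-- seconds between successive requests to one host -->
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.5</value><!-- minimum delay when several threads share one host -->
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value><!-- skip pages whose robots.txt Crawl-Delay exceeds this -->
</property>
```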


Warm Regards,

Y.T. Thet




On Thu, Nov 18, 2010 at 11:06 PM, Hannes Carl Meyer <[email protected]> wrote:

Hi Ken,

our crawler is allowed to hit those hosts frequently at night, so we
are not getting a penalty ;-)

Could you imagine running Nutch in this case with about 400 threads, with
one thread per host and a delay of 1.0?

I tried it that way but experienced some really long idle times... My idea
was one thread per host. That would mean adding another host requires
adding an additional thread.
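
A quick back-of-envelope estimate for this one-thread-per-host setup (a sketch; the average per-page download time is an assumption, not a measured figure):

```python
# Rough fetch-time bound for the setup discussed above: 400 hosts,
# one dedicated thread per host, fetcher.server.delay of 1.0 second.
hosts = 400
pages_per_host = 600
delay_s = 1.0        # politeness delay between requests to one host
avg_fetch_s = 0.5    # assumed average per-page download time (illustrative)

# With one thread per host, all hosts proceed in parallel, so wall time
# is bounded below by the time a single host's queue takes to drain.
per_host_s = pages_per_host * (delay_s + avg_fetch_s)
total_hours = per_host_s / 3600
print(per_host_s, round(total_hours, 2))  # prints: 900.0 0.25
```

If a run takes far longer than this bound, the extra time is coming from somewhere other than raw fetching, which fits the long idle times described above.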

Regards

Hannes

On Thu, Nov 18, 2010 at 3:36 PM, Ken Krugler <[email protected]> wrote:


If you're hitting each host with 45 threads, you'd better be on really good
terms with those webmasters :)

With 90 total threads, that means as few as 2 hosts are active at any time,
yes?

-- Ken



On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:

Hi,

I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600 pages
each. That makes a volume of 240,000 fetched pages - I want to get all of
them.


Can one give me an advice on the right threads/delay/per-host
configuration
in this environnement?

My current conf:

<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>90</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>45</value>
</property>

<property>
  <name>fetcher.threads.per.host.by.ip</name>
  <value>false</value>
</property>
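
For reference, the thread count can also be passed on the command line to the Nutch 0.9 one-shot Crawl entry point, overriding fetcher.threads.fetch (the seed directory "urls" and output directory "crawl" here are example paths):

```
# one-shot crawl; -threads overrides fetcher.threads.fetch
bin/nutch crawl urls -dir crawl -depth 3 -threads 90
```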

The total runtime is about 5 hours.

How can performance be improved? (I still have enough CPU and bandwidth.)

Note: This runs on a single machine; distribution to other machines is not
planned.

Thanks and Regards

Hannes


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

