[snip]

During fetching, multiple threads are spawned, which will then use
all of your cores.
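
(For reference, the fetcher thread count is a Nutch setting. A rough
sketch of what that looks like in nutch-site.xml - the value of 50 is
just an illustrative number, not a recommendation:)

    <!-- nutch-site.xml: number of concurrent fetcher threads -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>50</value>
    </property>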

But during regular map-reduce tasks (such as the CrawlDB update), you'll
get a single map and a single reduce running sequentially.

(Actually, the LocalJobRunner will create multiple map tasks - as many
as there are input splits - but they run sequentially.)

Sorry, I was being vague in my wording. I meant one mapper and one reducer, which won't be run in parallel.

You're right that there will be N map tasks, one per split (which typically means one per HDFS block).

To get reasonable performance from one box, you'd need to set up Hadoop
to run in pseudo-distributed mode, and then run your Nutch crawl as a
regular/distributed job.

You'd also want to tweak the hadoop-site.xml settings to specify
something like 6 map slots and 6 reduce slots (leaving four cores for
the JobTracker, NameNode, TaskTracker, and DataNode).
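
As a rough sketch (old-style hadoop-site.xml property names; the
localhost ports and the slot counts of 6 are just the example values
from above), that might look something like:

    <configuration>
      <!-- pseudo-distributed: HDFS and the JobTracker both on localhost -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>

      <!-- at most 6 concurrent map tasks and 6 concurrent reduce tasks
           per TaskTracker, leaving headroom for the Hadoop daemons -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>6</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>6</value>
      </property>
    </configuration>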

But I'll confess, I've never tried to run a real job this way.

I have. Within the limits of a single machine's performance this works
reasonably well - if you have a node with 4 cores and enough RAM, then
you can easily run 4 tasks in parallel. Jobs then become limited by the
amount of disk I/O.

I'd be interested in hearing back from Hannes as to performance with a 16-core box. Based on the paper by the IRLBot team, it seems like this could scale pretty well.

They did wind up having to install a lot of disks in their crawling box. And as Andrzej mentions, disk I/O will become a bottleneck, especially for CrawlDB updates (less so for fetching or parsing).

If you have multiple drives, then you could run multiple DataNodes, and configure each one to use a separate disk.

I don't have a good sense of whether it would be worthwhile to use replication, but in the past Hadoop had some issues running with a replication factor of 1, so I'd probably set it to 2.
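
As a rough sketch of what those two settings might look like
(old-style property names; the path below is made up - each DataNode
instance would point dfs.data.dir at its own drive):

    <!-- per-DataNode hadoop-site.xml: store blocks on this instance's disk -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hdfs/data</value>
    </property>

    <!-- keep two copies of each block instead of one -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>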

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




