[snip]

During fetching, multiple threads are spawned, which will then use
all of your cores.
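
(For reference, the fetcher thread count is a Nutch setting. A rough
sketch of what that looks like in nutch-site.xml - the value of 50 is
just an illustrative number, not a recommendation:)

    <!-- nutch-site.xml: number of concurrent fetcher threads -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>50</value>
    </property>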

But during regular map-reduce tasks (such as the CrawlDB update), you'll
get a single map and a single reduce running sequentially.

(Actually, the LocalJobRunner will create multiple map tasks - as many
as there are input splits - but they run sequentially.)

Sorry, I was being vague in my wording. I meant one mapper and one reducer, which won't be run in parallel.

You're right that there will be N map tasks, one per split (which typically means one per HDFS block).

To get reasonable performance from one box, you'd need to set up Hadoop
to run in pseudo-distributed mode, and then run your Nutch crawl as a
regular/distributed job.

You'd also want to tweak the hadoop-site.xml settings to specify
something like 6 map slots and 6 reduce slots (leaving four cores for
the JobTracker, NameNode, TaskTracker, and DataNode).
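
As a rough sketch (old-style hadoop-site.xml property names; the
localhost ports and the slot counts of 6 are just the example values
from above), that might look something like:

    <configuration>
      <!-- pseudo-distributed: HDFS and the JobTracker both on localhost -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>

      <!-- at most 6 concurrent map tasks and 6 concurrent reduce tasks
           per TaskTracker, leaving headroom for the Hadoop daemons -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>6</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>6</value>
      </property>
    </configuration>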

But I'll confess, I've never tried to run a real job this way.

I have. Within the limits of a single machine's performance this works
reasonably well - if you have a node with 4 cores and enough RAM, then
you can easily run 4 tasks in parallel. Jobs then become limited by the
amount of disk I/O.

I'd be interested in hearing back from Hannes as to performance with a 16-core box. Based on the paper by the IRLBot team, it seems like this could scale pretty well.

They did wind up having to install a lot of disks in their crawling box. And as Andrzej mentions, disk I/O will become a bottleneck, especially for CrawlDB updates (less so for fetching or parsing).

If you have multiple drives, then you could run multiple DataNodes, and configure each one to use a separate disk.

I don't have a good sense of whether it would be worthwhile to use replication, but in the past Hadoop had some issues running with a replication factor of 1, so I'd probably set it to 2.
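
As a rough sketch of what those two settings might look like
(old-style property names; the path below is made up - each DataNode
instance would point dfs.data.dir at its own drive):

    <!-- per-DataNode hadoop-site.xml: store blocks on this instance's disk -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hdfs/data</value>
    </property>

    <!-- keep two copies of each block instead of one -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>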

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




