[snip]
During fetching, multiple threads are spawned, which will use all of your
cores. But during regular map-reduce tasks (such as the CrawlDB update),
you'll get a single map and a single reduce running sequentially.
(Actually, the LocalJobRunner will create multiple map tasks - as many as
there are input splits - but they run sequentially.)
Sorry, I was being vague in my wording. I meant one mapper and one
reducer, which won't be run in parallel.
You're right that there will be N map tasks, one per split (which
typically means one per HDFS block).
To get reasonable performance from one box, you'd need to set up Hadoop
to run in pseudo-distributed mode, and then run your Nutch crawl as a
regular/distributed job. You'd also want to tweak the hadoop-site.xml
settings to specify something like 6 map slots and 6 reduce slots
(leaving four cores for the JobTracker, NameNode, TaskTracker, and
DataNode daemons).
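For reference, that would look roughly like this in hadoop-site.xml
(pre-0.20 single-config-file layout; the slot counts of 6 are just the
example numbers above, so tune them for your box):

<configuration>
  <!-- Max number of map tasks run concurrently by the (single) TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>
  <!-- Max number of reduce tasks run concurrently by the TaskTracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>6</value>
  </property>
</configuration>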
But I'll confess, I've never tried to run a real job this way.
I have. Within the limits of single-machine performance this works
reasonably well - if you have a node with 4 cores and enough RAM, then
you can easily run 4 tasks in parallel. Jobs then become limited by the
amount of disk I/O.
I'd be interested in hearing back from Hannes about performance with a
16-core box. Based on the paper by the IRLBot team, it seems like this
could scale pretty well.
They did wind up having to install a lot of disks in their crawling
box. And as Andrzej mentions, disk I/O will become a bottleneck,
especially for CrawlDB updates (less so for fetching or parsing).
If you have multiple drives, then you could run multiple DataNodes,
and configure each one to use a separate disk.
I don't have a good sense of whether it would be worthwhile to use
replication, but in the past Hadoop had some issues running with a
replication factor of 1, so I'd probably set this to 2.
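For what it's worth, the relevant HDFS settings would look something
like this in each DataNode's hadoop-site.xml (the /disk1 path is just a
placeholder, and dfs.data.dir also takes a comma-separated list if you'd
rather point a single DataNode at several drives):

<configuration>
  <!-- Block storage for this DataNode - point it at a directory on its own disk -->
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hadoop/dfs/data</value>
  </property>
  <!-- Replication factor of 2, given past problems with replication = 1 -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>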
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g