On 2010-11-20 21:02, Ken Krugler wrote:
>> @Andrzej could you give me a hint where to configure the number of reduce
>> tasks in nutch 0.9? (running on a single machine)
This is not possible in local mode. In local mode all map tasks run
sequentially, and there is always exactly one reduce. As Ken points out, you
need to run at least in pseudo-distributed mode, i.e. with a real
JobTracker/TaskTracker on a single machine.

> Sounds like you're running in local mode.
>
> During fetching, multiple threads are spawned which will then use all
> your cores.
>
> But during regular map-reduce tasks (such as the CrawlDB update), you'll
> get a single map and a single reduce running sequentially.

(Actually, the LocalJobRunner will create multiple map tasks - as many as
there are input splits - but they still run sequentially.)

> To get reasonable performance from one box, you'd need to set up Hadoop
> to run in pseudo-distributed mode, and then run your Nutch crawl as a
> regular/distributed job.
>
> And also tweak the hadoop-site.xml settings, to specify something like 6
> mappers and 6 reducers (leave four cores for JobTracker, NameNode,
> TaskTracker, DataNode).
>
> But I'll confess, I've never tried to run a real job this way.

I have. Within the limits of single-machine performance this works reasonably
well - if you have a node with 4 cores and enough RAM, you can easily run 4
tasks in parallel. Jobs then become limited by the amount of disk I/O.
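For reference, these knobs live in conf/hadoop-site.xml. A minimal sketch,
assuming the Hadoop 0.1x property names that Nutch 0.9 ships with, and a box
sized roughly as in Ken's example (adjust the values to your hardware):

  <?xml version="1.0"?>
  <configuration>
    <!-- how many map/reduce tasks one TaskTracker may run concurrently -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>
    <!-- default number of map/reduce tasks per job -->
    <property>
      <name>mapred.map.tasks</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>6</value>
    </property>
  </configuration>

Note that the per-job values only take effect once jobs are submitted to a
real JobTracker (e.g. after starting the daemons with start-all.sh in
pseudo-distributed mode); in local mode they are ignored and you always get
a single reduce.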
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com