On 2010-11-20 21:02, Ken Krugler wrote:
>> @Andrzej could you give me a hint where to configure the number of reduce
>> tasks in nutch 0.9? (running on a single machine)
This is not possible in local mode. In local mode all map tasks run
sequentially, and there is always exactly one reduce. As Ken points out, you
need to run at least in pseudo-distributed mode, i.e. with a real
JobTracker/TaskTracker on a single machine.

> Sounds like you're running in local mode.
>
> During fetching, multiple threads are spawned which will then use all
> your cores.
>
> But during regular map-reduce tasks (such as the CrawlDB update), you'll
> get a single map and a single reduce running sequentially.

(Actually, the LocalJobRunner will create multiple map tasks - as many as
there are input splits - but they still run sequentially.)

> To get reasonable performance from one box, you'd need to set up Hadoop
> to run in pseudo-distributed mode, and then run your Nutch crawl as a
> regular/distributed job.
>
> And also tweak the hadoop-site.xml settings, to specify something like 6
> mappers and 6 reducers (leave four cores for JobTracker, NameNode,
> TaskTracker, DataNode).
>
> But I'll confess, I've never tried to run a real job this way.

I have. Within the limits of single-machine performance this works reasonably
well - if you have a node with 4 cores and enough RAM, you can easily run 4
tasks in parallel. Jobs then become limited by the amount of disk I/O.
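For reference, these knobs live in conf/hadoop-site.xml. A minimal sketch,
assuming the Hadoop 0.1x property names that Nutch 0.9 ships with, and a box
sized roughly as in Ken's example (adjust the values to your hardware):

  <?xml version="1.0"?>
  <configuration>
    <!-- how many map/reduce tasks one TaskTracker may run concurrently -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>
    <!-- default number of map/reduce tasks per job -->
    <property>
      <name>mapred.map.tasks</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>6</value>
    </property>
  </configuration>

Note that the per-job values only take effect once jobs are submitted to a
real JobTracker (e.g. after starting the daemons with start-all.sh in
pseudo-distributed mode); in local mode they are ignored and you always get
a single reduce.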
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com