I'm going to give it a try and configure a pseudo-distributed environment on our testing machine (which also has 16 cores and 24 GB RAM).
I'll get back here after testing it!

On Sat, Nov 20, 2010 at 10:53 PM, Ken Krugler <[email protected]> wrote:

> [snip]
>
>>> During fetching, multiple threads are spawned which will then use all
>>> your cores.
>>>
>>> But during regular map-reduce tasks (such as the CrawlDB update), you'll
>>> get a single map and a single reduce running sequentially.
>>
>> (Actually, LocalJobTracker will create multiple map tasks - as many as
>> there are input splits - but running sequentially.)
>
> Sorry, I was being vague in my wording. I meant one mapper and one reducer,
> which won't be run in parallel.
>
> You're right that there will be N map tasks, one per split (which typically
> means one per HDFS block).
>
>>> To get reasonable performance from one box, you'd need to set up Hadoop
>>> to run in pseudo-distributed mode, and then run your Nutch crawl as a
>>> regular/distributed job.
>>>
>>> And also tweak the hadoop-site.xml settings, to specify something like 6
>>> mappers and 6 reducers (leave four cores for the JobTracker, NameNode,
>>> TaskTracker, and DataNode).
>>>
>>> But I'll confess, I've never tried to run a real job this way.
>>
>> I have. Within the limits of single-machine performance this works
>> reasonably well - if you have a node with 4 cores and enough RAM, then
>> you can easily run 4 tasks in parallel. Jobs then become limited by the
>> amount of disk IO.
>
> I'd be interested in hearing back from Hannes as to performance with a
> 16-core box. Based on the paper by the IRLBot team, it seems like this
> could scale pretty well.
>
> They did wind up having to install a lot of disks in their crawling box.
> And as Andrzej mentions, disk I/O will become a bottleneck, especially for
> CrawlDB updates (less so for fetching or parsing).
>
> If you have multiple drives, then you could run multiple DataNodes, and
> configure each one to use a separate disk.
>
> I don't have a good sense of whether it would be worthwhile to use
> replication, but in the past Hadoop had some issues running with a
> replication of 1, so I'd probably set this to 2.
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
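P.S. For anyone who wants to try the same setup, here's roughly the
hadoop-site.xml I plan to start from, based on Ken's suggestions. Treat it
as a sketch, not a tested config: the property names are the ones from the
Hadoop 0.20.x line that Nutch ships with, the /disk1 and /disk2 paths are
placeholders for whatever drives your box actually has, and the 6/6 task
limits are just Ken's numbers.

  <configuration>
    <!-- Pseudo-distributed mode: NameNode and JobTracker on localhost -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>

    <!-- Ken's suggestion: cap at 6 map + 6 reduce slots, leaving cores
         free for the JobTracker, NameNode, TaskTracker and DataNode -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>

    <!-- Spread HDFS data over separate physical disks to ease the
         disk-IO bottleneck; placeholder paths, adjust to your drives -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    </property>

    <!-- Replication of 2, as Ken suggests. Note this only takes effect
         if at least two DataNodes are running; with a single DataNode
         the blocks would just sit under-replicated, so I may have to
         fall back to 1 -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>

(Hadoop 0.20 still reads hadoop-site.xml, though the same properties can
also go into the newer core-site.xml / hdfs-site.xml / mapred-site.xml
split. I'll report back on which values actually hold up on the 16-core
box.)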

