> So I have Nutch running on a Hadoop cluster with three data nodes. The
> machines are all pretty beefy, but Nutch isn't performing any faster than
> when I was running in pseudo mode on one machine. How do I set up Nutch in
> order to take full advantage of the cluster?

Having beefy machines is not going to be very useful for the fetching step, which is IO-bound and usually takes most of the time.

How big is your crawldb? How long do the generate, parse and update steps take? Having more than one machine won't make a massive difference if your crawldb or segments are small.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

