So if I run a heap size of 512 MB, would it be a good idea to set a max of 5 map and 5 reduce tasks, or should I set more there and run a tasktracker max of 10?
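For reference, a rough back-of-the-envelope check using the ~5 GB of RAM per slave mentioned below and the rule of thumb from Andrzej's reply (the note about daemon headroom is an assumption, not something stated in the thread):

    available RAM per slave        ~5120 MB
    mapred.child.java.opts heap     512 MB (-Xmx512m)
    rule-of-thumb task slots        5120 / 512 = 10 per node

    So 5 map + 5 reduce slots accounts for all 5 GB on paper; either rely
    on tasks not all reaching their 512 MB ceiling (the overcommit case
    Andrzej mentions), or configure slightly fewer slots to leave room for
    the DataNode/TaskTracker daemons and the OS.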
________________________________
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Fri 12/10/2010 1:38 AM
To: [email protected]
Subject: Re: Cluster Design Questions

On 2010-12-10 05:45, Chris Woolum wrote:
> Hello Everyone,
>
> I have just gotten a basic Nutch/Hadoop configuration going. My
> configuration is 3 Dell 2850s, each with dual 3.6 GHz processors. The
> master node has 8 GB RAM and the two slaves have 5 GB each.
>
> The hard drive configuration on the master is dual 128 GB 15k RPM drives
> in a RAID 0 configuration.
>
> Each of the slave machines is set up with VMware ESXi and hosts 4 virtual
> Nutch crawlers, each getting 1.2 GB RAM and its own 73 GB 10,000 RPM SCSI
> drive, giving me a total of 8 slaves.
>
> I was doing research and wondering if it would be more effective to just
> run the two slave servers without the virtualization, each having 5 GB
> RAM and a larger RAID 0? I was also wondering what settings I can use to
> maximize the memory usage on the master? I am currently using rsync
> because I am still adding plugins and it makes it easier to deploy the
> plugins to all the machines, but if I need to disable it to have a
> customized configuration on the master node, that is fine.

In my opinion the virtualization layer is not necessary here, unless you
need it for some other reason (like researching the impact of
virtualization on Hadoop performance ;) ). Instead, you should simply
increase the maximum number of allowed tasks per tasktracker, and increase
the default number of map/reduce tasks (keeping in mind the recommended
formulae regarding the number of nodes).

A simplistic rule of thumb for the max tasks per tasktracker is the amount
of available RAM divided by the heap size that you set in
mapred.child.java.opts - this guarantees that all tasks will fit in RAM.
You can of course overcommit the number of tasks if, on average, tasks
don't reach their max heap size.

The master node in a Hadoop cluster doesn't need a super-powerful CPU -
after all, the heavy work is done on the slaves - but the namenode
(usually started on the master node) needs more RAM if you work with many
files. On the other hand, neither the jobtracker nor the tasktrackers need
a large heap size (the default is 1 GB IIRC, which is usually too much).

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com   Contact: info at sigram dot com
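For concreteness, a minimal mapred-site.xml sketch of the settings discussed above. The property names are the Hadoop 0.20-era ones in use when this thread was written; the values are example numbers only, not recommendations from the thread:

    <configuration>
      <!-- Per-task JVM heap; available RAM divided by this heap gives the
           rule-of-thumb slot count per tasktracker -->
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
      </property>

      <!-- Concurrent map/reduce slots per tasktracker -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>5</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>5</value>
      </property>

      <!-- Default tasks per job; scale these with the number of slave nodes -->
      <property>
        <name>mapred.map.tasks</name>
        <value>20</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>4</value>
      </property>
    </configuration>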

