Does HBase really benefit from 64 GB of RAM, since allocating too large a heap might increase GC time?
Another question: why not RAID 0, in order to aggregate disk bandwidth (and thus keep the 3x replication factor)?

On Wed, Nov 28, 2012 at 5:58 PM, Michael Segel <[email protected]> wrote:

> Sorry,
>
> I need to clarify.
>
> 4GB per physical core is a good starting point.
> So with 2 quad-core chips, that is going to be 32GB.
>
> IMHO that's a minimum. If you go with HBase, you will want more.
> (Actually you will need more.) The next logical jump would be to 48 or
> 64GB.
>
> If we start to price out memory, depending on the vendor and your
> company's procurement, there really isn't much of a price difference
> between 32, 48, or 64GB.
> Note that it also depends on the chips themselves. You also need to see
> how many memory channels exist on the motherboard. You may need to buy
> in pairs or triplets. Your hardware vendor can help you. (Also keep an
> eye on your hardware vendor. Sometimes they will give you higher-density
> chips that are going to be more expensive...) ;-)
>
> I tend to like having extra memory from the start.
> It gives you a bit more freedom and also protects you from 'fat' code.
>
> Looking at YARN... you will need more memory too.
>
> With respect to the hard drives...
>
> The best recommendation is to keep the drives as JBOD and then use 3x
> replication.
> In this case, make sure that the disk controller cards can handle JBOD.
> (Some don't support JBOD out of the box.)
>
> With respect to RAID...
>
> If you are running MapR, there is no need for RAID.
> If you are running an Apache derivative, you could use RAID 1 and then
> cut your replication to 2x. This makes it easier to manage drive
> failures. (It's not the norm, but it works...) In some clusters, they
> are using appliances like NetApp's E-Series, where the machines see the
> drives as locally attached storage, and I think the appliances
> themselves are using RAID. I haven't played with this configuration;
> however, it could make sense and it's a valid design.
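The per-core sizing rule quoted above (4GB per physical core as a floor, with a jump to 48 or 64GB for HBase) can be sketched as a quick calculation — a hypothetical helper for illustration, not anything from the thread:

```python
def min_ram_gb(physical_cores, gb_per_core=4):
    """Rule-of-thumb minimum node RAM: ~4 GB per physical core."""
    return physical_cores * gb_per_core

# Two quad-core chips = 8 physical cores.
baseline = min_ram_gb(8)          # 32 GB: the stated minimum starting point
with_hbase = max(baseline, 48)    # next logical jump for HBase: 48 (or 64) GB
```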
> HTH
>
> -Mike
>
> On Nov 28, 2012, at 10:33 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>
> > Hi Mike,
> >
> > Thanks for all those details!
> >
> > So to simplify the equation, for 16 virtual cores we need 48 to 64GB,
> > which means 3 to 4GB per core. So with quad cores, 12GB to 16GB are a
> > good start? Or did I simplify it too much?
> >
> > Regarding the hard drives: if you add more than one drive, do you
> > need to build them into RAID or similar systems? Or can Hadoop/HBase
> > be configured to use more than one drive?
> >
> > Thanks,
> >
> > JM
> >
> > 2012/11/27, Michael Segel <[email protected]>:
> >>
> >> OK... I don't know why Cloudera is so hung up on 32GB. ;-) [It's an
> >> inside joke...]
> >>
> >> So here's the problem...
> >>
> >> By default, your child processes in a map/reduce job get 512MB.
> >> The majority of the time, this gets raised to 1GB.
> >>
> >> 8 cores (dual quad cores) show up as 16 virtual processors in Linux.
> >> (Note: this is why when people talk about the number of cores, you
> >> have to specify physical cores or logical cores...)
> >>
> >> So if you were to oversubscribe and have, let's say, 12 mappers and
> >> 12 reducers, that's 24 slots, which means that you would need 24GB of
> >> memory reserved just for the child processes. This would leave 8GB
> >> for the DN, TT and the rest of the Linux OS processes.
> >>
> >> Can you live with that? Sure.
> >> Now add in R, HBase, Impala, or some other set of tools on top of the
> >> cluster.
> >>
> >> Oops! Now you are in trouble because you will swap.
> >> Also, adding in R, you may want to bump up those child procs from 1GB
> >> to 2GB. That means the 24 slots would now require 48GB. Now you have
> >> swap, and if that happens you will see HBase in a cascading failure.
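The slot arithmetic in Mike's reply can be checked with a short sketch, using the numbers from the thread (the helper name is made up for illustration):

```python
def slot_memory_gb(mappers, reducers, child_heap_gb):
    """Total RAM reserved for map/reduce child JVMs across all slots."""
    return (mappers + reducers) * child_heap_gb

total_ram = 32
reserved = slot_memory_gb(12, 12, 1)   # 24 slots x 1 GB = 24 GB
headroom = total_ram - reserved        # 8 GB left for DN, TT, and the OS

# Bumping child heaps to 2 GB doubles the reservation to 48 GB,
# which overruns a 32 GB node and pushes it into swap.
overcommitted = slot_memory_gb(12, 12, 2) > total_ram
```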
> >>
> >> So while you can do a rolling restart with the changed configuration
> >> (reducing the number of mappers and reducers), you end up with fewer
> >> slots, which will mean longer run times for your jobs. (Fewer slots
> >> == less parallelism.)
> >>
> >> Looking at the price of memory... you can get 48GB or even 64GB for
> >> around the same price point. (8GB chips)
> >>
> >> And I didn't even talk about adding SOLR, either. Again, a memory
> >> hog... ;-)
> >>
> >> Note that I matched the number of mappers with reducers. You could go
> >> with fewer reducers if you want. I tend to recommend a ratio of 2:1
> >> mappers to reducers, depending on the workflow...
> >>
> >> As to the disks... no, 7200 RPM SATA III drives are fine. The SATA
> >> III interface is pretty much available in the new kit being shipped.
> >> It's just that you don't have enough drives. 8 cores should mean 8
> >> spindles if available.
> >> Otherwise you end up seeing your CPU load climb on wait states as the
> >> processes wait for the disk I/O to catch up.
> >>
> >> I mean, you could build out a cluster with 4 x 3.5" 2TB drives in a
> >> 1U chassis based on price. You're making a trade-off, and you should
> >> be aware of the performance hit you will take.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >> On Nov 27, 2012, at 1:52 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> >>
> >>> Hi Michael,
> >>>
> >>> So are you recommending 32GB per node?
> >>>
> >>> What about the disks? Are SATA drives too slow?
> >>>
> >>> JM
> >>>
> >>> 2012/11/26, Michael Segel <[email protected]>:
> >>>> Uhm, those specs are actually now out of date.
> >>>>
> >>>> If you're running HBase, or want to also run R on top of Hadoop,
> >>>> you will need to add more memory.
> >>>> Also, forget 1GbE, go 10GbE, and with 2 SATA drives you will be
> >>>> disk-I/O-bound way too quickly.
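The 2:1 mapper-to-reducer ratio recommended above can be sketched as a small split helper (a hypothetical function, just to make the arithmetic concrete):

```python
def split_slots(total_slots, mapper_ratio=2, reducer_ratio=1):
    """Split task slots by a mappers:reducers ratio (default 2:1)."""
    unit = total_slots // (mapper_ratio + reducer_ratio)
    reducers = unit * reducer_ratio
    mappers = total_slots - reducers
    return mappers, reducers

# 24 slots at 2:1 -> 16 mappers, 8 reducers.
mappers, reducers = split_slots(24)
```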
> >>>>
> >>>> On Nov 26, 2012, at 8:05 AM, Marcos Ortiz <[email protected]> wrote:
> >>>>
> >>>>> Are you asking about hardware recommendations?
> >>>>> Eric Sammer, in his "Hadoop Operations" book, did a great job on
> >>>>> this. For mid-sized clusters (up to 300 nodes):
> >>>>> Processor: dual quad-core, 2.6 GHz
> >>>>> RAM: 24 GB DDR3
> >>>>> Dual 1 Gb Ethernet NICs
> >>>>> A SAS drive controller
> >>>>> At least two SATA II drives in a JBOD configuration
> >>>>>
> >>>>> The replication factor depends heavily on the primary use of your
> >>>>> cluster.
> >>>>>
> >>>>> On 11/26/2012 08:53 AM, David Charle wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> What are the recommended nodes for the NN, HMaster and ZK for a
> >>>>>> larger cluster, let's say 50-100+?
> >>>>>>
> >>>>>> Also, what would be the ideal replication factor for larger
> >>>>>> clusters when you have 3-4 racks?
> >>>>>>
> >>>>>> --
> >>>>>> David
> >>>>>> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS
> >>>>>> CIENCIAS INFORMATICAS... CONNECTED TO THE FUTURE, CONNECTED TO
> >>>>>> THE REVOLUTION
> >>>>>>
> >>>>>> http://www.uci.cu
> >>>>>> http://www.facebook.com/universidad.uci
> >>>>>> http://www.flickr.com/photos/universidad_uci
> >>>>>
> >>>>> --
> >>>>> Marcos Luis Ortíz Valmaseda
> >>>>> about.me/marcosortiz
> >>>>> @marcosluis2186

--
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me
