Sorry, 

I need to clarify. 

4GB per physical core is a good starting point. 
So with 2 quad core chips, that is going to be 32GB. 

IMHO that's a minimum. If you go with HBase, you will want more. (Actually you 
will need more.) The next logical jump would be to 48 or 64GB. 
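
To make the arithmetic concrete, here is a rough sketch in Python. The function names are made up for illustration, and the 4GB-per-core figure and the 1GB/2GB child heaps are just the rules of thumb from this thread, not anything Hadoop defines:

```python
def baseline_ram_gb(sockets, cores_per_socket, gb_per_core=4):
    """Starting-point RAM for a node: gb_per_core for every physical core."""
    return sockets * cores_per_socket * gb_per_core


def headroom_gb(total_ram_gb, mappers, reducers, child_heap_gb=1):
    """RAM left for DN, TT, and the OS after reserving one child heap per slot."""
    return total_ram_gb - (mappers + reducers) * child_heap_gb


ram = baseline_ram_gb(sockets=2, cores_per_socket=4)
print(ram)                                         # 32
print(headroom_gb(ram, 12, 12))                    # 8
print(headroom_gb(ram, 12, 12, child_heap_gb=2))   # -16, i.e. you will swap
print(headroom_gb(64, 12, 12, child_heap_gb=2))    # 16
```

A negative result means the slots alone already oversubscribe the box, before HBase or anything else gets a byte.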

If we start to price out memory, then depending on the vendor and your company's 
procurement, there really isn't much of a price difference between 32, 48, 
and 64GB. 
Note that it also depends on the chips themselves. You also need to see how 
many memory channels the motherboard has. You may need to buy in pairs or 
triplets. Your hardware vendor can help you. (You also need to keep an eye on 
your hardware vendor. Sometimes they will give you higher-density chips that 
are going to be more expensive...) ;-) 

I tend to like having extra memory from the start.  
It gives you a bit more freedom and also protects you from 'fat' code. 

Looking at YARN... you will need more memory too. 


With respect to the hard drives... 

The best recommendation is to keep the drives as JBOD and then use 3x 
replication. 
In this case, make sure that the disk controller cards can handle JBOD. (Some 
don't support JBOD out of the box) 
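
On the question of using more than one drive: HDFS takes a comma-separated list of directories in dfs.data.dir (dfs.datanode.data.dir in Hadoop 2), typically one per disk, and the DataNode stores blocks across all of them. A small sketch that builds that value follows. The /data/N mount points, the dfs/dn subdirectory, and the helper names are made-up examples, so substitute your own layout:

```python
def data_dirs(num_disks, mount_prefix="/data", subdir="dfs/dn"):
    """Comma-separated list with one directory per JBOD disk (hypothetical paths)."""
    return ",".join(
        f"{mount_prefix}/{i}/{subdir}" for i in range(1, num_disks + 1)
    )


def property_xml(name, value):
    """Render one <property> element for hdfs-site.xml."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


# Four JBOD disks mounted at /data/1 ... /data/4 (assumed layout).
print(property_xml("dfs.data.dir", data_dirs(4)))
```

Each disk gets its own filesystem and mount point, so losing one drive only costs the blocks on that drive.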

With respect to RAID... 

If you are running MapR, there is no need for RAID. 
If you are running an Apache derivative, you could use RAID 1 and then cut your 
replication to 2x. This makes it easier to manage drive failures. 
(It's not the norm, but it works...) Some clusters use appliances 
like NetApp's E-Series, where the machines see the drives as locally attached 
storage, and I think the appliances themselves use RAID. I haven't played 
with this configuration; however, it could make sense and it's a valid design. 
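
For a feel of what the RAID 1 plus 2x replication trade costs in capacity, here is a back-of-the-envelope sketch. The 8 x 2TB node is only an example, the function name is made up, and it ignores space reserved for the OS, logs, and MapReduce temp data:

```python
def usable_tb(drives, drive_tb, hdfs_replication, raid1=False):
    """Approximate HDFS capacity of one node after mirroring and replication."""
    raw_tb = drives * drive_tb
    if raid1:
        raw_tb /= 2  # every block is mirrored onto a second drive
    return raw_tb / hdfs_replication


print(f"{usable_tb(8, 2, 3):.2f}")               # 5.33  (JBOD, 3x replication)
print(f"{usable_tb(8, 2, 2, raid1=True):.2f}")   # 4.00  (RAID 1, 2x replication)
```

RAID 1 with 2x replication keeps four physical copies of every block versus three with JBOD, so you pay in capacity for the easier drive swaps.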

HTH

-Mike

On Nov 28, 2012, at 10:33 AM, Jean-Marc Spaggiari <[email protected]> 
wrote:

> Hi Mike,
> 
> Thanks for all those details!
> 
> So to simplify the equation, for 16 virtual cores we need 48 to 64GB.
> Which means 3 to 4GB per core. So with quad cores, 12GB to 16GB are a
> good start? Or did I simplify it too much?
> 
> Regarding the hard drives. If you add more than one drive, do you need
> to build them on RAID or similar systems? Or can Hadoop/HBase be
> configured to use more than one drive?
> 
> Thanks,
> 
> JM
> 
> 2012/11/27, Michael Segel <[email protected]>:
>> 
>> OK... I don't know why Cloudera is so hung up on 32GB. ;-) [It's an inside
>> joke ...]
>> 
>> So here's the problem...
>> 
>> By default, your child processes in a map/reduce job get a default 512MB.
>> The majority of the time, this gets raised to 1GB.
>> 
>> 8 cores (dual quad cores) show up as 16 virtual processors in Linux. (Note:
>> This is why when people talk about the number of cores, you have to specify
>> physical cores or logical cores....)
>> 
>> So if you were to oversubscribe and have, let's say, 12 mappers and 12
>> reducers, that's 24 slots. Which means that you would need 24GB of memory
>> reserved just for the child processes. This would leave 8GB for DN, TT and
>> the rest of the Linux OS processes.
>> 
>> Can you live with that? Sure.
>> Now add in R, HBase, Impala, or some other set of tools on top of the
>> cluster.
>> 
>> Ooops! Now you are in trouble because you will swap.
>> Also adding in R, you may want to bump up those child procs from 1GB to
>> 2GB. That means the 24 slots would now require 48GB. Now you are swapping,
>> and if that happens you will see HBase in a cascading failure.
>> 
>> So while you can do a rolling restart with the changed configuration
>> (reducing the number of mappers and reducers) you end up with fewer slots,
>> which will mean longer run times for your jobs. (Fewer slots == less
>> parallelism)
>> 
>> Looking at the price of memory... you can get 48GB or even 64GB  for around
>> the same price point. (8GB chips)
>> 
>> And I didn't even talk about adding SOLR, which is again a memory hog... ;-)
>> 
>> Note that I matched the number of mappers w reducers. You could go with
>> fewer reducers if you want. I tend to recommend a ratio of 2:1 mappers to
>> reducers, depending on the work flow....
>> 
>> As to the disks... no, 7200 RPM SATA III drives are fine. The SATA III
>> interface is pretty much available in the new kit being shipped.
>> It's just that you don't have enough drives. 8 cores should be 8 spindles if
>> available.
>> Otherwise you end up seeing your CPU load climb on wait states as the
>> processes wait for the disk i/o to catch up.
>> 
>> I mean you could build out a cluster w 4 x 3 3.5" 2TB drives in a 1 U
>> chassis based on price. You're making a trade off and you should be aware of
>> the performance hit you will take.
>> 
>> HTH
>> 
>> -Mike
>> 
>> On Nov 27, 2012, at 1:52 PM, Jean-Marc Spaggiari <[email protected]>
>> wrote:
>> 
>>> Hi Michael,
>>> 
>>> so are you recommending 32GB per node?
>>> 
>>> What about the disks? Are SATA drives too slow?
>>> 
>>> JM
>>> 
>>> 2012/11/26, Michael Segel <[email protected]>:
>>>> Uhm, those specs are actually now out of date.
>>>> 
>>>> If you're running HBase, or want to also run R on top of Hadoop, you
>>>> will need to add more memory.
>>>> Also forget 1GbE, go 10GbE, and with 2 SATA drives, you will be disk i/o
>>>> bound way too quickly.
>>>> 
>>>> 
>>>> On Nov 26, 2012, at 8:05 AM, Marcos Ortiz <[email protected]> wrote:
>>>> 
>>>>> Are you asking about hardware recommendations?
>>>>> Eric Sammer, in his "Hadoop Operations" book, did a great job on
>>>>> this:
>>>>> For mid-size clusters (up to 300 nodes):
>>>>> Processor: A dual quad-core 2.6 GHz
>>>>> RAM: 24 GB DDR3
>>>>> Dual 1 Gb Ethernet NICs
>>>>> a SAS drive controller
>>>>> at least two SATA II drives in a JBOD configuration
>>>>> 
>>>>> The replication factor depends heavily on the primary use of your
>>>>> cluster.
>>>>> 
>>>>> On 11/26/2012 08:53 AM, David Charle wrote:
>>>>>> hi
>>>>>> 
>>>>>> what's the recommended nodes for NN, hmaster and zk nodes for a larger
>>>>>> cluster, lets say 50-100+
>>>>>> 
>>>>>> also, what would be the ideal replication factor for larger clusters
>>>>>> when
>>>>>> u have 3-4 racks ?
>>>>>> 
>>>>>> --
>>>>>> David
>>>>>> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS
>>>>>> INFORMATICAS...
>>>>>> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>>>>>> 
>>>>>> http://www.uci.cu
>>>>>> http://www.facebook.com/universidad.uci
>>>>>> http://www.flickr.com/photos/universidad_uci
>>>>> 
>>>>> --
>>>>> 
>>>>> Marcos Luis Ortíz Valmaseda
>>>>> about.me/marcosortiz <http://about.me/marcosortiz>
>>>>> @marcosluis2186 <http://twitter.com/marcosluis2186>
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
