I think that there are two important considerations:
1. Can the JVM you're planning to use support a heap of > 10GB? If not, 
you're wasting money.
2. Putting more disk on a node means that a failure will take longer to 
re-replicate back to its balanced state, i.e. given your network topology, 
how long will even a 50TB machine take: a day, a week, longer?
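As a rough sketch of that second point (the 50TB figure is from the example above; the usable bandwidth number is an assumption, not a measurement):

```python
# Back-of-envelope: time to re-replicate a failed node's data.
# HDFS re-replicates from surviving replicas spread across the cluster,
# so the bottleneck is usually the aggregate bandwidth you can spare.

def rereplication_hours(node_tb, usable_gbit_per_sec):
    """node_tb: data held on the dead node, in TB.
    usable_gbit_per_sec: aggregate bandwidth the cluster can
    actually devote to re-replication (an assumed figure)."""
    bits = node_tb * 1e12 * 8                      # TB -> bits
    seconds = bits / (usable_gbit_per_sec * 1e9)   # Gbit/s -> bit/s
    return seconds / 3600

# A 50TB node with 10 Gbit/s of cluster-wide spare bandwidth:
print(round(rereplication_hours(50, 10), 1))   # ~11.1 hours
```

In practice dfs.namenode.replication work is throttled, so the real figure is usually worse than this idealized number, which is exactly the point: big, dense nodes stretch recovery time.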

/Ian
Architect / Mgr - Novell Vibe

On 05/02/2011 09:57 AM, Michael Segel wrote:
> 
> Hi,
> 
> That's actually a really good question.
> Unfortunately, the answer isn't really simple.
> 
> You're going to need to estimate both your growth and your configuration.
> 
> Suppose I know that within 2 years the amount of data I want to retain 
> is going to be 1PB. With a 3x replication factor, I'll need at least 3PB of 
> disk. Assuming that I can fit 12x2TB drives in a node, I'll need 125-150 
> machines. (There's some overhead for logging and the OS.)
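Mike's arithmetic can be sketched directly (the 1PB / 3x / 12x2TB figures are his; the per-node overhead reserve is an assumed value):

```python
# Node count from retained data, replication factor, and per-node disk.
raw_pb = 1              # data to retain, in PB (from the example above)
replication = 3
drives_per_node = 12
tb_per_drive = 2

total_tb_needed = raw_pb * 1000 * replication     # 3000 TB with replicas
tb_per_node = drives_per_node * tb_per_drive      # 24 TB raw per node
nodes_min = total_tb_needed / tb_per_node         # ideal, zero overhead

# Reserve ~15% per node for OS, logs, and MapReduce spill space
# (the 15% is an assumption, not a figure from the thread):
nodes_with_overhead = total_tb_needed / (tb_per_node * 0.85)

print(round(nodes_min), round(nodes_with_overhead))   # 125 147
```

That lands in the 125-150 range quoted above; the spread is mostly about how much per-node space you set aside for non-HDFS use.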
> 
> Now this doesn't mean that I'll need to buy all of the machines today and 
> build out the cluster.
> It means that I will need to figure out my machine room, (rack space, power, 
> etc...) and also hardware configuration.
> 
> You'll also need to plan out your hardware choices. An example: you may 
> want 10GbE on the switch but not at the data node. However, you're going to 
> want your data nodes to be expandable so that you can add 10GbE cards later.
> 
> The idea is that as I build out my cluster, all of the machines have the same 
> look and feel. So if you buy quad-core 2.2 GHz CPUs today and, 6 months 
> from now, you buy 2.6 GHz CPUs, then as long as they are still 4-core CPUs 
> your cluster will look the same.
> 
> The point is that when you lay out your cluster to start with, you'll need to 
> plan ahead and keep things similar. Also, you'll need to make sure your 
> NameNode has enough memory...
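On that NameNode point: a commonly cited rule of thumb (general Hadoop guidance, not a figure from this thread) is that the NameNode keeps every file, directory, and block in heap at very roughly 150 bytes per object, i.e. on the order of 1GB of heap per million objects:

```python
# Rough NameNode heap estimate. All figures are illustrative
# rule-of-thumb assumptions, not measurements.
def namenode_heap_gb(files_millions, blocks_per_file=1.5,
                     bytes_per_object=150):
    """Heap needed to hold the namespace: one object per file
    plus one per block, at ~150 bytes each."""
    objects = files_millions * 1e6 * (1 + blocks_per_file)
    return objects * bytes_per_object / 1e9

# 100 million files averaging 1.5 blocks each:
print(round(namenode_heap_gb(100), 1))   # ~37.5 GB
```

The practical consequence is the one Mike hints at: namespace growth, not disk, is often what forces a NameNode memory upgrade.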
> 
> Having said that... Yahoo! has written a paper detailing MR2 (the next 
> generation of map/reduce). As the M/R job scheduler becomes more intelligent 
> about the types of jobs and types of hardware, the consistency of hardware 
> becomes less important.
> 
> With respect to HBase, I suspect there to be a parallel evolution.
> 
> As to building out and replacing your cluster... if this is a production 
> environment, you'll have to think about DR and building out a second cluster. 
> So the cost of replacing clusters should also be factored in when you budget 
> for hardware.
> 
> Like I said, it's not a simple answer and you have to approach each instance 
> separately and fine-tune your cluster plans.
> 
> HTH
> 
> -Mike
> 
> 
> ----------------------------------------
>> Date: Mon, 2 May 2011 09:53:05 +0300
>> From: [email protected]
>> To: [email protected]
>> CC: [email protected]
>> Subject: Re: Hardware configuration
>>
>> Thank you both. How would you estimate really big clusters, with
>> hundreds of nodes? Requirements might change over time, and replacing an
>> entire cluster doesn't seem like the best solution...
>>
>>
>>
>> On 04/29/2011 07:08 PM, Stack wrote:
>>> I agree with Michael Segel. Distributed computing is hard enough.
>>> There is no need to add extra complexity.
>>>
>>> St.Ack
>>>
>>> On Fri, Apr 29, 2011 at 4:05 AM, Iulia Zidaru wrote:
>>>> Hi,
>>>> I'm wondering if having a cluster with different machines in terms of CPU,
>>>> RAM, and disk space would be a big issue for HBase. For example, machines
>>>> with 12GB RAM and machines with 48GB. We assume we would use them at full
>>>> capacity. What problems might we encounter with this kind of
>>>> configuration?
>>>> Thank you,
>>>> Iulia
>>>>
>>>>
>>
>>
>> --
>> Iulia Zidaru
>> Java Developer
>>
>> 1&1 Internet AG - Bucharest/Romania - Web Components Romania
>> 18 Mircea Eliade St
>> Sect 1, Bucharest
>> RO Bucharest, 012015
>> [email protected]
>> 0040 31 223 9153
>>
>>
>>
