Ian,

Regarding your first point, I understand where the concern is coming
from, but I'd like to point out that with the new MemStore-Local
Allocation Buffers, full GCs that take minutes might not be as much of
an issue as they used to be. That said, I haven't tested that yet
and I don't know of anyone who has.

Your second point is dead-on. Not only does re-replication take time,
it can also steal precious IO, and in 0.20 it's pretty much impossible
to limit the rate of re-replication.
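
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. Everything in it is an assumption for illustration, not a
measurement: the function name is made up, the cluster has 20 surviving
nodes, each can spare 10 MB/s for re-replication, and the lost blocks are
spread evenly so all survivors copy in parallel.

```python
def rereplication_hours(failed_node_tb, surviving_nodes, mb_per_sec_per_node):
    """Rough lower bound on the time to re-replicate a dead node's blocks.

    Assumes the lost blocks are spread evenly, so every surviving node
    copies an equal share in parallel at the given sustained rate.
    """
    total_mb = failed_node_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    aggregate_mb_per_sec = surviving_nodes * mb_per_sec_per_node
    return total_mb / aggregate_mb_per_sec / 3600


# Hypothetical cluster: 20 surviving nodes, each sparing 10 MB/s.
for tb in (24, 50):
    print(f"{tb} TB node: ~{rereplication_hours(tb, 20, 10):.0f} hours")
```

Under those assumptions a full 24TB node takes about 33 hours and a 50TB
node about 69 hours, and real clusters will likely do worse since the
copying competes with regular traffic.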

J-D

On Mon, May 2, 2011 at 7:30 AM, Ian Roughley <[email protected]> wrote:
> I think that there are two important considerations:
> 1. Can the JVM you're planning on using support a heap of > 10GB? If not,
> you're wasting money.
> 2. Putting more disk on nodes means that a failure will take longer to
> re-replicate back to its balanced state, i.e. given your network topology,
> how long will even a 50TB machine take: a day, a week, longer?
>
> /Ian
> Architect / Mgr - Novell Vibe
>
> On 05/02/2011 09:57 AM, Michael Segel wrote:
>>
>> Hi,
>>
>> That's actually a really good question.
>> Unfortunately, the answer isn't really simple.
>>
>> You're going to need to estimate your growth and you're going to need to 
>> estimate your configuration.
>>
>> Suppose I know that within 2 years, the amount of data that I want to retain 
>> is going to be 1PB, with a 3x replication factor, I'll need at least 3PB of 
>> disk. Assuming that I can fit 12x2TB drives in a node, I'll need 125-150 
>> machines. (There's some overhead for logging and OS)
>>
>> Now this doesn't mean that I'll need to buy all of the machines today and 
>> build out the cluster.
>> It means that I will need to figure out my machine room, (rack space, power, 
>> etc...) and also hardware configuration.
>>
>> You'll need to plan out your hardware choices too. For example, you may
>> want 10GbE on the switch but not at the data node. However, you're going to
>> want to be able to expand your data nodes by adding 10GbE cards.
>>
>> The idea is that as I build out my cluster, all of the machines have the 
>> same look and feel. So if you buy quad core CPUs and they are 2.2 GHz but 6 
>> months from now, you buy 2.6 GHz cpus, as long as they are 4 core cpus, your 
>> cluster will look the same.
>>
>> The point is that when you lay out your cluster to start with, you'll need 
>> to plan ahead and keep things similar. Also you'll need to make sure your 
>> NameNode has enough memory...
>>
>> Having said that... Yahoo! has written a paper detailing MR2 (next 
>> generation of map/reduce).  As the M/R Job scheduler becomes more 
>> intelligent about the types of jobs and types of hardware, the consistency 
>> of hardware becomes less important.
>>
>> With respect to HBase, I suspect there to be a parallel evolution.
>>
>> As to building out and replacing your cluster... if this is a production 
>> environment, you'll have to think about DR and building out a second 
>> cluster. So the cost of replacing clusters should also be factored in when 
>> you budget for hardware.
>>
>> Like I said, it's not a simple answer and you have to approach each instance
>> separately and fine-tune your cluster plans.
>>
>> HTH
>>
>> -Mike
>>
>>
>> ----------------------------------------
>>> Date: Mon, 2 May 2011 09:53:05 +0300
>>> From: [email protected]
>>> To: [email protected]
>>> CC: [email protected]
>>> Subject: Re: Hardware configuration
>>>
>>> Thank you both. How would you estimate really big clusters, with
>>> hundreds of nodes? Requirements might change in time and replacing an
>>> entire cluster seems not the best solution...
>>>
>>>
>>>
>>> On 04/29/2011 07:08 PM, Stack wrote:
>>>> I agree with Michael Segel. Distributed computing is hard enough.
>>>> There is no need to add extra complexity.
>>>>
>>>> St.Ack
>>>>
>>>> On Fri, Apr 29, 2011 at 4:05 AM, Iulia Zidaru wrote:
>>>>> Hi,
>>>>> I'm wondering if having a cluster with different machines in terms of CPU,
>>>>> RAM and disk space would be a big issue for HBase. For example, machines
>>>>> with 12GBs RAM and machines with 48GBs. We suppose that we use them at
>>>>> full capacity. What problems might we encounter with this kind of
>>>>> configuration?
>>>>> Thank you,
>>>>> Iulia
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Iulia Zidaru
>>> Java Developer
>>>
>>> 1&1 Internet AG - Bucharest/Romania - Web Components Romania
>>> 18 Mircea Eliade St
>>> Sect 1, Bucharest
>>> RO Bucharest, 012015
>>> [email protected]
>>> 0040 31 223 9153
>>>
>>>
>>>
>>
>
>
