Sorry - I meant to answer Iulia, not Michael.  I was speaking more 
generally, as there is also no
guarantee that MR jobs are running.  So perhaps I should add deployment / 
running-server architecture
to the considerations.

/Ian

On 05/02/2011 01:47 PM, Jean-Daniel Cryans wrote:
> Ian,
> 
> Regarding your first point, I understand where the concern is coming
> from, but I'd like to point out that with the new MemStore-Local
> Allocation Buffers, the full GCs taking minutes may not be as much of
> an issue as they used to be. That said, I haven't tested that myself,
> and I don't know of anyone who has.
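
For reference, the MSLAB feature J-D mentions is toggled in hbase-site.xml as of 0.90.x; a minimal sketch (the chunk size shown is just the documented default, not a tuned value):

```xml
<!-- hbase-site.xml: enable MemStore-Local Allocation Buffers to reduce
     old-generation heap fragmentation and long stop-the-world GCs -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- allocation chunk size in bytes; 2MB is the default -->
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value>
</property>
```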
> 
> Your second point is dead-on. Not only does re-replication take time,
> it can also steal precious I/O, and in 0.20 it's pretty much
> impossible to limit the re-replication rate.
> 
> J-D
> 
> On Mon, May 2, 2011 at 7:30 AM, Ian Roughley <[email protected]> wrote:
>> I think that there are two important considerations:
>> 1. Can the JVM you're planning on using support a heap of > 10GB? If 
>> not, you're wasting money.
>> 2. Putting more disk on nodes means that a failure will take longer to 
>> re-replicate back to its
>> balanced state.  I.e., given your network topology, how long will even a 
>> 50TB machine take: a day, a
>> week, longer?
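
To make the second point concrete, here is a hypothetical back-of-envelope calculation (the function name, the 1GbE link speed, and the 50% usable-bandwidth figure are all illustrative assumptions, not measured values):

```python
def rereplication_hours(lost_tb, link_gbps, efficiency=0.5):
    """Rough time to re-replicate a failed node's data.

    lost_tb    -- terabytes of data that must be re-replicated
    link_gbps  -- network bandwidth into the cluster, in Gbit/s
    efficiency -- assumed fraction of the link usable for re-replication
                  (the rest keeps serving normal traffic)
    """
    bits_to_move = lost_tb * 1e12 * 8                # TB -> bits
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_sec / 3600  # seconds -> hours

# A 50TB node behind a 1GbE link, with half the bandwidth usable:
print(round(rereplication_hours(50, 1), 1))  # 222.2 hours, i.e. over 9 days
```

The takeaway matches Ian's warning: at commodity link speeds, very dense nodes can take days to return to a balanced state after a failure.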
>>
>> /Ian
>> Architect / Mgr - Novell Vibe
>>
>> On 05/02/2011 09:57 AM, Michael Segel wrote:
>>>
>>> Hi,
>>>
>>> That's actually a really good question.
>>> Unfortunately, the answer isn't really simple.
>>>
>>> You're going to need to estimate both your growth and your 
>>> configuration.
>>>
>>> Suppose I know that within two years the amount of data I want to 
>>> retain will be 1PB. With a 3x replication factor, I'll need at least 
>>> 3PB of raw disk. Assuming that I can fit 12 x 2TB drives in a node, I'll 
>>> need 125-150 machines. (There's some overhead for logging and the OS.)
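
That arithmetic can be sketched as a quick calculation (the function name and the 20% headroom reserved for logs and the OS are illustrative assumptions):

```python
import math

def nodes_needed(data_pb, replication=3, drives_per_node=12,
                 drive_tb=2.0, usable_fraction=0.8):
    """Back-of-envelope node count for a target dataset size.

    usable_fraction reserves headroom for the OS, logs, and
    temporary/spill space (assumed value).
    """
    raw_tb_needed = data_pb * 1000 * replication
    usable_tb_per_node = drives_per_node * drive_tb * usable_fraction
    return math.ceil(raw_tb_needed / usable_tb_per_node)

# 1PB retained, 3x replication, 12 x 2TB drives per node:
print(nodes_needed(1))                        # 157 with 20% headroom
print(nodes_needed(1, usable_fraction=1.0))   # 125 with no headroom
```

The 125-157 spread is exactly the 125-150 range above: the answer depends on how much per-node overhead you budget for.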
>>>
>>> Now, this doesn't mean that I'll need to buy all of the machines today 
>>> and build out the cluster.
>>> It means that I will need to plan my machine room (rack space, 
>>> power, etc.) and my hardware configuration.
>>>
>>> You'll also need to plan out your hardware choices. For example, you 
>>> may want 10GbE at the switch but not at the data nodes; however, you're 
>>> going to want to be able to add 10GbE cards to the data nodes later.
>>>
>>> The idea is that as I build out my cluster, all of the machines have the 
>>> same look and feel. So if you buy quad-core 2.2 GHz CPUs now and, six 
>>> months from now, you buy 2.6 GHz CPUs, as long as they are four-core 
>>> CPUs your cluster will look the same.
>>>
>>> The point is that when you lay out your cluster to start with, you'll need 
>>> to plan ahead and keep things similar. You'll also need to make sure your 
>>> NameNode has enough memory...
>>>
>>> Having said that... Yahoo! has written a paper detailing MR2 (the next 
>>> generation of MapReduce). As the M/R job scheduler becomes more 
>>> intelligent about the types of jobs and types of hardware, the consistency 
>>> of hardware becomes less important.
>>>
>>> With respect to HBase, I suspect there to be a parallel evolution.
>>>
>>> As to building out and replacing your cluster... if this is a production 
>>> environment, you'll have to think about DR and building out a second 
>>> cluster. So the cost of replacing clusters should also be factored in when 
>>> you budget for hardware.
>>>
>>> Like I said, it's not a simple answer; you have to approach each instance 
>>> separately and fine-tune your cluster plans.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>> ----------------------------------------
>>>> Date: Mon, 2 May 2011 09:53:05 +0300
>>>> From: [email protected]
>>>> To: [email protected]
>>>> CC: [email protected]
>>>> Subject: Re: Hardware configuration
>>>>
>>>> Thank you both. How would you estimate really big clusters, with
>>>> hundreds of nodes? Requirements might change over time, and replacing
>>>> an entire cluster doesn't seem like the best solution...
>>>>
>>>>
>>>>
>>>> On 04/29/2011 07:08 PM, Stack wrote:
>>>>> I agree with Michel Segel. Distributed computing is hard enough.
>>>>> There is no need to add extra complexity.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Fri, Apr 29, 2011 at 4:05 AM, Iulia Zidaru wrote:
>>>>>> Hi,
>>>>>> I'm wondering if having a cluster with different machines in terms of
>>>>>> CPU, RAM, and disk space would be a big issue for HBase. For example,
>>>>>> machines with 12GB of RAM and machines with 48GB. We assume that we
>>>>>> use them at full capacity. What problems might we encounter with this
>>>>>> kind of configuration?
>>>>>> Thank you,
>>>>>> Iulia
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Iulia Zidaru
>>>> Java Developer
>>>>
>>>> 1&1 Internet AG - Bucharest/Romania - Web Components Romania
>>>> 18 Mircea Eliade St
>>>> Sect 1, Bucharest
>>>> RO Bucharest, 012015
>>>> [email protected]
>>>> 0040 31 223 9153
>>>>
>>>>
>>>>
>>>
>>
>>
