Sorry - I meant to answer Iulia, not Michael. I was speaking more generally, since there is also no guarantee that MR jobs are running. So perhaps I should add deployment / running-server architecture to the list of considerations.
/Ian

On 05/02/2011 01:47 PM, Jean-Daniel Cryans wrote:
> Ian,
>
> Regarding your first point, I understand where the concern is coming
> from, but I'd like to point out that with the new MemStore-Local
> Allocation Buffers the full GCs taking minutes might not be as much
> of an issue as they used to be. That said, I haven't tested that out
> yet, and I don't know of anyone who has.
>
> Your second point is dead-on. Not only does re-replication take time,
> it can also steal precious I/O, and in 0.20 it's pretty much
> impossible to limit the rate of re-replication.
>
> J-D
>
> On Mon, May 2, 2011 at 7:30 AM, Ian Roughley <[email protected]> wrote:
>> I think that there are two important considerations:
>> 1. Can the JVM you're planning on using support a heap of > 10GB?
>> If not, you're wasting money.
>> 2. Putting more disk on nodes means that a failure will take longer
>> to re-replicate back to its balanced state. I.e., given your network
>> topology, how long will even a 50TB machine take: a day, a week,
>> longer?
>>
>> /Ian
>> Architect / Mgr - Novell Vibe
>>
>> On 05/02/2011 09:57 AM, Michael Segel wrote:
>>>
>>> Hi,
>>>
>>> That's actually a really good question.
>>> Unfortunately, the answer isn't really simple.
>>>
>>> You're going to need to estimate your growth, and you're going to
>>> need to estimate your configuration.
>>>
>>> Suppose I know that within 2 years the amount of data I want to
>>> retain is going to be 1PB. With a 3x replication factor, I'll need
>>> at least 3PB of disk. Assuming that I can fit 12x2TB drives in a
>>> node, I'll need 125-150 machines. (There's some overhead for
>>> logging and the OS.)
>>>
>>> Now this doesn't mean that I'll need to buy all of the machines
>>> today and build out the cluster. It means that I will need to
>>> figure out my machine room (rack space, power, etc.) and also my
>>> hardware configuration.
>>>
>>> You'll also need to plan out your hardware choices. An example: you
>>> may want 10GbE on the switch but not at the data node. However,
>>> you're going to want to be able to expand your data nodes to add
>>> 10GbE cards later.
>>>
>>> The idea is that as I build out my cluster, all of the machines
>>> have the same look and feel. So if you buy quad-core 2.2 GHz CPUs
>>> today, and 6 months from now you buy 2.6 GHz CPUs, as long as they
>>> are 4-core CPUs your cluster will look the same.
>>>
>>> The point is that when you lay out your cluster to start with,
>>> you'll need to plan ahead and keep things similar. You'll also need
>>> to make sure your NameNode has enough memory...
>>>
>>> Having said that... Yahoo! has written a paper detailing MR2 (the
>>> next generation of map/reduce). As the M/R job scheduler becomes
>>> more intelligent about the types of jobs and types of hardware, the
>>> consistency of hardware becomes less important.
>>>
>>> With respect to HBase, I suspect there will be a parallel evolution.
>>>
>>> As to building out and replacing your cluster... if this is a
>>> production environment, you'll have to think about DR and building
>>> out a second cluster. So the cost of replacing clusters should also
>>> be factored in when you budget for hardware.
>>>
>>> Like I said, it's not a simple answer; you have to approach each
>>> instance separately and fine-tune your cluster plans.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>> ----------------------------------------
>>>> Date: Mon, 2 May 2011 09:53:05 +0300
>>>> From: [email protected]
>>>> To: [email protected]
>>>> CC: [email protected]
>>>> Subject: Re: Hardware configuration
>>>>
>>>> Thank you both. How would you estimate really big clusters, with
>>>> hundreds of nodes? Requirements might change in time, and
>>>> replacing an entire cluster seems not the best solution...
>>>>
>>>> On 04/29/2011 07:08 PM, Stack wrote:
>>>>> I agree with Michael Segel. Distributed computing is hard enough.
>>>>> There is no need to add extra complexity.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Fri, Apr 29, 2011 at 4:05 AM, Iulia Zidaru wrote:
>>>>>> Hi,
>>>>>> I'm wondering if having a cluster with different machines in
>>>>>> terms of CPU, RAM, and disk space would be a big issue for
>>>>>> HBase. For example, machines with 12GB of RAM and machines with
>>>>>> 48GB. We suppose that we use them at full capacity. What
>>>>>> problems might we encounter with this kind of configuration?
>>>>>> Thank you,
>>>>>> Iulia
>>>>
>>>> --
>>>> Iulia Zidaru
>>>> Java Developer
>>>>
>>>> 1&1 Internet AG - Bucharest/Romania - Web Components Romania
>>>> 18 Mircea Eliade St
>>>> Sect 1, Bucharest
>>>> RO Bucharest, 012015
>>>> [email protected]
>>>> 0040 31 223 9153
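Ian's second consideration above (how long does re-replication take after losing a dense node?) lends itself to a back-of-the-envelope check. A minimal sketch; the bandwidth figures are illustrative assumptions, and the sustained aggregate re-replication rate is something you would have to measure on your own network:

```python
# Rough re-replication time estimate for a failed data node.
# lost_tb and the sustained aggregate bandwidth are assumptions you
# must supply for your own cluster; since 0.20 cannot throttle this
# traffic, the sustained rate also tells you how much I/O you lose.

def rereplication_hours(lost_tb, aggregate_gbit_per_s):
    """Hours to re-copy lost_tb of block data at a sustained
    cluster-wide re-replication rate of aggregate_gbit_per_s."""
    bits = lost_tb * 8e12                       # decimal TB -> bits
    seconds = bits / (aggregate_gbit_per_s * 1e9)
    return seconds / 3600.0

# A 50TB node at a sustained 10 Gbit/s takes about half a day;
# at a sustained 1 Gbit/s it is closer to five days.
print(round(rereplication_hours(50, 10), 1))   # 11.1
print(round(rereplication_hours(50, 1), 1))    # 111.1
```

Either figure supports Ian's point: dense nodes make the window of reduced redundancy after a failure much longer.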
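Michael's sizing arithmetic (1PB retained, 3x replication, 12x2TB drives per node, "some overhead for logging and the OS") can be parameterized. A minimal sketch; the 20% overhead fraction is an assumed figure, which is why the default result lands a bit above the naive 125-node floor he quotes:

```python
import math

# Parameterized version of the sizing estimate quoted above:
# raw capacity = data * replication, divided by usable disk per node.
# Drive count, drive size, and overhead fraction are all assumptions.

def nodes_needed(data_tb, replication=3, drives_per_node=12,
                 drive_tb=2.0, usable_fraction=0.8):
    """Estimate data-node count. usable_fraction reserves space for
    the OS, logs, and temporary MapReduce output (assumed ~20%)."""
    raw_tb = data_tb * replication
    usable_per_node_tb = drives_per_node * drive_tb * usable_fraction
    return math.ceil(raw_tb / usable_per_node_tb)

print(nodes_needed(1000))                        # 157
print(nodes_needed(1000, usable_fraction=1.0))   # 125 (no overhead)
```

Varying the overhead fraction and drive size this way is a quick sanity check on whether a proposed node configuration actually fits the growth target.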
