I think there are two important considerations:

1. Can the JVM you're planning to use support a heap of > 10GB? If not, you're wasting money.
2. Putting more disk on each node means that a failure will take longer to re-replicate back to its balanced state. I.e., given your network topology, how long will even a 50TB machine take: a day, a week, longer?
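To put rough numbers on the second point, here is a minimal Python sketch of the back-of-the-envelope estimate. The function name, the `efficiency` parameter, and the bandwidth figures are assumptions for illustration, not measurements. In a real cluster, re-replication is spread across many nodes and competes with normal traffic, so the single bandwidth number here is a simplification.

```python
def rereplication_hours(data_tb, bandwidth_gbps, efficiency=1.0):
    """Hours needed to copy data_tb terabytes at bandwidth_gbps gigabits/s.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1) for the
    share of the bandwidth that re-replication actually gets.
    """
    if bandwidth_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    bits_to_move = data_tb * 1e12 * 8                  # decimal terabytes to bits
    bits_per_second = bandwidth_gbps * 1e9 * efficiency
    return bits_to_move / bits_per_second / 3600       # seconds to hours


if __name__ == "__main__":
    # Assumed sustained rates; substitute what your topology can deliver.
    for gbps in (1, 10):
        hours = rereplication_hours(50, gbps)
        print(f"50TB at {gbps} Gbit/s: {hours:.1f} hours ({hours / 24:.1f} days)")
```

At a sustained 1 Gbit/s, 50TB works out to about 111 hours (roughly 4.6 days); at 10 Gbit/s it is about 11 hours. Anything that lowers the effective rate pushes the answer toward "a week".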
/Ian
Architect / Mgr - Novell Vibe

On 05/02/2011 09:57 AM, Michael Segel wrote:
>
> Hi,
>
> That's actually a really good question.
> Unfortunately, the answer isn't really simple.
>
> You're going to need to estimate your growth and you're going to need to
> estimate your configuration.
>
> Suppose I know that within 2 years, the amount of data that I want to retain
> is going to be 1PB. With a 3x replication factor, I'll need at least 3PB of
> disk. Assuming that I can fit 12x2TB drives in a node, I'll need 125-150
> machines. (There's some overhead for logging and OS.)
>
> Now this doesn't mean that I'll need to buy all of the machines today and
> build out the cluster.
> It means that I will need to figure out my machine room (rack space, power,
> etc.) and also hardware configuration.
>
> You'll also need to plan out your hardware choices. For example, you may
> want 10GbE on the switch but not at the data node. However, you're going to
> want to be able to expand your data nodes by adding 10GbE cards.
>
> The idea is that as I build out my cluster, all of the machines have the same
> look and feel. So if you buy quad-core CPUs and they are 2.2 GHz but 6 months
> from now you buy 2.6 GHz CPUs, as long as they are 4-core CPUs, your cluster
> will look the same.
>
> The point is that when you lay out your cluster to start with, you'll need to
> plan ahead and keep things similar. Also you'll need to make sure your
> NameNode has enough memory...
>
> Having said that... Yahoo! has written a paper detailing MR2 (next generation
> of map/reduce). As the M/R job scheduler becomes more intelligent about the
> types of jobs and types of hardware, the consistency of hardware becomes less
> important.
>
> With respect to HBase, I suspect there will be a parallel evolution.
>
> As to building out and replacing your cluster... if this is a production
> environment, you'll have to think about DR and building out a second cluster.
> So the cost of replacing clusters should also be factored in when you budget
> for hardware.
>
> Like I said, it's not a simple answer and you have to approach each instance
> separately and fine-tune your cluster plans.
>
> HTH
>
> -Mike
>
>
> ----------------------------------------
>> Date: Mon, 2 May 2011 09:53:05 +0300
>> From: [email protected]
>> To: [email protected]
>> CC: [email protected]
>> Subject: Re: Hardware configuration
>>
>> Thank you both. How would you estimate really big clusters, with
>> hundreds of nodes? Requirements might change in time and replacing an
>> entire cluster seems not the best solution...
>>
>>
>>
>> On 04/29/2011 07:08 PM, Stack wrote:
>>> I agree with Michel Segel. Distributed computing is hard enough.
>>> There is no need to add extra complexity.
>>>
>>> St.Ack
>>>
>>> On Fri, Apr 29, 2011 at 4:05 AM, Iulia Zidaru wrote:
>>>> Hi,
>>>> I'm wondering if having a cluster with different machines in terms of CPU,
>>>> RAM and disk space would be a big issue for HBase. For example, machines
>>>> with 12GB RAM and machines with 48GB. We suppose that we use them at full
>>>> capacity. What problems might we encounter with this kind of
>>>> configuration?
>>>> Thank you,
>>>> Iulia
>>>>
>>>>
>>
>>
>> --
>> Iulia Zidaru
>> Java Developer
>>
>> 1&1 Internet AG - Bucharest/Romania - Web Components Romania
>> 18 Mircea Eliade St
>> Sect 1, Bucharest
>> RO Bucharest, 012015
>> [email protected]
>> 0040 31 223 9153
>>

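The sizing arithmetic in Michael's quoted message (1PB retained, 3x replication, 12x2TB drives per node) can be sketched as below. This is only an illustration: the function name and the `overhead_fraction` parameter are hypothetical, and the 15% figure used in the example is an assumed value standing in for the "overhead for logging and OS" he mentions without quantifying.

```python
import math


def nodes_needed(retained_pb, replication=3, drives_per_node=12,
                 drive_tb=2, overhead_fraction=0.0):
    """Data nodes required to hold retained_pb petabytes after replication.

    overhead_fraction is a hypothetical share of each node's raw disk that is
    reserved for logs and the OS (0 <= overhead_fraction < 1).
    """
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    raw_tb = retained_pb * 1000 * replication           # decimal PB to TB
    usable_tb_per_node = drives_per_node * drive_tb * (1 - overhead_fraction)
    return math.ceil(raw_tb / usable_tb_per_node)


if __name__ == "__main__":
    print(nodes_needed(1))                              # no overhead
    print(nodes_needed(1, overhead_fraction=0.15))      # assumed 15% overhead
```

With no overhead this gives 125 nodes, the low end of the quoted range; an assumed 15% overhead gives 148, near the high end.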