On Thu, Jun 10, 2010 at 2:27 PM, Buttler, David <[email protected]> wrote:
> It turns out that we just received a quote from a supplier where a rack of 2U 128 GB machines with 16 cores (4x4 I think) and 8 1TB disks is cheaper than a rack of 1U machines with exactly half the spec (64 GB RAM, 8 cores, 4 1TB disks). My initial thought was that it would be better to have the 2U machines, as it would give us more flexibility if we wanted to have some map/reduce jobs that use more than 8 GB per map task.
>
> The only worry is how it would affect HBase. Would it be better to have 20 region servers with a 16GB heap and 2 dedicated cores, or 40 region servers with an 8GB heap and one core? [Of course I realize we can't dedicate a core to a region server, but we can limit the number of map/reduce jobs so that there would be no more than 14 or 7 of them, depending on the configuration.]
>
> Finally, it seems like there are a bunch of related parameters that make sense to change together depending on heap size and average row size. Is there a single place that describes the interrelatedness of the parameters, so that I don't have to guess or reconstruct good settings from 10-100 emails on the list? If I understood the issues I would be happy to write it up, but I am afraid I don't.
>
> Thanks,
> Dave
>
> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Monday, June 07, 2010 10:51 PM
> To: [email protected]
> Subject: Re: Big machines or (relatively) small machines?
>
> I would take it one notch smaller; 32GB RAM per node is probably more than enough...
>
> It would be hard to get full utilization of 128GB RAM, and maybe even 64GB. With 32GB you might even be able to get 2GB DIMMs (much cheaper).
>
> -ryan
>
> On Mon, Jun 7, 2010 at 10:48 PM, Sean Bigdatafun <[email protected]> wrote:
> > On Mon, Jun 7, 2010 at 1:13 PM, Todd Lipcon <[email protected]> wrote:
> >
> >> If those are your actual specs, I would definitely go with 16 of the smaller ones. 128G heaps are not going to work well in a JVM; you're better off running with more nodes with a more common configuration.
> >
> > I am not using one JVM on a machine, right? Each Map/Reduce task uses one JVM, I believe. And actually, my question can really be boiled down to whether the current map/reduce scheduler is smart enough to make the best use of resources. If it is smart enough, I think virtualization does not make too much sense; if it's not smart enough, I guess virtualization may help to improve performance.
> >
> > But you are right, here I was really making up a case -- "128G mem" is just the number doubling the "smaller machine"'s memory.
> >
> >> -Todd
> >>
> >> On Mon, Jun 7, 2010 at 1:46 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >>
> >> > It really depends on your usage pattern, but there's a balance wrt cost vs. hardware you must achieve. At StumbleUpon we run with 2x i7, 24GB, 4x 1TB and it works like a charm. The only thing I would change is maybe more disks/node, but that's pretty much it. Some relevant questions:
> >> >
> >> > - Do you have any mem-intensive jobs? If so, figure out how many tasks you'll run per node and make the RAM fit the load.
> >> > - Do you plan to serve data out of HBase or will you just use it for MapReduce? Or will it be a mix (not recommended)?
> >> >
> >> > Also, keep in mind that losing 1 machine out of 8 compared to 1 out of 16 drastically changes the performance of your system at the time of the failure.
> >> >
> >> > About virtualization, it doesn't make sense. Also, your disks should be in JBOD.
> >> >
> >> > J-D
> >> >
> >> > On Wed, Jun 2, 2010 at 11:12 PM, Sean Bigdatafun <[email protected]> wrote:
> >> > > I am thinking of the following problem lately. I started thinking of this problem in the following context.
> >> > >
> >> > > I have a predefined budget and I can either
> >> > > -- A) purchase 8 more powerful servers (4 CPUs x 4 cores/CPU + 128GB mem + 16 x 1TB disks), or
> >> > > -- B) purchase 16 less powerful servers (2 CPUs x 4 cores/CPU + 64GB mem + 8 x 1TB disks)
> >> > > NOTE: I am basically making up a half-horsepower scenario
> >> > > -- Let's say I am going to use a 10Gbps network switch and each machine has a 10Gbps network card
> >> > >
> >> > > In the above scenario, does A or B perform better, or relatively the same? -- I guess this really depends on Hadoop's map/reduce scheduler.
> >> > >
> >> > > And then I have a follow-up question: does it make sense to virtualize a Hadoop datanode at all? (If the answer to the above question is "relatively the same", I'd say it does not make sense.)
> >> > >
> >> > > Thanks,
> >> > > Sean
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera

If you base your hardware purchase on someone else's application, you can (and will) end up over-buying or under-buying components. For example, someone might suggest 24 GB of RAM. They need that RAM because their application does a large amount of random reads and needs lots of caching. However, your application may not do as much random reading, and you truly did not need all that RAM. At $1,349.99 per node, this is a costly mistake (when you might have only needed 8 or 16 GB).

I would like to suggest an alternate approach. You need a smallish cluster of a few nodes to start. Figure out a way to send 1-10% of your traffic to your new system. Then profile your application to see whether it is READ-, WRITE-, or UPDATE-intensive, and see which components are under-performing and which are over-performing. Some (most) configurations will be similar, and you might get lucky if their use case matches yours. But if you are not lucky, you end up with a server with too many hard drives and not enough RAM.
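
To make the earlier "limit the number of map/reduce jobs" point concrete: on the Hadoop/HBase versions in this thread, the per-node task cap lives in mapred-site.xml and the HBase daemon heap in hbase-env.sh. A minimal sketch, assuming a box where you want roughly 7 map slots alongside an 8GB region server -- the numbers below are illustrative only, not a recommendation; size them to your own hardware and measured workload:

    <!-- mapred-site.xml: cap concurrent tasks per tasktracker (example values) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>

    # hbase-env.sh: heap in MB for HBase daemons, region server included (example value)
    export HBASE_HEAPSIZE=8000

That way a burst of MR tasks can't grab every core and push the region server into swap. The rest of the tuning (memstore and block cache fractions, handler counts) really does depend on the read/write mix you measure first, which is why profiling a slice of real traffic comes before buying or tuning anything.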
