Amandeep Khurana wrote:
Inline.

On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:

Quick question about data node hardware. I've read a few articles which cover the basics, including Cloudera's recommendations here:
http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

The article is from early 2010, but I'm assuming the general guidelines haven't deviated much from the recommended baselines. I'm skewing my build towards the "compute optimized" side of the spectrum, which calls for a 1:1 core-to-spindle ratio and more RAM per node for in-memory caching.



Why are you skewing more towards compute optimized? Are you expecting to run compute-intensive MR interacting with HBase tables?
Correct. We'll be storing dense raw numerical time-based data, which will need to be transformed (decimated, FFTed, correlated, etc.) with relatively low latency (under 10 seconds). We also expect repeated reads, where the same piece of data is "looked" at more than once within a short window. This is where we're hoping in-memory caching and data node affinity can help us.
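
To make that access pattern concrete, here's roughly what a read looks like in my head (just a sketch against the 0.94-ish HBase client API; the table name and row key layout are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeSliceRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // "timeseries" is a placeholder table; row keys are assumed to
            // sort as sensorId|timestamp, so a start/stop row pair selects
            // a contiguous time window.
            HTable table = new HTable(conf, "timeseries");

            Scan scan = new Scan(Bytes.toBytes("sensor42|t00001000"),
                                 Bytes.toBytes("sensor42|t00002000"));
            scan.setCacheBlocks(true); // keep fetched blocks in the region
                                       // server's block cache, so re-reads of
                                       // the same window are served from RAM
            scan.setCaching(1000);     // rows per RPC; fewer round trips

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // hand r off to the decimation/FFT/correlation stage
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }

The second and later passes over the same window are the reads we'd like the block cache (and the extra RAM) to absorb.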
Another important consideration is low(ish) power consumption. With that in mind I specced out the following (per node):

Chassis: 1U Supermicro chassis with 2x 1Gb/sec ethernet ports (http://www.supermicro.com/products/system/1u/5017/sys-5017c-mtf.cfm) (~500USD)
Memory: 32GB Unbuffered ECC RAM (~280USD)
Disks: 4x 2TB Hitachi Ultrastar 7200RPM SAS drives (~960USD)



You can use plain SATA. Don't need SAS.
This is a government-sponsored project, so some requirements (like MTBF and spindle warranty) are "set in stone", but I'll look into that.
CPU: 1x Intel E3-1230-v2 (3.3GHz, 4 cores / 8 threads, 69W) (~240USD)



Consider getting dual hex core CPUs.
I'm trying to avoid that for two reasons: dual-socket boards are (1) more expensive and (2) power hungry. Additionally, the CPUs for those boards are also more expensive and less power efficient than their single-socket counterparts (compare Intel's E3 and E5 line pricing). The guidelines from the quoted article state:

"Compute Intensive Configuration (2U/machine): Two quad core CPUs, 48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used when a combination of large in-memory models and heavy reference data caching is required."

Two of my 1U machines, which are equivalent to this recommendation, have 8 (very fast, low-wattage) cores, 64GB RAM, and 8x 2TB disks.

The network backbone will be a dedicated high-powered switch (not sure which one yet), with each node using link aggregation across its two 1Gb ports.
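
For the link aggregation I was picturing plain Linux bonding on each node, something like this (RHEL-style config, assuming the switch supports 802.3ad/LACP; device names and addresses are placeholders):

    # /etc/sysconfig/network-scripts/ifcfg-bond0 -- placeholder address
    DEVICE=bond0
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=10.0.0.11
    NETMASK=255.255.255.0
    BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (eth1 gets the same treatment)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none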

Does this look reasonable? We're looking at buying 4-5 of these for our initial test bench for under $10,000, and plan to expand to about 50-100 nodes by next year.
