In the message dated: Thu, 11 Feb 2010 15:43:12 PST,
The pithy ruminations from Ski Kacoroski on 
<Re: [lopsa-tech] Spec'ing out a small compute cluster for R&D - Thanks for the
 feedback> were:
=> Hi,
=> 
=> I really, really appreciate all the ideas that folks have put forward 
=> from both email lists.  They have helped me to better define what the 
=> customer needs.  This is a development and test cluster used for testing 
=> new algorithms and benchmarks for parallel R.

Ah, much better.


=> 
=> The cluster will run custom statistics software that does analysis on 
=> very large data sets by spreading the work across nodes and cores so 
=> they need multiple nodes in addition to multiple cores.  The application 
=> is expected to be I/O bound as it will be moving files up to 400GB 
=> around between the nodes and between the nodes and the permanent 
=> storage.  The data on the nodes is "semi-permanent" which means I need 
=> to mirror the disks so a disk failure will not result in data loss.

Much better definition!

Will the same data partition need to be available under both Windows & Linux?

You can tweak the ROCKS custom partitioning to specify how the persistent
data partition (/state/partition1) is set up...and I suppose it would be 
possible to use NTFS instead of ext3.
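
For what it's worth, the tweak is just a file on the head node. Here's a rough
sketch of the idea in Python; the site-profiles path, the version number, and
the <part> directives are from memory of the ROCKS 5.x partitioning docs, so
treat all of them as assumptions and check them against your release:

# Sketch: install a replace-partition.xml so compute nodes get a large,
# persistent /state/partition1. Path, version string, and <part> syntax
# are assumptions based on the ROCKS 5.x docs -- verify before use.
import os

PROFILE_DIR = "/export/rocks/install/site-profiles/5.3/nodes"
PROFILE = os.path.join(PROFILE_DIR, "replace-partition.xml")

XML = """<?xml version="1.0" standalone="no"?>
<kickstart>
  <main>
    <part> / --size 16000 --ondisk sda </part>
    <part> swap --size 2000 --ondisk sda </part>
    <part> /state/partition1 --size 1 --grow --ondisk sda </part>
  </main>
</kickstart>
"""

with open(PROFILE, "w") as fh:
    fh.write(XML)

# Rebuild the distro so the next node install picks it up:
#   cd /export/rocks/install && rocks create distro
print("wrote %s" % PROFILE)

(That sketch assumes the mirroring is handled by a hardware RAID controller,
so the installer only ever sees a single sda.)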

=> 
=> So I am looking at:
=> 
=> - 5 nodes with (1) quad newhalem, 8 - 16GB ram, either mirrored 6Gb/s 

No. If you're using Nehalem you do NOT want 8 or 16GB of memory. Nehalem has 
three memory channels per socket and gives you the best performance with one 
DIMM per channel (one fully populated bank of three), so aim for a capacity 
that splits evenly three ways: 6GB, 12GB, 24GB, and so on. If you must have 
more memory (e.g., you want 24GB and the 8GB DIMMs are too expensive) then 
fully populate 2 banks (6 x 4GB) instead of partially populating one bank.

Populating multiple banks, even with identical DIMMs, will slow down access 
(the memory clocks down once there's more than one DIMM per channel).

There's a very good explanation from Dell:

        http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx
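
Dell's numbers aside, the arithmetic is easy to sanity-check. A throwaway
sketch, where the DIMM sizes and the 3-slots-per-channel figure are just
assumptions about a typical single-socket Nehalem board:

# Which total capacities can be spread evenly across Nehalem's three
# memory channels? (Assumes one socket and 3 DIMM slots per channel.)
def balanced_configs(target_gb, dimm_sizes=(2, 4, 8), channels=3, slots_per_channel=3):
    """Yield (dimm_count, dimm_size_gb) combos that hit target_gb with the
    same number of DIMMs in every channel."""
    for size in dimm_sizes:
        if target_gb % size:
            continue                    # can't hit the total with this DIMM size
        count = target_gb // size
        if count % channels == 0 and count // channels <= slots_per_channel:
            yield count, size

for total in (8, 12, 16, 24):
    combos = list(balanced_configs(total))
    print("%2d GB: %s" % (total, combos or "no balanced layout"))
# 8GB and 16GB come up empty; 12GB and 24GB have balanced layouts
# (e.g. 3 x 4GB, 3 x 8GB, 6 x 4GB).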


=> sata disks or multiple sas/scsi disks in a mirrored stripped setup.

If the I/O is heavily biased toward large sequential writes, without contention 
from multiple jobs per node, then SATA is probably fine...otherwise I'd lean 
towards SAS.


Mirrored drives are great. Not knowing when a drive has died is bad.

Consider yet-another-network...a connection to the IPMI controller (the BMC) in 
each node. That'll let you do out-of-band power cycling, and it gives you an 
OS-independent way to query each node for hardware status. In other words, the 
head node can learn that a hard drive has failed in node 3 without running any 
client-side software (which eliminates the need to configure & tweak it for 
both OSs).
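
The head-node side is just ipmitool. A rough sketch, where the BMC addresses,
the credentials, and how much useful drive detail lands in the event log are
all assumptions that vary by vendor:

# Sketch: poll each compute node's BMC via IPMI from the head node.
# BMC addresses and credentials below are made up for the example.
import subprocess

BMCS = {"compute-0-%d" % i: "10.1.1.%d" % (10 + i) for i in range(5)}
USER, PASSWD = "admin", "changeme"

def ipmi(bmc_ip, *args):
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_ip,
           "-U", USER, "-P", PASSWD] + list(args)
    return subprocess.check_output(cmd).decode()

for node, bmc in sorted(BMCS.items()):
    # Power state, regardless of which OS (or no OS) is running
    print(node, ipmi(bmc, "chassis", "power", "status").strip())
    # The System Event Log is where failed drives, fans, ECC errors etc.
    # usually show up -- exact wording varies by vendor.
    print(ipmi(bmc, "sel", "list"))

# Out-of-band power cycle of a wedged node:
#   ipmi(BMCS["compute-0-3"], "chassis", "power", "cycle")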

=> - two switches - one for data to SAN and controller, one for 
=> interconnects between nodes

OK. You can use GigE for the cluster interconnect, but it's not ideal; with 
400GB files moving between nodes, a single 1Gb/s link per node is going to be 
the bottleneck.

For the data switch, I'd make sure that it'll support jumbo frames.
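
Once it's cabled up, it's worth proving that jumbo frames actually work end to
end. A quick sketch, assuming the storage-facing interface is eth1, the head
node's own data interface is already at MTU 9000, and the node names resolve
on the data network:

# Sketch: push a 9000-byte MTU to each node's storage interface and verify
# the whole path passes jumbo frames. "eth1" and the node names are
# assumptions; passwordless root ssh to the nodes is assumed too.
import subprocess

NODES = ["compute-0-%d" % i for i in range(5)]

for node in NODES:
    subprocess.check_call(["ssh", node, "ip link set dev eth1 mtu 9000"])
    # 8972 bytes of payload + 28 bytes of IP/ICMP header = 9000.
    # -M do sets don't-fragment, so this only succeeds if the node, the
    # switch, and everything in between really passes jumbo frames.
    subprocess.check_call(["ping", "-c", "3", "-M", "do", "-s", "8972", node])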

=> - rhel/rocks and windows/hpc for the os set up as dual boot on each 
=> node.  This means that the developers will need to reconfigure the nodes 
=> by hand which is ok.

Not much of a big deal. I have no experience with Windows HPC, but with ROCKS 
the head node can signal each node to reboot. It may be a bit tricky to automate 
the OS selection, but for 5 nodes it wouldn't be terrible to do it manually.

In that case, I'd definitely want KVM (keyboard/video/mouse) access to each 
node...it will make troubleshooting boot errors much easier.
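
If you do decide to automate the OS selection, though, the head-node side is
only a few lines. The grub.conf edit below assumes GRUB legacy with Windows as
menu entry 1, which is pure guesswork about how the dual boot ends up laid out,
and the rocks run host syntax is worth double-checking against your release:

# Sketch: from the ROCKS head node, flip every compute node's GRUB default
# and reboot it. Assumes GRUB legacy (/boot/grub/grub.conf) with Windows as
# entry 1 and RHEL as entry 0 -- both assumptions about the final layout.
import subprocess

def boot_compute_nodes_into(entry):
    # 'rocks run host compute' runs the command on every compute appliance
    flip = "sed -i 's/^default=.*/default=%d/' /boot/grub/grub.conf" % entry
    subprocess.check_call(["rocks", "run", "host", "compute",
                           "command=%s && shutdown -r now" % flip])

boot_compute_nodes_into(1)    # reboot everything into Windows HPC
# boot_compute_nodes_into(0)  # and back into RHEL/ROCKS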

=> - single controller machine running esxi with 2 virtual machines - one 
=> is windows/hpc and one is rhel/rocks.  That way both the controller 
=> machines can run all the time.
=> - some sort of san or nas to provide shared space
=> 
=> I am not looking at redundant power supplies as this is a dev/test cluster.

You may want redundant power on the head node.

=> 
=> I do plan on checking into Silicon Mechanics and Dell.

I'd say that I've had good experience with both, but I'll reserve judgement
pending the outcome of my latest interaction with Dell/EMC support.

=> 
=> It looks like separate servers are a cheaper way to go than a blade system.

Absolutely.

Mark

=> 
=> Thanks again for everyone's feedback.
=> 
=> cheers,
=> 
=> ski
=> 
=> 
=> -- 
=> "When we try to pick out anything by itself, we find it
=>   connected to the entire universe"            John Muir
=> 
=> Chris "Ski" Kacoroski, [email protected], 206-501-9803
=> or ski98033 on most IM services



_______________________________________________
Tech mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
