SGE is fine on 1Gb fabrics, and I don't know of anyone who uses 10Gb for SGE unless it's a combined network fabric carrying storage and application traffic along with SGE traffic on the same links, or unless they are running all-new gear with 10Gb for everything and maybe a 1Gb NIC held back for ILO/DRAC/IPMI/provisioning usage.

The commonly accepted point at which you'd hit a scaling limit on an Ethernet network would most likely be determined not by Grid Engine traffic but by:
 - Network filesystem traffic for shared storage
 - Application message passing traffic

I don't see SGE native traffic as a huge consumer of bandwidth or network resources in most cases. It's the "other stuff" that blows out the network.

And there is no one-size-fits-all answer there, as people's HPC footprints vary wildly by how they are used and what they are architected for.

SGE can run at massive scale over a 1Gb network fabric without issues. The only time a 1Gb network becomes the bottleneck is when you try to stuff NFS and application traffic down the same pipe. And even then you'd hit performance and job throughput problems before you hit a scaling limit wall. If you've got SGE running on a mostly free 1Gb fabric (maybe it's your admin or provisioning network etc.) you'd be fine at even large scale.
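To put rough numbers on the "same pipe" problem, here is a back-of-the-envelope sketch in Python. It assumes a single shared-storage server behind one 1Gb link, with bandwidth split evenly among the nodes and no protocol overhead. The function name and the cluster sizes are made up for illustration, not measurements from any real site.

```python
def per_client_share_mbps(link_mbps: float, active_clients: int) -> float:
    """Idealized upper bound on per-client throughput when every client
    pulls from one server link at once and the link is shared evenly.
    Real NFS traffic will do worse (protocol overhead, contention)."""
    if active_clients < 1:
        raise ValueError("active_clients must be at least 1")
    return link_mbps / active_clients


if __name__ == "__main__":
    LINK_MBPS = 1000.0  # one 1Gb link, expressed in Mb/s
    # Hypothetical cluster sizes, chosen only to show the trend.
    for nodes in (10, 100, 1000):
        share = per_client_share_mbps(LINK_MBPS, nodes)
        print(f"{nodes:>5} nodes -> at most {share:.1f} Mb/s each")
```

The best-case share shrinks linearly with node count, which is why storage and application traffic tend to hurt job throughput long before Grid Engine's own traffic matters.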

The sorts of tuning you'd do to run "big SGE" on a 1Gb fabric would be to:

 - Tune the qmaster host to handle the # of endpoints expected
 - Make darn sure application traffic and storage traffic are on a different network
 - If you have to share the 1GbE with other traffic, then configure SGE for local spooling. The danger here is performance impact, not scaling
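The post does not say which qmaster knobs the first tuning item means. One plausible candidate, and this is an assumption rather than something stated above, is the open file descriptor limit on the qmaster host, since a daemon holding one connection per endpoint needs at least that many descriptors. A minimal Python sketch of that sanity check follows; `fd_headroom`, the `reserve` value, and the endpoint count are all hypothetical.

```python
def fd_headroom(expected_endpoints: int, soft_limit: int, reserve: int = 100) -> int:
    """Descriptors left over after one per expected endpoint plus a
    reserve for logs, spool files, and client commands.
    A negative result means the limit is too low for this estimate."""
    return soft_limit - (expected_endpoints + reserve)


if __name__ == "__main__":
    import resource  # Unix-only standard library module

    EXPECTED_ENDPOINTS = 2000  # hypothetical number of exec hosts

    # Soft limit for the current process; run this as the qmaster's user.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        print("open file limit is unlimited")
    else:
        left = fd_headroom(EXPECTED_ENDPOINTS, soft)
        print(f"soft limit {soft}, headroom {left}")
```

If the headroom comes out negative, raising the limit for the account that runs the qmaster would be the obvious first thing to look at. Check your Grid Engine documentation for what it actually needs at your scale.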

My $.02




Lane, William <mailto:william.l...@cshs.org>
September 24, 2015 at 6:04 PM
If a cluster is running on a relatively slow speed networking backbone (say gigabit ethernet or 10 Gb ethernet as opposed to InfiniBand), is there any commonly accepted point at which increasing the number of nodes in a queue negatively affects the performance of the queue? Is there any general rule about how many nodes to have in a queue based on a given network backbone?

-Bill L.


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
