On Wed, 11 Apr 2012 at 3:08pm, Dave Love wrote:

Looking up this thread, I'm not clear what's really failing, but if it's
remote startup failing randomly with non-builtin communication, I'd
suspect clashes in the transient port assignments.

Frankly, I'm not sure exactly what's failing either. But the failures occur with Open MPI under SGE's standard tight integration, so everything *is* builtin.

Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd
init script, immediately before the sge_execd startup command. In our
case, "unlimited" is the required value as per the QLogic infiniband
setup process.
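
For anyone wanting to try the same thing, the change described above amounts to roughly the sketch below. The layout of the sgeexecd script and the exact form of the startup line are assumptions here and differ between installations; only the "ulimit -l unlimited" line comes from the description above.

```shell
#!/bin/sh
# Sketch of the relevant part of an sgeexecd init script (layout assumed).

# Raise the locked-memory limit for this shell; processes started from it
# inherit the limit. This needs root or a sufficiently high hard limit,
# so a failure is reported rather than silently ignored.
ulimit -l unlimited || echo "warning: could not raise the memlock limit" >&2

# The script's existing sge_execd startup command follows here, unchanged.
# It is left as a comment because its exact form depends on the installation.
```

The placement is the point: the limit has to be raised in the same shell that starts sge_execd, since a limit set elsewhere is not inherited by the daemon or the jobs it spawns.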

See sge_conf(5) for H_MEMORYLOCKED=unlimited in execd_params.

I don't think this will help (given I tried the ulimit-in-initscript trick), but I'll note for completeness' sake that this option isn't available in my very old SGE version (6.1u3).
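
One way to confirm whether the init-script change actually reaches jobs is to print the limit from inside a job and compare it with a login shell on the same node. This is a minimal sketch, and the function name report_memlock is made up for illustration:

```shell
#!/bin/sh
# Print the soft and hard locked-memory limits seen by this process,
# prefixed with the node name so output from several hosts can be compared.
report_memlock() {
    printf '%s memlock soft=%s hard=%s\n' \
        "$(uname -n)" "$(ulimit -S -l)" "$(ulimit -H -l)"
}

report_memlock
```

Submitted as a job script, this reports what sge_execd hands to its children; run by hand on the same host, it reports what a login shell gets. A mismatch between the two would suggest the init-script limit is not being inherited.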

For what it's worth, http://arc.liv.ac.uk/downloads/SGE/releases/8.0.0d/
specifically works for me with a Red Hat 5 master and Red Hat 6 compute
nodes.  It will build/install trivially on them, but you need at least
the compatibility openssl package for 6 (maybe more -- I can't remember)
if you want to share binaries.  It also has a large number of fixes over
OGS.

Have you tested using parallel jobs with (e.g.) 200 slots spread over a number of hosts? For us, at least, the problem seems to rear its head only as the parallel slot count starts getting somewhat high. Also, given that neither Google nor the lists seem to know about this issue, I'm wondering if it isn't something specific to our environment.

Also I'd be surprised if it was an NFS problem.  I haven't seen such
communications problems in our shared-everything environment, although
the main NFS server is Solaris, and most of the nodes are actually RH5.

Yeah, we use NFS minimally and it really isn't anywhere in my differential.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
