I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
recently, both the master and all the nodes were running CentOS 5 (5.7,
to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
master. Our job load is mainly large numbers of single-slot jobs, but we
do have some users running parallel code.
Since the upgrade, parallel jobs have been failing at a fairly high
rate. Using Open MPI as the parallel library, the SGE error files of the
jobs report varying numbers of this error:
error: commlib error: can't connect to service (Connection timed out)
Sometimes a job will report that error and still seem to run; other
times it won't report the error but will fail anyway. Either way, it's
new behaviour that shouldn't be happening. Also, AFAICT, there are no
corresponding messages in $SGE_ROOT/spool/qmaster/messages.
Does anyone have any ideas as to why I would be seeing this error (and
why it would be so much more frequent after the exec node OS upgrade)?
Any ideas on how to track it down? I'm admittedly at a bit of a loss here.
Hi Joshua,
We ran into a problem with InfiniBand-based MPI jobs, caused by a change
between RHEL5 and RHEL6 in the default max locked memory ulimit that
init-spawned processes start with.
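A quick way to confirm whether that's what's biting you (assuming the
nodes expose /proc/<pid>/limits, which the RHEL6 kernel does) is to
compare the limit the init-spawned execd inherited with what an
ordinary login shell gets:

  # limit inherited by sge_execd, and hence by everything it spawns
  grep "locked memory" /proc/$(pgrep -x sge_execd)/limits
  # limit in an interactive shell on the same node, for comparison
  ulimit -l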
Good point, but if you have PSM, doesn't it complain about things like
that? I see log messages for various failure modes, though I don't
think we've had trouble with memory locking.
Yup - though I didn't find the error messages very useful:
node389.4808ipath_userinit: mmap of rcvhdrq failed: Resource temporarily
unavailable
node389.4808Driver initialization failure on /dev/ipath (err=23)
I eventually got a hint about what was happening by doing a pair of
straces: one of the job started under gridengine, and one of it started
outside gridengine. Of course, I'd checked the ulimits previously, but
for some reason had totally failed to spot the problem!
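Roughly what that looked like, from memory (so treat it as a sketch;
mpi_test stands in for whichever binary was actually failing):

  # trace the job as gridengine starts it on a node
  qrsh strace -f -o /tmp/trace.sge ./mpi_test
  # trace the same binary from an ordinary login shell on that node
  strace -f -o /tmp/trace.login ./mpi_test
  # then compare the getrlimit/setrlimit and mmap calls in the two traces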
Looking back up this thread, I'm not clear what's actually failing, but
if it's remote startup failing randomly with non-builtin communication,
I'd suspect clashes in the transient (ephemeral) port assignments.
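One way to check that theory (assuming the SGE ports are registered in
/etc/services rather than set through $SGE_QMASTER_PORT/$SGE_EXECD_PORT)
is to compare the daemon ports with the range the kernel hands out for
transient connections, on both the old and new nodes:

  # transient/ephemeral port range used for outgoing connections
  sysctl net.ipv4.ip_local_port_range
  # ports the SGE daemons listen on, from /etc/services
  getent services sge_qmaster sge_execd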
Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd
init script, immediately before the sge_execd startup command. In our
case, "unlimited" is the required value as per the QLogic InfiniBand
setup process.
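Concretely, the change amounted to something like this in the init
script (a sketch rather than a patch; paths and the exact startup line
will vary between installs):

  # in the sgeexecd init script, just before sge_execd is started:
  # raise the max locked memory limit so the daemon - and every job
  # shepherd and task it spawns - inherits it
  ulimit -l unlimited
  # ...existing sge_execd startup command follows unchanged...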
See sge_conf(5) for H_MEMORYLOCKED=unlimited in execd_params.
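That amounts to an edit to the cluster (or per-host) configuration, e.g.
(assuming an execd recent enough to honour the parameter):

  # qconf -mconf           for the global configuration, or
  # qconf -mconf <host>    for a single execution host
  execd_params   H_MEMORYLOCKED=unlimited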
Ahh - great, thanks for that. Much cleaner to do it that way!
For what it's worth, http://arc.liv.ac.uk/downloads/SGE/releases/8.0.0d/
specifically works for me with a Red Hat 5 master and Red Hat 6 compute
nodes. It will build/install trivially on them, but you need at least
the compatibility openssl package for 6 (maybe more -- I can't remember)
if you want to share binaries. It also has a large number of fixes over
OGS.
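If memory serves, the compatibility package on 6 is openssl098e (that
name is from recollection, so double-check it before relying on it):

  # on the CentOS/RHEL 6 nodes, to run binaries built against the
  # RH5-era openssl libraries
  yum install openssl098e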
Also I'd be surprised if it was an NFS problem. I haven't seen such
communications problems in our shared-everything environment, although
the main NFS server is Solaris, and most of the nodes are actually RH5.
--
Dr Orlando Richards
Information Services
IT Infrastructure Division
Unix Section
Tel: 0131 650 4994
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users