I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
recently, both the master and all the nodes were running CentOS 5 (5.7,
to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
master. Our job load is mainly large numbers of single-slot jobs, but we
do have some users running parallel code.
Since the upgrade, parallel jobs have been failing at a fairly high
rate. Using Open MPI as the parallel library, the SGE error files of the
jobs report varying numbers of this error:
error: commlib error: can't connect to service (Connection timed out)
Sometimes a job will report that error and still seem to run; other
times it won't report the error but will fail anyway. Either way, it's
new behaviour that shouldn't be happening. Also, AFAICT, there are no
corresponding messages in $SGE_ROOT/spool/qmaster/messages.
Does anyone have any ideas as to why I would be seeing this error (and
why it would be so much more frequent after the exec node OS upgrade)?
Any ideas on how to track it down? I'm admittedly at a bit of a loss here.
Hi Joshua,
We ran into a problem with InfiniBand-based MPI jobs, caused by a change
between RHEL5 and RHEL6 in the default max locked memory ulimit that
init-spawned processes start with.
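A quick way to confirm whether that's what's biting you (assuming the
nodes expose /proc/<pid>/limits, which the RHEL6 kernel does) is to
compare the limit the init-spawned execd inherited with what an
ordinary login shell gets:

  # limit inherited by sge_execd, and hence by everything it spawns
  grep "locked memory" /proc/$(pgrep -x sge_execd)/limits
  # limit in an interactive shell on the same node, for comparison
  ulimit -l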
Good point, but if you have PSM, doesn't it complain about things like
that? I see log messages for various failure modes, though I don't
think we've had trouble with memory locking.
Yup - though I didn't find the error messages very useful:
node389.4808ipath_userinit: mmap of rcvhdrq failed: Resource temporarily
unavailable
node389.4808Driver initialization failure on /dev/ipath (err=23)
I eventually got a hint about what was happening by doing a pair of
straces: one of the job started under gridengine, and one of it started
outside gridengine. Of course, I'd checked the ulimits previously, but
for some reason had totally failed to spot the problem!
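Roughly what that looked like, from memory (so treat it as a sketch;
mpi_test stands in for whichever binary was actually failing):

  # trace the job as gridengine starts it on a node
  qrsh strace -f -o /tmp/trace.sge ./mpi_test
  # trace the same binary from an ordinary login shell on that node
  strace -f -o /tmp/trace.login ./mpi_test
  # then compare the getrlimit/setrlimit and mmap calls in the two traces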
Looking back up this thread, I'm not clear what's actually failing, but
if it's remote startup failing randomly with non-builtin communication,
I'd suspect clashes in the transient (ephemeral) port assignments.
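One way to check that theory (assuming the SGE ports are registered in
/etc/services rather than set through $SGE_QMASTER_PORT/$SGE_EXECD_PORT)
is to compare the daemon ports with the range the kernel hands out for
transient connections, on both the old and new nodes:

  # transient/ephemeral port range used for outgoing connections
  sysctl net.ipv4.ip_local_port_range
  # ports the SGE daemons listen on, from /etc/services
  getent services sge_qmaster sge_execd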
Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd
init script, immediately before the sge_execd startup command. In our
case, "unlimited" is the required value as per the QLogic InfiniBand
setup process.
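Concretely, the change amounted to something like this in the init
script (a sketch rather than a patch; paths and the exact startup line
will vary between installs):

  # in the sgeexecd init script, just before sge_execd is started:
  # raise the max locked memory limit so the daemon - and every job
  # shepherd and task it spawns - inherits it
  ulimit -l unlimited
  # ...existing sge_execd startup command follows unchanged...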
See sge_conf(5) for H_MEMORYLOCKED=unlimited in execd_params.
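That amounts to an edit to the cluster (or per-host) configuration, e.g.
(assuming an execd recent enough to honour the parameter):

  # qconf -mconf           for the global configuration, or
  # qconf -mconf <host>    for a single execution host
  execd_params   H_MEMORYLOCKED=unlimited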
Ahh - great, thanks for that. Much cleaner to do it that way!
For what it's worth, http://arc.liv.ac.uk/downloads/SGE/releases/8.0.0d/
specifically works for me with a Red Hat 5 master and Red Hat 6 compute
nodes. It will build/install trivially on them, but you need at least
the compatibility openssl package for 6 (maybe more -- I can't remember)
if you want to share binaries. It also has a large number of fixes over
OGS.
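If memory serves, the compatibility package on 6 is openssl098e (that
name is from recollection, so double-check it before relying on it):

  # on the CentOS/RHEL 6 nodes, to run binaries built against the
  # RH5-era openssl libraries
  yum install openssl098e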
Also I'd be surprised if it was an NFS problem. I haven't seen such
communications problems in our shared-everything environment, although
the main NFS server is Solaris, and most of the nodes are actually RH5.
--
Dr Orlando Richards
Information Services
IT Infrastructure Division
Unix Section
Tel: 0131 650 4994
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users