Re: [gridengine users] Parallel jobs failure after OS upgrade

Dave Love Wed, 11 Apr 2012 07:10:26 -0700

<[email protected]> writes:

>> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
>> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
>> recently, both the master and all the nodes were running CentOS 5 (5.7,
>> to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
>> master. Our job load is mainly large numbers of single slot jobs, but we
>> do have some users running parallel code.
>>
>> Since the upgrade, parallel jobs have been failing at a fairly high
>> rate. Using Open MPI as the parallel library, the SGE error files of the
>> jobs report varying numbers of this error:
>>
>> error: commlib error: can't connect to service (Connection timed out)
>>
>> Sometimes a job will report that error and seem to still run, and other
>> times it won't report the error but will fail. Still, it seems like
>> something new that shouldn't be happening. Also, AFAICT, there are no
>> corresponding messages in $SGE_ROOT/spool/qmaster/messages.
>>
>> Does anyone have any ideas as to why I would be seeing this error
>> (and why it would be so much more frequent after the exec node OS
>> upgrade)? Any ideas on how to track it down? I'm admittedly at a bit
>> of a loss
>> here.
>>
>
> Hi Joshua,
>
> We ran into a problem with infiniband based MPI jobs caused by a
> change in the default max locked memory ulimit which init-spawned
> processes start with, between RHEL5 and RHEL6.


Good point, but if you have PSM, doesn't it complain about things like
that?  I see log messages for various failure modes, though I don't
think we've had trouble with memory locking.

Looking back up this thread, I'm not clear what's really failing, but
if it's remote startup failing randomly with non-builtin communication,
I'd suspect clashes in the transient port assignments.

> Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd
> init script, immediately before the sge_execd startup command. In our
> case, "unlimited" is the required value as per the QLogic infiniband
> setup process.

See sge_conf(5) for H_MEMORYLOCKED=unlimited in execd_params.
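
Whichever way the limit is raised, it may be worth confirming what a
job actually inherits from sge_execd rather than what a login shell
reports.  A minimal sketch, assuming Python is available on the compute
nodes; the function names are made up for illustration and are not part
of SGE:

```python
import resource


def describe_limit(value):
    """Render an rlimit value roughly the way `ulimit -l` would."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # RLIMIT_MEMLOCK is reported in bytes; `ulimit -l` shows kbytes.
    return "%d kB" % (value // 1024)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("max locked memory: soft=%s hard=%s" % (soft, hard))
```

Run as an ordinary batch job on an upgraded node, this should print
"unlimited" for both values once the change has taken effect; a small
figure there would suggest the execd was started with the old default.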

For what it's worth, the 8.0.0d release at
http://arc.liv.ac.uk/downloads/SGE/releases/8.0.0d/ works for me with a
Red Hat 5 master and Red Hat 6 compute nodes.  It will build/install
trivially on them, but if you want to share binaries between the two
you need at least the compatibility openssl package on 6 (maybe more --
I can't remember).  It also has a large number of fixes over OGS.

Also I'd be surprised if it was an NFS problem.  I haven't seen such
communications problems in our shared-everything environment, although
the main NFS server is Solaris, and most of the nodes are actually RH5.

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
