Re: [gridengine users] Parallel jobs failure after OS upgrade

orlando . richards Wed, 11 Apr 2012 02:47:21 -0700

I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
recently, both the master and all the nodes were running CentOS 5 (5.7,
to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
master. Our job load is mainly large numbers of single slot jobs, but we
do have some users running parallel code.


Since the upgrade, parallel jobs have been failing at a fairly high
rate. Using Open MPI as the parallel library, the SGE error files of the
jobs report varying numbers of this error:

error: commlib error: can't connect to service (Connection timed out)

Sometimes a job will report that error and seem to still run, and other
times it won't report the error but will fail. Still, it seems like
something new that shouldn't be happening. Also, AFAICT, there are no
corresponding messages in $SGE_ROOT/spool/qmaster/messages.

Does anyone have any ideas as to why I would be seeing this error (and
why it would be so much more frequent after the exec node OS upgrade)?
Any ideas on how to track it down? I'm admittedly at a bit of a loss
here.


Hi Joshua,

We ran into a problem with infiniband based MPI jobs caused by a change in
the default max locked memory ulimit which init-spawned processes start
with, between RHEL5 and RHEL6.

If you run a job through the old and new environments which just does
"ulimit -a", do you see a difference? Particularly - do you see a
difference in the max locked memory (ulimit -l)?
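
Something along these lines would do as the test job. It is only a sketch: the function name report_limits is made up for illustration, and it assumes bash is available on the exec nodes. Submit it with qsub to a node in each environment and compare the output files.

```shell
#!/bin/bash
#$ -S /bin/bash
# Minimal test job (illustrative): print the resource limits that a job
# actually inherits from sge_execd on the node where it runs.

report_limits() {
    # Node name, so outputs from different nodes can be told apart.
    echo "host: $(uname -n)"
    # The value of interest: max locked memory, in kbytes or "unlimited".
    echo "max locked memory (ulimit -l): $(ulimit -l)"
    # Full listing, for comparing the remaining limits as well.
    ulimit -a
}

report_limits
```

A small value reported on the upgraded nodes, against a larger one or "unlimited" on the old ones, would point at the locked memory limit.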

Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd init
script, immediately before the sge_execd startup command. In
our case, "unlimited" is the required value as per the QLogic infiniband
setup process.
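
As a rough sketch of that change (the surrounding script contents differ between installs, so the placement comment and the warning message are illustrative rather than copied from a real sgeexecd script):

```shell
# Excerpt for the start section of the sgeexecd init script (illustrative).
# Raise the max locked memory limit so that sge_execd, and every job it
# spawns, inherits it. Raising the limit needs root privileges, which the
# init script normally has at boot; warn rather than abort if it fails.
ulimit -l unlimited || echo "warning: could not raise max locked memory limit" >&2

# ... the existing sge_execd startup command follows here, unchanged ...
```

The execd has to be restarted through the init script for the new limit to take effect, and the "ulimit -l" test job should then report "unlimited".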



--
Orlando


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
