I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
recently, both the master and all the nodes were running CentOS 5 (5.7,
to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
master. Our job load is mainly large numbers of single-slot jobs, but we
do have some users running parallel code.

Since the upgrade, parallel jobs have been failing at a fairly high
rate. Using Open MPI as the parallel library, the SGE error files of the
jobs report varying numbers of this error:

    error: commlib error: can't connect to service (Connection timed out)

Sometimes a job will report that error and seem to still run, and other
times it won't report the error but will fail. Still, it seems like
something new that shouldn't be happening. Also, AFAICT, there are no
corresponding messages in $SGE_ROOT/spool/qmaster/messages.
Does anyone have any ideas as to why I would be seeing this error (and
why it would be so much more frequent after the exec node OS upgrade)?
Any ideas on how to track it down? I'm admittedly at a bit of a loss
here.

Hi Joshua,

We ran into a problem with InfiniBand-based MPI jobs, caused by a change
between RHEL5 and RHEL6 in the default max locked memory ulimit that
init-spawned processes start with.
If you run a job that just does "ulimit -a" through both the old and the
new environments, do you see a difference? In particular, do you see a
difference in the max locked memory (ulimit -l)?
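
For that comparison, something along these lines could be submitted to one
old and one new node. It is only a sketch: it assumes bash lives at
/bin/bash on the exec nodes, and the function name report_limits is made up
for illustration.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -j y
# Sketch: report the resource limits a job actually inherits from sge_execd.
# report_limits is a hypothetical helper name, not part of SGE.
report_limits() {
    echo "host: $(uname -n)"
    echo "max locked memory (ulimit -l): $(ulimit -l)"
    echo "--- full ulimit -a ---"
    ulimit -a
}

report_limits
```

Saved under a name of your choosing and submitted with qsub (a
"-l hostname=<node>" request should pin it to a given node), the two output
files can then simply be diffed.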
Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd init
script, immediately before the sge_execd startup command, so that
sge_execd and everything it starts inherit the raised limit. In our case,
"unlimited" is the required value as per the QLogic InfiniBand setup
process.
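
To illustrate why the placement matters: a limit set in a shell is inherited
by whatever that shell starts afterwards. The following is a rough sketch,
not our actual init script; run_with_memlock is a hypothetical helper, and
the demonstration lowers the limit to 8 KB only because lowering needs no
root privileges, whereas raising it to unlimited generally does.

```shell
#!/bin/bash
# Sketch only: run a command with a given max-locked-memory limit.
# run_with_memlock is a hypothetical helper, not part of SGE.
run_with_memlock() {
    local limit=$1
    shift
    # Subshell, so the changed limit does not leak into the calling shell.
    (
        # Raising the limit normally needs root, which an init script has.
        ulimit -l "$limit" || exit 1
        # The command, and anything it forks, inherits the limit set above.
        exec "$@"
    )
}

# In the init script itself the equivalent is just the two steps:
#   ulimit -l unlimited
#   <the existing sge_execd startup command>
# Harmless demonstration with a lowered limit:
run_with_memlock 8 bash -c 'echo "child sees memlock limit: $(ulimit -l)"'
```

After restarting sge_execd on a node, a job that prints "ulimit -l" should
show whether the new value is actually reaching the jobs.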
--
Orlando