[gridengine users] Parallel jobs failure after OS upgrade

Joshua Baker-LePain Tue, 03 Apr 2012 12:51:11 -0700

I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until recently, both the master and all the nodes were running CentOS 5 (5.7, to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the master. Our job load is mainly large numbers of single-slot jobs, but we do have some users running parallel code.

Since the upgrade, parallel jobs have been failing at a fairly high rate. Using Open MPI as the parallel library, the SGE error files of the jobs report varying numbers of this error:


error: commlib error: can't connect to service (Connection timed out)

Sometimes a job will report that error and seem to still run, and other times it won't report the error but will fail. Still, it seems like something new that shouldn't be happening. Also, AFAICT, there are no corresponding messages in $SGE_ROOT/spool/qmaster/messages.

Does anyone have any ideas as to why I would be seeing this error (and why it would be so much more frequent after the exec node OS upgrade)? Any ideas on how to track it down? I'm admittedly at a bit of a loss here.
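
As a possible starting point for tracking it down, the error counts per job could be tallied from the job error files and then compared against which jobs actually failed. The following is only a minimal sketch, not something taken from SGE itself: the function names are hypothetical, and the "*.e*" glob assumes the error files follow a <jobname>.e<jobid> style of naming, which may not match a given setup.

```python
from pathlib import Path

# Error text as it appears in the job error files quoted above.
COMMLIB_ERROR = "commlib error: can't connect to service (Connection timed out)"


def count_commlib_errors(text):
    """Return the number of lines in text that contain the commlib timeout error."""
    return sum(1 for line in text.splitlines() if COMMLIB_ERROR in line)


def tally_error_files(directory, pattern="*.e*"):
    """Map each matching file name in directory to its commlib error count.

    The default pattern is an assumption about how the job error files
    are named; adjust it to whatever the jobs really write.
    """
    counts = {}
    for path in Path(directory).glob(pattern):
        if path.is_file():
            # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
            counts[path.name] = count_commlib_errors(path.read_text(errors="replace"))
    return counts
```

Calling tally_error_files() on the directory holding the job output returns a dict of file name to error count, which makes it easier to see whether the error clusters on particular jobs.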


Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
