I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until recently, both the master and all the nodes were running CentOS 5 (5.7, to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the master. Our job load is mainly large numbers of single-slot jobs, but we do have some users running parallel code.

Since the upgrade, parallel jobs have been failing at a fairly high rate. The jobs use Open MPI as the parallel library, and their SGE error files report varying numbers of this error:

error: commlib error: can't connect to service (Connection timed out)

Sometimes a job will report that error and still seem to run, and other times it won't report the error but will fail. Either way, the error is new since the upgrade and shouldn't be happening. Also, AFAICT, there are no corresponding messages in $SGE_ROOT/spool/qmaster/messages.

Does anyone have any ideas as to why I would be seeing this error (and why it would be so much more frequent after the exec node OS upgrade)? Any ideas on how to track it down? I'm admittedly at a bit of a loss here.
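
As a first step in tracking this down, it might help to put numbers on which jobs hit the error and how often. The following is a minimal Python sketch, not anything taken from our setup. The function name tally_commlib_errors is hypothetical, and the code assumes the job error files sit in a single directory and use SGE's default "<job_name>.e<job_id>" naming. The search string is the error text quoted above.

```python
import re
from collections import Counter
from pathlib import Path

# Error text as it appears in the job error files quoted in this post.
COMMLIB_ERROR = "commlib error: can't connect to service (Connection timed out)"

# Assumption: error files follow the default "<job_name>.e<job_id>" pattern.
# Files with other suffixes (for example ".o<job_id>") are skipped.
JOB_ID_RE = re.compile(r"\.e(\d+)$")


def tally_commlib_errors(error_dir):
    """Return a Counter mapping job ID (str) to the number of timeout lines."""
    counts = Counter()
    for path in Path(error_dir).iterdir():
        match = JOB_ID_RE.search(path.name)
        if not match or not path.is_file():
            continue
        # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
        with path.open(errors="replace") as handle:
            hits = sum(1 for line in handle if COMMLIB_ERROR in line)
        if hits:
            counts[match.group(1)] += hits
    return counts
```

Calling tally_commlib_errors on a directory of error files returns per-job counts. Those could then be compared against the list of jobs that actually failed, to see how well the error correlates with failures.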

Thanks.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
