On 03.04.2012 at 21:49, Joshua Baker-LePain wrote:

> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly 
> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until recently, 
> both the master and all the nodes were running CentOS 5 (5.7, to be precise). 
>  I upgraded the nodes to CentOS 6.2, but didn't touch the master.  Our job 
> load is mainly large numbers of single slot jobs, but we do have some users 
> running parallel code.
> 
> Since the upgrade, parallel jobs have been failing at a fairly high rate. 
> Using Open MPI as the parallel library, the SGE error files of the jobs 
> report varying numbers of this error:
> 
> error: commlib error: can't connect to service (Connection timed out)

Does ethtool show the correct speed for the network interface?
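
To run that check across many nodes, something like the following might help. It is only a minimal sketch: it assumes `ethtool` is installed and prints its usual `Speed: 1000Mb/s` line, and the function names `parse_ethtool_speed` and `link_speed` are made up for illustration.

```python
import re
import subprocess
from typing import Optional


def parse_ethtool_speed(output: str) -> Optional[int]:
    """Return the link speed in Mb/s found in ethtool output, or None.

    Assumes the usual "Speed: <N>Mb/s" line; a link that is down is
    typically reported as "Speed: Unknown!", which yields None here.
    """
    match = re.search(r"^\s*Speed:\s*(\d+)\s*Mb/s", output, re.MULTILINE)
    return int(match.group(1)) if match else None


def link_speed(interface: str) -> Optional[int]:
    """Run `ethtool <interface>` and return the reported speed in Mb/s."""
    # May need root on some systems; a non-zero exit status simply
    # results in None because no Speed line is found.
    result = subprocess.run(
        ["ethtool", interface],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_ethtool_speed(result.stdout)
```

Calling `link_speed("eth0")` on each node (the interface name is an assumption) and comparing the result with the expected value would show whether any NIC negotiated a lower speed after the upgrade.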


> Sometimes a job will report that error

So the error from the job itself is something other than a timeout? What is it 
in detail? Do you still use the mpiexec the application was compiled with, or 
do you start an old binary with a new mpiexec?

-- Reuti


> and seems to still run, and other times it won't report the error but will 
> fail.  Still, it seems like something new that shouldn't be happening.  Also, 
> AFAICT, there are no corresponding messages in 
> $SGE_ROOT/spool/qmaster/messages.
> 
> Does anyone have any ideas as to why I would be seeing this error (and why it 
> would be so much more frequent after the exec node OS upgrade)?  Any ideas on 
> how to track it down?  I'm admittedly at a bit of a loss here.
> 
> Thanks.
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

