On 03.04.2012 at 21:49, Joshua Baker-LePain wrote:

> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until recently,
> both the master and all the nodes were running CentOS 5 (5.7, to be precise).
> I upgraded the nodes to CentOS 6.2, but didn't touch the master. Our job
> load is mainly large numbers of single slot jobs, but we do have some users
> running parallel code.
>
> Since the upgrade, parallel jobs have been failing at a fairly high rate.
> Using Open MPI as the parallel library, the SGE error files of the jobs
> report varying numbers of this error:
>
> error: commlib error: can't connect to service (Connection timed out)

Does ethtool show the correct speed for the network interface?

> Sometimes a job will report that error

So the error from the failing job is something other than a timeout? What exactly does it say? Are you still using the mpiexec the application was compiled with, or are you starting an old binary with a new mpiexec?

-- Reuti

> and seem to still run, and other times it won't report the error but will
> fail. Still, it seems like something new that shouldn't be happening. Also,
> AFAICT, there are no corresponding messages in
> $SGE_ROOT/spool/qmaster/messages.
>
> Does anyone have any ideas as to why I would be seeing this error (and why it
> would be so much more frequent after the exec node OS upgrade)? Any ideas on
> how to track it down? I'm admittedly at a bit of a loss here.
>
> Thanks.
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
