Hi,

Maybe just upgrade to OGS 6.2u5p2. You can do an in-place upgrade, or
install it into a different directory with a different port number.
Of course, you also need to compile Open MPI with SGE support.

Regards
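
One way to check whether an existing Open MPI build already has SGE support is to look for "gridengine" components in the output of ompi_info (support is normally enabled by passing --with-sge to Open MPI's configure). A rough sketch, assuming ompi_info is on your PATH; the function names are made up for illustration:

```python
import shutil
import subprocess


def has_gridengine_support(ompi_info_output: str) -> bool:
    """Return True if any MCA component line mentions gridengine."""
    for line in ompi_info_output.splitlines():
        # Component lines typically look like:
        #   "MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)"
        if line.strip().startswith("MCA") and "gridengine" in line:
            return True
    return False


def check_local_install() -> bool:
    """Run ompi_info from PATH and inspect its output (hypothetical helper)."""
    if shutil.which("ompi_info") is None:
        raise RuntimeError("ompi_info not found on PATH")
    result = subprocess.run(
        ["ompi_info"], capture_output=True, text=True, check=True
    )
    return has_gridengine_support(result.stdout)
```

Calling check_local_install() on an exec node should return True if the build was configured with SGE support; if it returns False, Open MPI probably needs to be rebuilt.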

On Apr 3, 2012, at 15:49, Joshua Baker-LePain <[email protected]> wrote:

> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly 
> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until recently, 
> both the master and all the nodes were running CentOS 5 (5.7, to be precise). 
>  I upgraded the nodes to CentOS 6.2, but didn't touch the master.  Our job 
> load is mainly large numbers of single slot jobs, but we do have some users 
> running parallel code.
> 
> Since the upgrade, parallel jobs have been failing at a fairly high rate. 
> Using Open MPI as the parallel library, the SGE error files of the jobs 
> report varying numbers of this error:
> 
> error: commlib error: can't connect to service (Connection timed out)
> 
> Sometimes a job will report that error and seem to still run, and other times 
> it won't report the error but will fail.  Still, it seems like something new 
> that shouldn't be happening.  Also, AFAICT, there are no corresponding 
> messages in $SGE_ROOT/spool/qmaster/messages.
> 
> Does anyone have any ideas as to why I would be seeing this error (and why it 
> would be so much more frequent after the exec node OS upgrade)?  Any ideas on 
> how to track it down?  I'm admittedly at a bit of a loss here.
> 
> Thanks.
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
