Is it possible that some nodes have a firewall running while others don't?

Rayson
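
One rough way to test the firewall theory is to probe the same TCP port on every exec node from one machine and compare the results. The sketch below is only an illustration, not an SGE tool: the `probe` and `scan` helpers are hypothetical names, and port 6445 is an assumption (a commonly used sge_execd port; check $SGE_EXECD_PORT or /etc/services on your site). A "timeout" result tends to point at packets being silently dropped, which is typical of a firewall, while "refused" usually means the host is reachable but nothing is listening (or a rule is rejecting the connection).

```python
import socket

# Assumption: 6445 is a commonly used sge_execd port; substitute your site's value.
ASSUMED_EXECD_PORT = 6445


def probe(host, port, timeout=3.0):
    """Classify one TCP connection attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout: packets may be dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # Host answered, but no listener accepted the connection.
        return "refused"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and similar.
        return "error: %s" % exc


def scan(hosts, port=ASSUMED_EXECD_PORT, timeout=3.0):
    """Probe the same port on each host and return a {host: result} dict."""
    return {host: probe(host, port, timeout) for host in hosts}
```

Calling something like `scan(["node001", "node002"])` (hypothetical hostnames) from the master and from an exec node, and then looking for nodes whose result differs from the rest, would be one way to narrow things down. Note that this only checks a single fixed port; connections between exec nodes for parallel jobs may use other ports, so a clean result here does not rule a firewall out.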



On Tue, Apr 3, 2012 at 3:49 PM, Joshua Baker-LePain <[email protected]> wrote:
> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
> recently, both the master and all the nodes were running CentOS 5 (5.7, to
> be precise).  I upgraded the nodes to CentOS 6.2, but didn't touch the
> master.  Our job load is mainly large numbers of single slot jobs, but we do
> have some users running parallel code.
>
> Since the upgrade, parallel jobs have been failing at a fairly high rate.
> Using Open MPI as the parallel library, the SGE error files of the jobs
> report varying numbers of this error:
>
> error: commlib error: can't connect to service (Connection timed out)
>
> Sometimes a job will report that error and seem to still run, and other
> times it won't report the error but will fail.  Still, it seems like
> something new that shouldn't be happening.  Also, AFAICT, there are no
> corresponding messages in $SGE_ROOT/spool/qmaster/messages.
>
> Does anyone have any ideas as to why I would be seeing this error (and why
> it would be so much more frequent after the exec node OS upgrade)?  Any
> ideas on how to track it down?  I'm admittedly at a bit of a loss here.
>
> Thanks.
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
