Is SELinux on or off?

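A quick way to check this on an exec node, as a rough sketch only: it assumes the stock `getenforce` tool that ships with CentOS is installed, and the helper name `selinux_mode` is made up for illustration.

```shell
#!/bin/sh
# Sketch: report the SELinux mode of the node this runs on.
# Assumption: getenforce is available (it is part of the standard
# SELinux userland on CentOS). selinux_mode is a hypothetical helper.

selinux_mode() {
    # getenforce prints one of: Enforcing, Permissive, Disabled
    if command -v getenforce >/dev/null 2>&1; then
        getenforce 2>/dev/null || echo "unknown"
    else
        # Tool not installed, so the mode cannot be determined this way
        echo "unknown"
    fi
}

echo "$(uname -n): SELinux=$(selinux_mode)"
```

Running it on a node where parallel jobs fail and on one where they succeed would show whether the two differ.
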
On Apr 3, 2012, at 19:43, Rayson Ho <[email protected]> wrote:

> Is it possible that some nodes have a firewall running while some don't?
>
> Rayson
>
> On Tue, Apr 3, 2012 at 3:49 PM, Joshua Baker-LePain <[email protected]> wrote:
>> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
>> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
>> recently, both the master and all the nodes were running CentOS 5 (5.7, to
>> be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
>> master. Our job load is mainly large numbers of single slot jobs, but we do
>> have some users running parallel code.
>>
>> Since the upgrade, parallel jobs have been failing at a fairly high rate.
>> Using Open MPI as the parallel library, the SGE error files of the jobs
>> report varying numbers of this error:
>>
>> error: commlib error: can't connect to service (Connection timed out)
>>
>> Sometimes a job will report that error and seem to still run, and other
>> times it won't report the error but will fail. Still, it seems like
>> something new that shouldn't be happening. Also, AFAICT, there are no
>> corresponding messages in $SGE_ROOT/spool/qmaster/messages.
>>
>> Does anyone have any ideas as to why I would be seeing this error (and why
>> it would be so much more frequent after the exec node OS upgrade)? Any
>> ideas on how to track it down? I'm admittedly at a bit of a loss here.
>>
>> Thanks.
>>
>> --
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> UCSF
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
