<[email protected]> writes:

>> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
>> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
>> recently, both the master and all the nodes were running CentOS 5 (5.7,
>> to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
>> master. Our job load is mainly large numbers of single slot jobs, but we
>> do have some users running parallel code.
>>
>> Since the upgrade, parallel jobs have been failing at a fairly high
>> rate. Using Open MPI as the parallel library, the SGE error files of the
>> jobs report varying numbers of this error:
>>
>> error: commlib error: can't connect to service (Connection timed out)
>>
>> Sometimes a job will report that error and seem to still run, and other
>> times it won't report the error but will fail. Still, it seems like
>> something new that shouldn't be happening. Also, AFAICT, there are no
>> corresponding messages in $SGE_ROOT/spool/qmaster/messages.
>>
>> Does anyone have any ideas as to why I would be seeing this error
>> (and why it would be so much more frequent after the exec node OS
>> upgrade)? Any ideas on how to track it down? I'm admittedly at a bit
>> of a loss here.
>>
>
> Hi Joshua,
>
> We ran into a problem with infiniband based MPI jobs caused by a
> change in the default max locked memory ulimit which init-spawned
> processes start with, between RHEL5 and RHEL6.
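
One way to see whether a node is affected by the locked-memory change described above is to print the limit a process actually inherits. This is only a minimal sketch, not something taken from the thread: it assumes a Python interpreter with the standard `resource` module is available on the exec nodes, and the function name `describe_memlock_limit` is made up for illustration.

```python
import resource


def describe_memlock_limit():
    """Return the (soft, hard) max-locked-memory limits as readable strings."""
    # RLIMIT_MEMLOCK is the limit that "ulimit -l" reports; getrlimit returns bytes.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY is how the kernel reports "unlimited".
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return "%d kB" % (value // 1024)

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = describe_memlock_limit()
    print("max locked memory: soft=%s hard=%s" % (soft, hard))
```

Running it once inside a batch job and once from a login shell on the same node should show whether processes started under sge_execd get a smaller limit than interactive ones.
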

Good point, but if you have PSM, doesn't it complain about things like
that? I see log messages for various failure modes, though I don't
think we've had trouble with memory locking.

Looking up this thread, I'm not clear what's really failing, but if it's
remote startup failing randomly with non-builtin communication, I'd
suspect clashes in the transient port assignments.

> Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd
> init script, immediately before the sge_execd startup command. In our
> case, "unlimited" is the required value as per the QLogic infiniband
> setup process.

See sge_conf(5) for H_MEMORYLOCKED=unlimited in execd_params.

For what it's worth,
http://arc.liv.ac.uk/downloads/SGE/releases/8.0.0d/ specifically works
for me with a Red Hat 5 master and Red Hat 6 compute nodes. It will
build/install trivially on them, but you need at least the compatibility
openssl package for 6 (maybe more -- I can't remember) if you want to
share binaries. It also has a large number of fixes over OGS.

Also I'd be surprised if it was an NFS problem. I haven't seen such
communications problems in our shared-everything environment, although
the main NFS server is Solaris, and most of the nodes are actually RH5.

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
