Hi All,

Apologies for cross-posting -- not sure which list is the most active these days…?
I'm currently having a real issue with our shared SGE_ROOT directory, which also contains the spool directories. It is XFS-formatted on the server, which also hosts the sgemaster daemon, and is shared via NFSv4. The cluster has 108 processors spread over 11 execution nodes, wired up with 1GE.

Under heavy, fast scheduling (i.e. *large* task arrays of very short jobs) we are experiencing server crashes: spinning rpciod and nfsd processes, on both the clients and the server, cause very high loadavg, alarm states, sgeexecd going into uninterruptible sleep, machines falling over, and so on.

I would have thought that the NFSv4 shared directory would cope with this load, since the cluster is not massive. However, we have our scheduling delay set to 0, so I'm wondering whether this is causing the issue. I'd like to check your collective experience on this one before changing the cluster config to use local spool dirs.

Many thanks,
Chris

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry CV4 7AL
UK
Tel: +44 (0)24 7615 0778
