Stuart Barkley <[email protected]> writes:

> SGE is killing jobs on nodes unrelated to the nodes powered off. It
> appears to actually kill all other jobs on the cluster.

Reports of that in the past seem to have been blamed on a problem with the spooling at least with bdb, but that seems unlikely in this case.

> I'll need to take a look. It is possible that something was left
> behind from earlier. I haven't rebooted all the other nodes recently

There are definitely problems with execd not sorting out parallel job tasks after a crash, but I'm not sure if that applies here. Maybe check host messages files as well as qmaster's.

> Historically, I've not liked shared NFS file systems with lots of R/W
> across many systems and I started my installation testing with systems
> without good shared NFS server.

For low-ish throughput and parallel jobs over ~10 nodes, I don't understand why it would be a problem. We're OK with ~100 nodes and the GE spool on the same filesystem as the nodes' stateless image.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
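
On checking the host messages files as well as qmaster's: a rough Python sketch of that check, in case it helps. It assumes the usual pipe-separated messages layout (date and time, component, host, level, text) with E and C marking errors and critical entries; the layout is an assumption here and may differ between versions. The names `severe_entries` and `scan` are made up for illustration and are not part of Grid Engine.

```python
from pathlib import Path

# Assumed layout of a Grid Engine messages line (not confirmed in this thread):
#   date time|component|host|level|text
# where level is one of I (info), W (warning), E (error), C (critical).
SEVERE = {"E", "C"}


def severe_entries(lines, levels=SEVERE):
    """Return (timestamp, host, level, text) for lines at the given levels."""
    found = []
    for line in lines:
        # Split on the first four pipes only, so pipes in the text survive.
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # continuation line or unexpected format
        timestamp, _component, host, level, text = parts
        if level.strip() in levels:
            found.append((timestamp.strip(), host.strip(), level.strip(), text))
    return found


def scan(paths):
    """Print severe entries from each messages file that exists."""
    for path in map(Path, paths):
        if not path.is_file():
            continue  # skip hosts whose spool directory is not visible here
        with path.open(errors="replace") as handle:
            for entry in severe_entries(handle):
                print(path, *entry, sep=" | ")
```

Call `scan()` with the qmaster messages file and each execution host's messages file. Where those live depends on how the spool directories are configured at your site, so the paths are left to you.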
