Stuart Barkley <[email protected]> writes:

> SGE is killing jobs on nodes unrelated to the nodes powered off.  It
> appears to actually kill all other jobs on the cluster.

Past reports of that seem to have been blamed on a spooling problem,
at least with bdb, but that seems unlikely in this case.

> I'll need to take a look.  It is possible that something was left
> behind from earlier.  I haven't rebooted all the other nodes recently.

There are definitely problems with execd not sorting out parallel job
tasks after a crash, but I'm not sure if that applies here.

Maybe check the execution hosts' messages files as well as the qmaster's.
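
As a rough aid for going through those files, here is a minimal Python
sketch that pulls out only the error and critical entries.  It assumes
the usual pipe-separated layout of a messages line (timestamp, source,
host, level letter, text) with E and C as the levels of interest; the
function names are made up for illustration and this is not part of
SGE itself.

```python
import sys


def parse_line(line):
    """Split one messages line into (timestamp, source, host, level, text).

    Assumes the layout 'timestamp|source|host|level|text'; returns None
    for lines that do not have five pipe-separated fields.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, source, host, level, text = parts
    return timestamp.strip(), source.strip(), host.strip(), level.strip(), text


def filter_messages(lines, levels=("E", "C")):
    """Return the parsed entries whose level letter is in `levels`."""
    selected = []
    for line in lines:
        record = parse_line(line)
        if record is not None and record[3] in levels:
            selected.append(record)
    return selected


if __name__ == "__main__":
    # Usage: python scan_messages.py FILE [FILE ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for ts, _source, host, level, text in filter_messages(handle):
                print(f"{path}: {ts} {host} [{level}] {text}")
```

Pass it the qmaster messages file plus the messages file of each
affected execution host.  Where those live depends on how spooling was
configured at install time, so no paths are assumed here.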

> Historically, I've not liked shared NFS file systems with lots of R/W
> across many systems and I started my installation testing with systems
> without a good shared NFS server.

For low-ish throughput and parallel jobs over ~10 nodes, I don't
understand why it would be a problem.  We're OK with ~100 nodes and the
GE spool on the same filesystem as the nodes' stateless image.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
