Hi.  I have a cluster running Rocks 5.4 that has been working perfectly
well for a long time.  Now, suddenly, a problem has emerged.  Jobs
requesting more than a few slots fail to run, remaining in qw indefinitely.

When I run qstat -j <job #> on a stuck job, I get the message "cannot run in
PE "orte_old" because it only offers 25 slots".  However, there are 86
cores available to the queue/PE I'm using, and a sufficient number of them
are free that my job should start immediately.  The PE in question has
slots set to 9999 and uses $fill_up.
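
The slot counts above can be cross-checked against what the scheduler
currently sees.  What follows is only a minimal sketch, not a definitive
diagnosis: free_slots is a hypothetical helper (not part of Grid Engine), and
it assumes the SGE 6.2 "qstat -f" layout, where column 3 is resv/used/tot. and
a sixth column appears only when a queue instance has a state flag such as
E or au.  The PE name orte_old is taken from the message quoted above.

```shell
#!/bin/sh
# Hypothetical helper: sum the free slots of queue instances that have no
# state flag. Reads `qstat -f` output on stdin.
# Assumption: column 3 is resv/used/tot. and a 6th column is the state.
free_slots() {
  awk '$3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
         split($3, s, "/")
         # Count only instances with an empty state column.
         if (NF < 6) free += s[3] - s[1] - s[2]
       }
       END { print free + 0 }'
}

# These commands only make sense on a host with the Grid Engine client tools.
if command -v qstat >/dev/null 2>&1; then
  qconf -sp orte_old                  # PE definition: slots, allocation_rule
  qstat -f -explain E                 # queue instances, with reasons for error states
  qstat -f -pe orte_old | free_slots  # free slots in queues attached to the PE
fi
```

If the number printed is much lower than the 86 cores expected, the state
column of "qstat -f" shows which queue instances are being excluded.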

I've tried restarting the cluster and restarting the qmaster, and nothing
changes.  I've also checked that all of the machines can communicate with
each other using qrsh hostname.

This problem began shortly after a power outage.  Is there any state that a
node reinstallation leaves untouched that could have been corrupted and would
cause this?  Thanks for any help you can give.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
