Hi. I have a cluster running Rocks 5.4 that has been working perfectly well for a long time. Now, suddenly, a problem has emerged. Jobs requesting more than a few slots fail to run, remaining in qw indefinitely.
When I run qstat -j <job #> on one of the stuck jobs, I get the message 'cannot run in PE "orte_old" because it only offers 25 slots'. However, there are 86 cores available to the queue/PE I'm using, and enough of them are free that my job should start immediately. The PE in question has slots set to 9999 and uses the $fill_up allocation rule. I've tried restarting the cluster and restarting the qmaster, and nothing changes. I've also checked that all of the machines can communicate with each other using qrsh hostname. This problem began shortly after a power outage. Is there anything that doesn't get touched during a node reinstallation that could be corrupted and cause this? Thanks for any help you can give.
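For reference, here is a rough sketch of the read-only commands behind the checks above, in case the output of any of them is useful. The helper names (run, sge_pe_diag) and the PE and JOB_ID variables are made up for illustration; only the qconf and qstat invocations are standard Grid Engine client commands. If the client tools are not on the PATH, the script just prints what it would run.

```shell
#!/bin/sh
# Hypothetical helper, not part of Grid Engine or Rocks.
# PE defaults to the parallel environment named in the error message.
PE="${PE:-orte_old}"
JOB_ID="${JOB_ID:-}"

# Run a command if its binary is installed; otherwise only print it.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

sge_pe_diag() {
    # PE definition: shows the slots value and the allocation_rule ($fill_up here)
    run qconf -sp "$PE"
    # Per-cluster-queue summary of used, available, and total slots
    run qstat -g c
    # Per-host queue instances; the states column flags instances such as
    # E (error), u (unknown), or d (disabled)
    run qstat -f
    # Scheduler messages for one pending job, if a job id was supplied
    if [ -n "$JOB_ID" ]; then
        run qstat -j "$JOB_ID"
    fi
    return 0
}

sge_pe_diag
```

Setting JOB_ID to the number of a job stuck in qw before running the script adds the qstat -j output for that job.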
