On 16.10.2012 at 16:58, Andrew Pearson wrote:

> Hi.  I have a cluster running Rocks 5.4 that has been working perfectly well 
> for a long time.  Now, suddenly, a problem has emerged.  Jobs requesting more 
> than a few slots fail to run, remaining in qw indefinitely.
> 
> When I run qstat -j <job #> on a stuck job, I get the message 'cannot run in PE 
> "orte_old" because it only offers 25 slots'.  However, there are 86 cores 
> available to the queue/PE I'm using, and a sufficient number of them are free 
> that my job should start immediately.  The PE in question has slots set to 
> 9999 and uses $fill_up.  

Do you request anything in addition, like memory (or maybe a default is 
requested by the complex definition, sge_request file or a JSV)?
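
A rough sketch of how one might check those three places; treat it as a starting point, not a definitive procedure. It assumes SGE_ROOT and SGE_CELL are set (the /opt/gridengine and "default" fallbacks are only guesses at a typical Rocks layout), the function names are made up for illustration, and the PE name is the one from the message above.

```shell
#!/bin/sh
# Hypothetical helpers for tracking down an implicit resource request.

sge_default_request_files() {
    # Candidate sge_request files: cluster-wide, per-user, per-directory.
    # The fallback paths are assumptions, adjust them to your installation.
    echo "${SGE_ROOT:-/opt/gridengine}/${SGE_CELL:-default}/common/sge_request"
    echo "$HOME/.sge_request"
    echo "$PWD/.sge_request"
}

check_default_requests() {
    # Show the non-comment lines of every request file that exists.
    sge_default_request_files | while IFS= read -r f; do
        if [ -f "$f" ]; then
            echo "== $f"
            grep -v '^#' "$f"
        fi
    done

    # Without the SGE tools there is nothing more to look at.
    command -v qconf >/dev/null 2>&1 || { echo "qconf not in PATH"; return 0; }

    qconf -sc                    # complexes: look for a consumable with a non-zero default
    qconf -sp orte_old           # PE definition: slots, allocation_rule
    qconf -sconf | grep jsv_url  # a server-side JSV, if one is configured
}
```

Running check_default_requests on the qmaster host and comparing the "default" column of qconf -sc against what the exec hosts actually offer after the reinstallation would show whether such a request limits the PE to 25 slots.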

-- Reuti


> I've tried restarting the cluster and restarting the qmaster, and nothing 
> changes.  I've also checked that all of the machines can communicate with 
> each other using qrsh hostname.
> 
> This problem began shortly after a power outage -- is there anything that 
> doesn't get touched during a node reinstallation that could be corrupted that 
> would cause this?  Thanks for any help you can give.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

