No, I make no additional resource requests, and as far as I'm aware neither do my users. I don't use complexes or an sge_request file. I do have a JSV script that takes the requested PE name (say orte) and rewrites it to the correct PE for a given queue (for example, orte_old for queue old.q). However, that has worked fine for many months, and the problem persists even when I disable the JSV.
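For reference, the rewriting the JSV does can be sketched as a small shell function. This is only an illustration of the mapping described above, not the actual script from this cluster: the function name map_pe and the second queue new.q are hypothetical, and in a real server-side JSV this logic would sit inside jsv_on_verify() using jsv_get_param / jsv_set_param from $SGE_ROOT/util/resources/jsv/jsv_include.sh.

```shell
#!/bin/sh
# Sketch of the PE-rewriting logic described above (hypothetical names).
# Maps a generic PE request (e.g. "orte") to the queue-specific PE,
# e.g. "orte_old" for queue "old.q".
map_pe() {
    pe="$1"      # PE name the user requested, e.g. "orte"
    queue="$2"   # target queue, e.g. "old.q"
    case "$queue" in
        old.q) echo "${pe}_old" ;;   # older hardware gets its own PE
        new.q) echo "${pe}_new" ;;   # hypothetical second queue
        *)     echo "$pe" ;;         # otherwise leave the request unchanged
    esac
}

map_pe orte old.q    # prints "orte_old"
```

In an actual JSV, the rewritten name would be written back with something like `jsv_set_param pe_name "$(map_pe "$pe" "$queue")"` before accepting the job.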
On Tue, Oct 16, 2012 at 11:32 AM, Reuti <[email protected]> wrote:

> On 16.10.2012 at 16:58, Andrew Pearson wrote:
>
>> Hi. I have a cluster running Rocks 5.4 that has been working perfectly well for a long time. Now, suddenly, a problem has emerged. Jobs requesting more than a few slots fail to run, remaining in qw indefinitely.
>>
>> When I do qstat -j <job #> on the problem job, I get the message "cannot run in PE "orte_old" because it only offers 25 slots". However, there are 86 cores available to the queue/PE I'm using, and a sufficient number of them are free that my job should start immediately. The PE in question has slots set to 9999 and uses $fill_up.
>
> Do you request anything in addition, like memory (or maybe a default is requested by the complex definition, sge_request file or a JSV)?
>
> -- Reuti
>
>> I've tried restarting the cluster and restarting the qmaster, and nothing changes. I've also checked that all of the machines can communicate with each other using qrsh hostname.
>>
>> This problem began shortly after a power outage -- is there anything that doesn't get touched during a node reinstallation that could be corrupted that would cause this? Thanks for any help you can give.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
