No, I make no additional resource requests, and as far as I'm aware neither
do my users.  I don't use complexes or an sge_request file.  I do have a JSV
script that takes the requested PE name (say orte) and rewrites it to the
correct PE for a given queue (for example orte_old for queue old.q), but it
has been working fine for many months, and the problem remains even when I
disable it.
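For context, the JSV does nothing more exotic than a name rewrite. A minimal
sketch of that mapping logic (the function name and the ".q"-suffix convention
here are illustrative, not the actual script):

```shell
# Illustrative sketch only -- not the actual JSV.  It shows the kind of
# PE-name rewrite described above: the requested PE plus the target
# queue's base name give the queue-specific PE (orte + old.q -> orte_old).
map_pe() {
    pe="$1"
    queue="$2"
    suffix="${queue%.q}"        # drop the trailing ".q" from the queue name
    printf '%s_%s\n' "$pe" "$suffix"
}

# In a real JSV this result would be applied with jsv_set_param pe_name
# inside jsv_on_verify, after sourcing the jsv_include helper that ships
# under $SGE_ROOT/util/resources/jsv/.
map_pe orte old.q    # -> orte_old
```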

On Tue, Oct 16, 2012 at 11:32 AM, Reuti <[email protected]> wrote:

> Am 16.10.2012 um 16:58 schrieb Andrew Pearson:
>
> > Hi.  I have a cluster running Rocks 5.4 that has been working perfectly
> well for a long time.  Now, suddenly, a problem has emerged.  Jobs
> requesting more than a few slots fail to run, remaining in qw indefinitely.
> >
> > When I do qstat -j <job #> on the problem job, I get the message "cannot
> run in PE "orte_old" because it only offers 25 slots".  However, there are 86
> cores available to the queue/PE I'm using, and a sufficient number of them
> are free that my job should start immediately.  The PE in question has
> slots set to 9999 and uses $fill_up.
>
> Do you request anything in addition, like memory (or maybe a default is
> requested by the complex definition, sge_request file, or a JSV)?
>
> -- Reuti
>
>
> > I've tried restarting the cluster and restarting the qmaster, and
> nothing changes.  I've also checked that all of the machines can
> communicate with each other using qrsh hostname.
> >
> > This problem began shortly after a power outage -- is there anything that
> a node reinstallation doesn't touch, that could have been corrupted and
> would cause this?  Thanks for any help you can give.
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
>
>
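One way to cross-check the "only offers 25 slots" figure is with the standard
SGE inspection commands below (output depends on the live cluster, so this is
only a checklist, not a transcript):

```shell
# Cross-checks for the "only offers 25 slots" message (run as the SGE admin):
qconf -sp orte_old        # PE definition: slots limit and allocation_rule ($fill_up)
qconf -sq old.q           # queue config: per-host slots and the pe_list entry
qstat -g c                # per-cluster-queue totals: used vs. available slots
qstat -f -q old.q         # per-instance view: look for E/au/d states hiding slots
qhost -q                  # hosts with their queue instances and slot counts
```

After a power outage it is worth looking in the qstat -f output for queue
instances stuck in E (error) or au (alarm/unreachable) state: their slots are
excluded from what the PE "offers", which could shrink 86 cores down to 25.
Error states can be cleared with qmod once the underlying cause is fixed (the
exact -c/-cq syntax varies by Grid Engine version).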
