Hi, Miguel,

Miguel Gila <[email protected]> writes:

> Hi Loris,
>
> Not sure if this has been answered before, but have you found a solution
> to it? We've also seen this, but never came up with the right
> solution. After throwing a few scontrol requeue/hold/release/resume in
> a pseudo-random order, we get the system to reschedule the jobs. Not
> sure whether that's down to our doing or to the scheduler doing its
> job :)

No, I didn't find a solution, but after 2 1/2 days with BadConstraints,
the job ran.  I assume the draining node went into the drained state,
was rebooted and became available again.  On our system, the probability
of a pending MPI job having one of its nodes drained is fairly
small, so it is not such a problem.

Of course, if a node failed completely, that would be different, but
perhaps in that case the job would be rescheduled automatically.
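
For what it's worth, these are the sort of commands one can use to see
what the scheduler thinks of the job and the node, plus the requeue
workaround you mention (the job ID 12345 and node name node042 are just
placeholders):

  # Show the pending job's state and the scheduler's Reason field
  squeue -j 12345 -o "%i %T %r"

  # Check whether the node in the projected node list is draining/drained
  scontrol show node node042 | grep -E 'State=|Reason='

  # Requeue the job so the scheduler builds a fresh node list for it
  scontrol requeue 12345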

Cheers,

Loris

> Cheers,
> Miguel
>> On 24 May 2016, at 08:46, Loris Bennett <[email protected]> wrote:
>> 
>> 
>> Hi,
>> 
>> The 'Reason' field for a pending job has changed from 'Priority' to
>> 'BadConstraints'.  This seems to be because the status of one of the
>> nodes in the node list reported by 'scontrol show job' has changed to
>> 'draining'.  The job itself just specifies the number of tasks required,
>> not specific nodes.
>> 
>> Shouldn't the scheduler just be able to replace the draining node with
>> another node in the projected node list?  This is happening with version
>> 15.08.8.
>> 
>> Cheers,
>> 
>> Loris
>> 
>> -- 
>> Dr. Loris Bennett (Mr.)
>> ZEDAT, Freie Universität Berlin         Email [email protected]

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
