Hi Miguel,

Miguel Gila <[email protected]> writes:

> Hi Loris,
>
> not sure this has been answered before, but have you found a solution
> to it? We've also seen this, but never came up with the right
> solution; after throwing a few scontrol requeue/hold/release/resume in
> a pseudo-random order, we get the system to reschedule the jobs. Not
> sure if it is because of our doing, or because the scheduler is doing
> its job :)

No, I didn't find a solution, but after 2 1/2 days with BadConstraints,
the job ran. I assume the draining node went into the drained state, was
rebooted, and became available again. On our system, the probability of
a pending MPI job having one of its nodes drained is fairly small, so it
is not such a problem. Of course, if a node failed completely, that
would be different, but perhaps in that case the job would be
rescheduled automatically.

Cheers,

Loris

> Cheers,
> Miguel
>
>> On 24 May 2016, at 08:46, Loris Bennett <[email protected]> wrote:
>>
>> Hi,
>>
>> The 'Reason' field for a pending job has changed from 'Priority' to
>> 'BadConstraints'. This seems to be because the status of one of the
>> nodes in the node list reported by 'scontrol show job' has changed to
>> 'draining'. The job itself just specifies the number of tasks
>> required, not specific nodes.
>>
>> Shouldn't the scheduler just be able to replace the draining node with
>> another node in the projected node list? This is happening with
>> version 15.08.8.
>>
>> Cheers,
>>
>> Loris
>>
>> --
>> Dr. Loris Bennett (Mr.)
>> ZEDAT, Freie Universität Berlin         Email [email protected]

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email [email protected]
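P.S. The scontrol hold/release/requeue cycle mentioned above can be
scripted instead of run in a pseudo-random order. Below is a minimal
dry-run sketch, not a recommendation: the job ID 12345 is a placeholder,
and whether forcing a re-evaluation this way actually clears
BadConstraints will depend on the site and Slurm version. With DRY_RUN=1
(the default) it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: try to nudge a pending job out of BadConstraints by making the
# scheduler re-evaluate it. Set DRY_RUN=0 to actually run the commands
# on a real cluster; by default they are only echoed.
JOBID=${1:-12345}       # placeholder job ID
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hold and release forces the scheduler to re-plan the job, which should
# drop the draining node from the projected node list.
run scontrol hold "$JOBID"
run scontrol release "$JOBID"
```

On a real system one would also check `scontrol show job $JOBID` before
and after to see whether the Reason field has changed back to Priority.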
