I now have two jobs that consistently reproduce the issue. If I put
either of them on hold and then release it, then every time the nodes
in the job's SchedNodeList become available, SLURM seems to hit some
condition that prevents it from actually starting the job. As a
result, the nodes in the SchedNodeList are stuck idle indefinitely.
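
For reference, the reproduction boils down to a hold/release cycle on
one of the affected jobs; a rough sketch, where 12345 stands in for
the actual job ID:

# put one of the affected jobs on hold, then release it
scontrol hold 12345
scontrol release 12345

# once the nodes in its SchedNodeList free up, the job still does not
# start; its expected start time just keeps slipping
scontrol show job 12345 | grep SchedNodeList
squeue -j 12345 --start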

This is rather annoying, and dangerous too: the other day a couple of
such jobs ended up hogging hundreds of nodes on our Cray XC.

Could somebody please look into this?

--
Szilárd


On Tue, Feb 17, 2015 at 10:56 PM, Szilárd Páll <[email protected]> wrote:
> Hi,
>
> There is some behavior here that I can't explain and would like
> advice on, in particular because it is either very strange behavior
> or an outright bug.
>
> On the local cluster we are testing the use of feature flags to
> constrain jobs to parts of the machine. There are per-cabinet
> features defined, e.g.:
> NodeName=nid000[04-27,29-63]  Feature=c0-0
>
> When submitting I constrain my job using a matching OR constraint, e.g.:
> sbatch --constraint=[c0-0|c1-0|c2-0|...]
>
> After submitting the job, SLURM starts making "room" for it, but when
> all nodes in the job's SchedNodeList become available, the job still
> does not start and those nodes are simply kept idle. Moreover, if
> multiple such jobs are in the queue, they can get hundreds of nodes
> assigned between them, and when the nodes become available the
> respective jobs just keep having their StartTime pushed back bit by
> bit.
>
> This is not always the outcome of a constrained submission (some of
> my jobs did go through), but so far the problem has been quite
> reliably reproducible.
>
> * Does this sound like "normal" behavior under certain circumstances
> (like mis-configuration) or could it be a bug?
> * Is there any alternative, perhaps better way to implement the above?
>
> Cheers,
> --
> Szilárd
