We're running slurm-2.2.4 on CentOS-5.5 using sched/wiki to interface to
a custom scheduler, and there seems to be a bug happening anytime a job
is requeued in slurm (either manually or due to node failure).

Normally, submitted jobs are held until a wiki command is sent telling
slurm to launch the job.  When a job is requeued, though, slurm does not
hold the job and instead actually allocates a node for it and launches
it.  The wiki-based scheduler expects all requeued jobs to end up pending
in the slurm queue but instead I typically find them running (on a
different node) and havoc ensues.  Here are some slurmctld logs showing
an occurance of this behavior.  I've interspersed comments for clarity.


Jul 22 11:05:30 radmin1 slurmctld[8254]: _slurm_rpc_submit_batch_job
JobId=2661723 usec=376
Jul 22 11:06:56 radmin1 slurmctld[8254]: sched: Allocate JobId=2661723
NodeList=rn359 #CPUs=1
Jul 22 11:07:23 radmin1 slurmctld[8254]: completing job 2661723

### Node rn359 becomes unresponsive
Jul 22 11:07:23 radmin1 slurmctld[8254]: Non-responding node, requeue
JobId=2661723
Jul 22 11:07:23 radmin1 slurmctld[8254]: sched: job_complete for
JobId=2661723 successful

### Slurm requeues job and allocates onto different node
Jul 22 11:07:41 radmin1 slurmctld[8254]: requeue batch job 2661723
Jul 22 11:07:51 radmin1 slurmctld[8254]: sched: Allocate JobId=2661723
NodeList=rn364 #CPUs=1

### wiki attempts to schedule job that it thought would be pending
Jul 22 11:08:53 radmin1 slurmctld[8254]: error: wiki: Attempt to start
job 2661723 in state RUNNING
### havoc ensues

How can we get the slurm requeue to always put a hold on jobs that get
requeued?
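
One workaround we're considering (untested on our end, and I'm assuming
the slurm.conf JobRequeue parameter behaves the same way in 2.2.4) is to
disable slurm's automatic requeue entirely, so that a job killed by a
node failure simply completes in a failed state and the wiki scheduler
can resubmit it itself:

```
# slurm.conf: disable automatic requeue on node failure, so slurmctld
# never re-launches a job behind the external scheduler's back
JobRequeue=0

# Alternatively, per job at submit time:
#   sbatch --no-requeue job.sh
```

That sidesteps the problem rather than fixing it, though -- what we'd
really like is for requeued jobs to land back in the queue in a held
state, as newly submitted jobs do.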

-JE