We're running slurm-2.2.4 on CentOS-5.5, using sched/wiki to interface to a custom scheduler, and there seems to be a bug that occurs whenever a job is requeued in slurm (either manually or due to node failure).
Normally, submitted jobs are held until a wiki command tells slurm to launch them. When a job is requeued, though, slurm does not hold it; instead it allocates a node for it and launches it itself. The wiki-based scheduler expects every requeued job to end up pending in the slurm queue, but I typically find it running (on a different node), and havoc ensues.

Here are some slurmctld logs showing one occurrence of this behavior. I've interspersed comments for clarity.

Jul 22 11:05:30 radmin1 slurmctld[8254]: _slurm_rpc_submit_batch_job JobId=2661723 usec=376
Jul 22 11:06:56 radmin1 slurmctld[8254]: sched: Allocate JobId=2661723 NodeList=rn359 #CPUs=1
Jul 22 11:07:23 radmin1 slurmctld[8254]: completing job 2661723

### Node rn359 becomes unresponsive
Jul 22 11:07:23 radmin1 slurmctld[8254]: Non-responding node, requeue JobId=2661723
Jul 22 11:07:23 radmin1 slurmctld[8254]: sched: job_complete for JobId=2661723 successful

### Slurm requeues the job and allocates it onto a different node
Jul 22 11:07:41 radmin1 slurmctld[8254]: requeue batch job 2661723
Jul 22 11:07:51 radmin1 slurmctld[8254]: sched: Allocate JobId=2661723 NodeList=rn364 #CPUs=1

### wiki attempts to schedule the job it thought would be pending
Jul 22 11:08:53 radmin1 slurmctld[8254]: error: wiki: Attempt to start job 2661723 in state RUNNING

### havoc ensues

How can we get the slurm requeue to always put a hold on jobs that get requeued?

-JE
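P.S. One workaround we may try, assuming slurm.conf's JobRequeue parameter also governs requeues triggered by node failure, is to disable automatic requeueing entirely so that failed jobs simply terminate and the wiki scheduler can resubmit them on its own terms:

```
# slurm.conf fragment (untested with sched/wiki): prevent slurmctld from
# requeueing batch jobs itself after a node failure
JobRequeue=0
```

This sidesteps the problem rather than holding requeued jobs, so it may not be the right fix if manual requeues still need to work.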
