Hi Josh,

This one-line patch should solve the problem. It will be in version 2.3.0 when that is released, but you can apply it to 2.2.4 if desired.

Moe


diff --git a/src/plugins/sched/wiki/sched_wiki.c b/src/plugins/sched/wiki/sched_wiki.c
index b33b9d6..9d9a668 100644
--- a/src/plugins/sched/wiki/sched_wiki.c
+++ b/src/plugins/sched/wiki/sched_wiki.c
@@ -171,7 +171,7 @@ char *slurm_sched_strerror( int errnum )
 /**************************************************************************/
 void slurm_sched_plugin_requeue( struct job_record *job_ptr, char *reason )
 {
-       /* Empty. */
+       job_ptr->priority = 0;
 }

 /**************************************************************************/
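
For anyone wondering why one assignment is enough: in SLURM a job whose priority is zero is treated as held, so clearing the priority in the requeue hook leaves the job pending until something outside raises it again. Below is a minimal, self-contained sketch of that idea. It is not code from slurmctld; struct demo_job, demo_requeue_hook() and demo_can_start() are hypothetical names made up for illustration, and the eligibility check is a simplification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for slurmctld's struct job_record; only the
 * field touched by the patch is modeled here. */
struct demo_job {
	uint32_t job_id;
	uint32_t priority;	/* 0 is taken to mean "held" */
};

/* Mirrors what the patched hook does: zero the priority on requeue. */
static void demo_requeue_hook(struct demo_job *job)
{
	job->priority = 0;
}

/* Hypothetical, simplified eligibility check: a held job (priority 0)
 * is never picked up for allocation. */
static bool demo_can_start(const struct demo_job *job)
{
	return job->priority != 0;
}
```

With this model, a job that starts with a non-zero priority is eligible to run, and after demo_requeue_hook() it is not, which is the behavior the external scheduler expects for requeued jobs.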

Quoting Josh England <[email protected]>:

We're running slurm-2.2.4 on CentOS-5.5 using sched/wiki to interface to
a custom scheduler, and there seems to be a bug happening anytime a job
is requeued in slurm (either manually or due to node failure).
Normally, submitted jobs are held until a wiki command is sent telling
slurm to launch the job.  When a job is requeued, though, slurm does not
hold the job and instead actually allocates a node for it and launches
it.  I'm expecting all requeued jobs to end up pending in the slurm
queue, but instead I typically find them running (on a different node),
and havoc ensues.  Here are some slurmctld logs showing an occurrence of
this behavior.  I've interspersed comments for clarity.


Jul 22 11:05:30 radmin1 slurmctld[8254]: _slurm_rpc_submit_batch_job JobId=2661723 usec=376
Jul 22 11:06:56 radmin1 slurmctld[8254]: sched: Allocate JobId=2661723 NodeList=rn359 #CPUs=1
Jul 22 11:07:23 radmin1 slurmctld[8254]: completing job 2661723
### Node rn359 becomes unresponsive
Jul 22 11:07:23 radmin1 slurmctld[8254]: Non-responding node, requeue JobId=2661723
Jul 22 11:07:23 radmin1 slurmctld[8254]: sched: job_complete for JobId=2661723 successful
### Slurm requeues job and allocates onto different node
Jul 22 11:07:41 radmin1 slurmctld[8254]: requeue batch job 2661723
Jul 22 11:07:51 radmin1 slurmctld[8254]: sched: Allocate JobId=2661723 NodeList=rn364 #CPUs=1
### wiki attempts to schedule job that it thought would be pending
Jul 22 11:08:53 radmin1 slurmctld[8254]: error: wiki: Attempt to start job 2661723 in state RUNNING
### havoc ensues

How can we get the slurm requeue to always put a hold on jobs that get
requeued?

-JE






