Re: [slurm-dev] requeue bug

jette Thu, 11 Aug 2011 08:16:54 -0700

This would be a slightly different problem.

Josh,
I think this patch will fix the problem. Please report results if you try it.


diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
index 8f245cc..acf1172 100644
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -9223,6 +9223,7 @@ extern int job_requeue (uid_t uid, uint32_t job_id, slurm_
        job_ptr->suspend_time = (time_t) 0;
        job_ptr->tot_sus_time = (time_t) 0;
        job_ptr->restart_cnt++;
+       _set_job_prio(job_ptr);
        /* Since the job completion logger removes the submit we need
         * to add it again. */
        acct_policy_add_job_submit(job_ptr);
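
For readers who do not have the slurmctld source at hand, the effect of the
added line can be sketched in a small standalone model. This is only an
illustration under stated assumptions, not Slurm code: struct toy_job,
struct toy_sched, toy_set_job_prio() and toy_requeue() are hypothetical
names, and the counter-based priority is a stand-in for whatever
_set_job_prio() actually computes. The point shown is that requeue clears the
per-run state and then assigns a fresh priority, so a value changed while the
job was running (for example by an external scheduler) is not carried over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a job record; NOT Slurm's struct job_record. */
struct toy_job {
    uint32_t job_id;
    uint32_t priority;
    uint32_t restart_cnt;
    time_t   suspend_time;
    time_t   tot_sus_time;
};

/* Hypothetical priority source: hands out decreasing values, so a job
 * that is (re)queued later sorts behind jobs queued earlier. */
struct toy_sched {
    uint32_t next_prio;
};

/* Assumed analogue of _set_job_prio(): assign a fresh priority. */
static void toy_set_job_prio(struct toy_sched *sched, struct toy_job *job)
{
    assert(sched != NULL && job != NULL);
    job->priority = sched->next_prio--;
}

/* Requeue: reset per-run state, count the restart, recompute priority. */
static void toy_requeue(struct toy_sched *sched, struct toy_job *job)
{
    assert(sched != NULL && job != NULL);
    job->suspend_time = (time_t) 0;
    job->tot_sus_time = (time_t) 0;
    job->restart_cnt++;
    toy_set_job_prio(sched, job);   /* mirrors the line the patch adds */
}
```

Without the toy_set_job_prio() call inside toy_requeue(), the job would go
back into the queue with whatever priority it had at the moment it was
requeued.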


Quoting Pär Andersson <[email protected]>:

Hi Josh,

Josh England <[email protected]> writes:

We're running slurm-2.2.4 on CentOS-5.5 using sched/wiki to interface to
a custom scheduler, and there seems to be a bug happening anytime a job
is requeued in slurm (either manually or due to node failure).


This sounds familiar. I think this is the same issue that we found and
fixed for sched/wiki2 back in May (included in 2.2.6).

Please look at this thread:

https://groups.google.com/group/slurm-devel/browse_thread/thread/8c2e072f94873103?pli=1


The fix is a one-line change that resets a job's priority in
/src/plugins/sched/wiki2/job_requeue.c when requeued:

https://github.com/paran1/slurm/commit/8212b71ec7480cf8bf292fefdb5547bc4a79dbc2

You most likely have to add something very similar to the sched/wiki
plugin code.

Regards,
Pär Andersson
NSC
