Hi Pär,

The problem fixed by the earlier patch was that when a job was suspended or requeued, its priority would be recalculated based upon the new submit/requeue time rather than the original time. We believed that preserving the original time would be better, although it caused problems for use with Moab, as you observed. I believe that your patch is probably the best solution and have applied it.

Moe

________________________________________
From: [email protected] [[email protected]] On Behalf Of Pär Andersson [[email protected]]
Sent: Friday, May 13, 2011 7:04 AM
To: [email protected]
Subject: [slurm-dev] Patch: fix wiki2 requeue problem.

Hi,

We discovered a scheduling problem on a cluster recently upgraded from 2.1.15 to 2.2.5, running Moab and wiki2. The cluster uses job preemption and requeueing.

Requeueable jobs that are in state PENDING after having been requeued at least once effectively block Moab from starting any other job.

I believe that the root cause is that when a job is requeued it keeps Priority=100000000, instead of being held.

In the following example 1495705 gets requeued and is left pending. Moab then tries to start 1495733, which fails.

slurmctld.log:

1495705 is requeued:

[2011-05-13T13:05:53] wiki msg recv:CK=64e731ac73fec193 TS=1305284753 AUTH=moab DT=CMD=REQUEUEJOB ARG=1495705
[2011-05-13T13:05:53] wiki: requeued job 1495705
[2011-05-13T13:05:53] wiki msg send:CK=7653740ce9476756 TS=1305284753 AUTH=slurm DT=SC=0 RESPONSE=job 1495705 requeued successfully
[2011-05-13T13:05:53] completing job 1495705
[2011-05-13T13:06:06] requeue batch job 1495705
...
[2011-05-13T13:08:42] wiki msg recv:CK=7189ead250b2fd76 TS=1305284922 AUTH=moab DT=CMD=STARTJOB ARG=1495733 TASKLIST=n212
[2011-05-13T13:08:42] error: wiki: Could not start job 1495733(n212): Resources
[2011-05-13T13:08:42] wiki msg send:CK=4867c632e7c67a1a TS=1305284922 AUTH=slurm DT=SC=-913 RESPONSE=Could not start job 1495733(n212): Resources

Scheduler log lines about jobs 1495705 and 1495733 from the same time period:

[2011-05-13T13:04:34] sched: JobId=1495705. State=PENDING. Reason=JobHeldAdmin. Priority=0.
[2011-05-13T13:04:35] sched: JobId=1495705 initiated
[2011-05-13T13:04:35] sched: Allocate JobId=1495705 NodeList=n[302,304-305,307,310-315] #CPUs=80
[2011-05-13T13:04:35] sched: _slurm_rpc_job_step_create: StepId=1495705.0 n[302,304-305,307,310-315] usec=267
[2011-05-13T13:04:35] sched: _slurm_rpc_step_complete StepId=1495705.0 usec=12
[2011-05-13T13:06:23] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
...
[2011-05-13T13:08:38] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:08:40] sched: JobId=1495733. State=PENDING. Reason=JobHeldAdmin. Priority=0.
[2011-05-13T13:08:40] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:08:41] sched: JobId=1495733. State=PENDING. Reason=JobHeldAdmin. Priority=0.
[2011-05-13T13:08:41] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:08:42] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:08:42] sched: JobId=1495733. State=PENDING. Reason=Resources. Priority=100000000. Partition=nehalem.
[2011-05-13T13:08:43] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:08:43] sched: JobId=1495733. State=PENDING. Reason=Resources. Priority=100000000. Partition=nehalem.
[2011-05-13T13:09:10] sched: JobId=1495733. State=PENDING. Reason=JobHeldAdmin. Priority=0.
[2011-05-13T13:09:10] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:09:44] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:09:44] sched: JobId=1495733. State=PENDING. Reason=Resources. Priority=100000000. Partition=nehalem.
[2011-05-13T13:09:44] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.
[2011-05-13T13:09:44] sched: JobId=1495733. State=PENDING. Reason=Resources. Priority=100000000. Partition=nehalem.

I have made a patch that seems to fix the problem for us. See commit 8212b71ec7480cf8bf292fefdb5547bc4a79dbc2 on GitHub:

https://github.com/paran1/slurm/commit/8212b71ec7480cf8bf292fefdb5547bc4a79dbc2

After creating that patch I found the following commit, which sounds like it might have introduced this problem.

commit 4059a9232bb0415bb40940c42fc9fbbc54a5c5a6
Author: Moe Jette <[email protected]>
Date:   Wed Mar 23 22:04:44 2011 +0000

    -- Do not reset a job's priority when requeued or suspended. Fixes bug reported by Bill Brophy, Bull.

What bug did this fix? Would reverting it be more correct than fixing the problem in wiki2, as my patch does?

Looking at this also made me realize that the priority probably needs to be reset to 0 in src/plugins/sched/wiki2/suspend_job.c as well, but we don't use job suspend, so unfortunately I can't test that.

Regards,

Pär Andersson
NSC
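
For anyone who wants to check their own scheduler log for the same pattern, here is a minimal Python sketch that pulls out the points where a job's priority changes. It is not part of SLURM or Moab. The regular expression is an assumption inferred only from the "sched:" lines quoted in this thread, and the function name priority_changes is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the "sched:" lines quoted in this thread.
# This is an assumption, not a documented slurmctld log format.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] sched: JobId=(?P<job>\d+)\. "
    r"State=(?P<state>\w+)\. Reason=(?P<reason>\w+)\. "
    r"Priority=(?P<prio>\d+)\."
)


def priority_changes(lines):
    """Return {job_id: [(timestamp, reason, priority), ...]}.

    Only entries whose priority differs from the job's previous
    entry are kept, so repeated identical lines collapse to one.
    """
    history = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip "initiated", "Allocate" and other line types
        job = int(m.group("job"))
        entry = (m.group("ts"), m.group("reason"), int(m.group("prio")))
        if not history[job] or history[job][-1][2] != entry[2]:
            history[job].append(entry)
    return dict(history)


# A few lines copied from the log excerpt in this message.
sample = [
    "[2011-05-13T13:04:34] sched: JobId=1495705. State=PENDING. Reason=JobHeldAdmin. Priority=0.",
    "[2011-05-13T13:04:35] sched: JobId=1495705 initiated",
    "[2011-05-13T13:06:23] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.",
    "[2011-05-13T13:08:38] sched: JobId=1495705. State=PENDING. Reason=Resources. Priority=100000000. Partition=r_nehalem.",
]

changes = priority_changes(sample)
for job, entries in changes.items():
    for ts, reason, prio in entries:
        print(job, ts, reason, prio)
```

On the sample lines this reports two entries for 1495705: held with Priority=0 before it first ran, then Priority=100000000 after the requeue, which is the state described above as blocking Moab.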
