Thank you for the clear explanation of the problem and files. The fix will be in our next release. You can get it now from GitHub:
https://github.com/SchedMD/slurm/commit/b0f3b65194bc6ad626629a5ba1a665298257c4ec

Quoting Magnus Jonsson <[email protected]>:

> .. or do I miss something.
>
> Our slurm.conf is attached to the mail.
>
> I have 2 partitions in my slurm.conf "devel" and "preemp".
>
> Devel has no preemption and preemp has "CANCEL".
>
> The system has two nodes.
>
> If I put in a job(1) with 1 node in devel it starts normally.
> I put in another job(2) requesting 2 nodes; it gets into the queue
> with "Resources".
>
> I put in a job(3) into preemp that backfill starts.
>
> Now the more interesting part.
>
> If I put in a job(4) into the devel partition that fits in the hole
> from job 1 and 2, the preemptive job 3 is not cancelled.
>
> If I cancel job 3, job 4 will start _or_ if I cancel job 2, job 4 will
> preempt job 3 and start.
>
> In the log from cons_res I see that it seems to have found a place for
> job 4 by removing job 3, but nothing more happens (see log below).
>
> Relevant part of the submit scripts for each job (1-4) is also attached.
>
> Am I missing something important here or is this a bug?
>
> Best regards,
> Magnus
>
> ----8<--- slurmctld.log -- Job 3 is here job 241 --->8---
> [2013-02-12T14:54:48+01:00] debug2: backfill: entering _try_sched
> for job 241.
> [2013-02-12T14:54:48+01:00] debug2: select_p_job_test for job 241
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: job 241
> node_req 64000 mode 2
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: min_n 1
> max_n 1 req_n 1 avail_n 2
> [2013-02-12T14:54:48+01:00] node:t-cn1033 cpus:48 c:6 s:8 t:1
> mem:129000 a_mem:120000 state:1
> [2013-02-12T14:54:48+01:00] node:t-cn1034 cpus:48 c:6 s:8 t:1
> mem:129000 a_mem:120000 state:64000
> [2013-02-12T14:54:48+01:00] part:devel rows:1 pri:30
> [2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 48-95
> [2013-02-12T14:54:48+01:00] part:core rows:1 pri:20
> [2013-02-12T14:54:48+01:00] part:preemp rows:1 pri:10
> [2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1033 non-sharing
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job
> 241 on 0 nodes
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 fail:
> insufficient resources
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _rm_job_from_res: job
> 238 action 0
> [2013-02-12T14:54:48+01:00] DEBUG: Dump job_resources: nhosts 1 cb 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: removed job 238 from
> part preemp row 0
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job
> 241 on 1 nodes
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 pass - job
> fits on given resources
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass -
> idle resources found
> [2013-02-12T14:54:48+01:00] no job_resources info for job 241
> [2013-02-12T14:54:48+01:00] debug2: select_p_job_test for job 241
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: job 241
> node_req 1 mode 2
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: min_n 1
> max_n 1 req_n 1 avail_n 2
> [2013-02-12T14:54:48+01:00] node:t-cn1033 cpus:48 c:6 s:8 t:1
> mem:129000 a_mem:120000 state:1
> [2013-02-12T14:54:48+01:00] node:t-cn1034 cpus:48 c:6 s:8 t:1
> mem:129000 a_mem:120000 state:64000
> [2013-02-12T14:54:48+01:00] part:devel rows:1 pri:30
> [2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 48-95
> [2013-02-12T14:54:48+01:00] part:core rows:1 pri:20
> [2013-02-12T14:54:48+01:00] part:preemp rows:1 pri:10
> [2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job
> 241 on 1 nodes
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 0 cpus
> on t-cn1033(1), mem 120000/129000
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 fail:
> insufficient resources
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _rm_job_from_res: job
> 238 action 0
> [2013-02-12T14:54:48+01:00] DEBUG: Dump job_resources: nhosts 1 cb 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: removed job 238 from
> part preemp row 0
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job
> 241 on 1 nodes
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 pass - job
> fits on given resources
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass -
> idle resources found
> [2013-02-12T14:54:48+01:00] no job_resources info for job 241
> [2013-02-12T14:54:48+01:00] debug2: Testing job time limits and checkpoints
>
> ----8<---
> --
> Magnus Jonsson, Developer, HPC2N, Umeå Universitet
>
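
For anyone reading along, the sequence reported in the quoted log (test fails, the preemptable job is removed, test passes) can be pulled out of a long slurmctld.log excerpt with a few lines of Python. This is only a rough sketch, not part of Slurm: the function name summarize and the sample text are made up for illustration, and it assumes the messages keep the "cr_job_test: test N pass|fail" and "_rm_job_from_res: job N action N" wording seen above, possibly wrapped across ">"-quoted lines.

```python
import re

# Matches the two cons_res messages of interest in the quoted log.
# Assumption: the wording is exactly as in the excerpt above.
EVENT_RE = re.compile(
    r"cr_job_test: test (?P<test>\d+) (?P<verdict>pass|fail)"
    r"|_rm_job_from_res: job (?P<job>\d+) action (?P<action>\d+)"
)


def summarize(log_text):
    """Return the test verdicts and job removals in the order they appear."""
    # Drop mail quote markers and collapse whitespace so that a message
    # wrapped across two quoted lines is matched as a single string.
    flat = " ".join(log_text.replace(">", " ").split())
    events = []
    for m in EVENT_RE.finditer(flat):
        if m.group("test") is not None:
            events.append(("test", int(m.group("test")), m.group("verdict")))
        else:
            events.append(("remove", int(m.group("job")), int(m.group("action"))))
    return events


# Hypothetical sample: three messages copied from the quoted log.
SAMPLE = """\
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 fail:
> insufficient resources
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _rm_job_from_res: job
> 238 action 0
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 pass - job
> fits on given resources
"""

for event in summarize(SAMPLE):
    print(event)
```

Run against the full excerpt, this should show the same pattern twice: the job only fits once job 238 has been taken out of the preemp row, which matches the behaviour described in the report.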
