Thank you for the clear explanation of the problem and the attached  
files. The fix will be in our next release. You can get it now from GitHub:

https://github.com/SchedMD/slurm/commit/b0f3b65194bc6ad626629a5ba1a665298257c4ec

Quoting Magnus Jonsson <[email protected]>:

> .. or am I missing something?
>
> Our slurm.conf attached to the mail.
>
> I have 2 partitions in my slurm.conf "devel" and "preemp".
>
> Devel has no preemption and preemp has "CANCEL".
>
> The system has two nodes.
>
> If I put in a job(1) with 1 node in devel it starts normally.
> I put in another job(2) requesting 2 nodes; it gets into the queue  
> with "Resources".
>
> I put a job(3) into preemp, which backfill starts.
>
> Now the more interesting part.
>
> If I put a job(4) into the devel partition that fits in the hole  
> left by jobs 1 and 2, the preemptable job 3 is not cancelled.
>
> If I cancel job 3, job 4 will start, _or_ if I cancel job 2, job 4 will  
> preempt job 3 and start.
>
> In the log from cons_res I see that it seems to find a place for  
> job 4 by removing job 3, but nothing more happens (see log below).
>
> Relevant part of the submit scripts for each job (1-4) is also attached.
>
> Am I missing something important here or is this a bug?
>
> Best regards,
> Magnus
>
> ----8<--- slurmctld.log -- Job 3 is here job 241 --->8---
> [2013-02-12T14:54:48+01:00] debug2: backfill: entering _try_sched  
> for job 241.
> [2013-02-12T14:54:48+01:00] debug2: select_p_job_test for job 241
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: job 241  
> node_req 64000 mode 2
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: min_n 1  
> max_n 1 req_n 1 avail_n 2
> [2013-02-12T14:54:48+01:00] node:t-cn1033 cpus:48 c:6 s:8 t:1  
> mem:129000 a_mem:120000 state:1
> [2013-02-12T14:54:48+01:00] node:t-cn1034 cpus:48 c:6 s:8 t:1  
> mem:129000 a_mem:120000 state:64000
> [2013-02-12T14:54:48+01:00] part:devel rows:1 pri:30
> [2013-02-12T14:54:48+01:00]   row0: num_jobs 1: bitmap: 48-95
> [2013-02-12T14:54:48+01:00] part:core rows:1 pri:20
> [2013-02-12T14:54:48+01:00] part:preemp rows:1 pri:10
> [2013-02-12T14:54:48+01:00]   row0: num_jobs 1: bitmap: 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1033 non-sharing
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in  
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job  
> 241 on 0 nodes
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 fail:  
> insufficient resources
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _rm_job_from_res: job  
> 238 action 0
> [2013-02-12T14:54:48+01:00] DEBUG: Dump job_resources: nhosts 1 cb 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: removed job 238 from  
> part preemp row 0
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in  
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job  
> 241 on 1 nodes
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus  
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1  
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 pass - job  
> fits on given resources
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus  
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1  
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass -  
> idle resources found
> [2013-02-12T14:54:48+01:00] no job_resources info for job 241
> [2013-02-12T14:54:48+01:00] debug2: select_p_job_test for job 241
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: job 241  
> node_req 1 mode 2
> [2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: min_n 1  
> max_n 1 req_n 1 avail_n 2
> [2013-02-12T14:54:48+01:00] node:t-cn1033 cpus:48 c:6 s:8 t:1  
> mem:129000 a_mem:120000 state:1
> [2013-02-12T14:54:48+01:00] node:t-cn1034 cpus:48 c:6 s:8 t:1  
> mem:129000 a_mem:120000 state:64000
> [2013-02-12T14:54:48+01:00] part:devel rows:1 pri:30
> [2013-02-12T14:54:48+01:00]   row0: num_jobs 1: bitmap: 48-95
> [2013-02-12T14:54:48+01:00] part:core rows:1 pri:20
> [2013-02-12T14:54:48+01:00] part:preemp rows:1 pri:10
> [2013-02-12T14:54:48+01:00]   row0: num_jobs 1: bitmap: 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in  
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job  
> 241 on 1 nodes
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 0 cpus  
> on t-cn1033(1), mem 120000/129000
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 fail:  
> insufficient resources
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _rm_job_from_res: job  
> 238 action 0
> [2013-02-12T14:54:48+01:00] DEBUG: Dump job_resources: nhosts 1 cb 0-47
> [2013-02-12T14:54:48+01:00] debug3: cons_res: removed job 238 from  
> part preemp row 0
> [2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in  
> exclusive use
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job  
> 241 on 1 nodes
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus  
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1  
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 pass - job  
> fits on given resources
> [2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus  
> on t-cn1033(0), mem 0/129000
> [2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1  
> b=0 e=0 r=-1
> [2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass -  
> idle resources found
> [2013-02-12T14:54:48+01:00] no job_resources info for job 241
> [2013-02-12T14:54:48+01:00] debug2: Testing job time limits and checkpoints
>
> ----8<---
> -- 
> Magnus Jonsson, Developer, HPC2N, Umeå Universitet
>

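For readers following the quoted log, the "bitmap" ranges on the partition row lines can be related to node names with a small helper. This is only a sketch: it assumes the core bitmap is laid out node by node in the order the nodes appear in the log, with 48 cores per node (taken from the "cpus:48" lines). The names nodes_for_bitmap and parse_rows are made up for this illustration and are not part of Slurm.

```python
import re

# Assumptions for this sketch (not taken from Slurm itself):
# 48 cores per node, as on the "cpus:48" lines, and a core bitmap
# laid out node by node in the order the nodes appear in the log.
CORES_PER_NODE = 48
NODES = ["t-cn1033", "t-cn1034"]

PART_RE = re.compile(r"part:(\S+) rows:")
ROW_RE = re.compile(r"row\d+: num_jobs \d+: bitmap: (\d+)-(\d+)")


def nodes_for_bitmap(first, last, cores_per_node=CORES_PER_NODE, nodes=NODES):
    """Return the names of the nodes touched by an inclusive core range."""
    first_node = first // cores_per_node
    last_node = last // cores_per_node
    return [nodes[i] for i in range(first_node, last_node + 1)]


def parse_rows(log_text):
    """Map each partition name to the node lists of its occupied rows."""
    result = {}
    part = None
    for line in log_text.splitlines():
        m = PART_RE.search(line)
        if m:
            part = m.group(1)
            result[part] = []
            continue
        m = ROW_RE.search(line)
        if m and part is not None:
            first, last = int(m.group(1)), int(m.group(2))
            result[part].append(nodes_for_bitmap(first, last))
    return result


sample = """\
[2013-02-12T14:54:48+01:00] part:devel rows:1 pri:30
[2013-02-12T14:54:48+01:00]   row0: num_jobs 1: bitmap: 48-95
[2013-02-12T14:54:48+01:00] part:core rows:1 pri:20
[2013-02-12T14:54:48+01:00] part:preemp rows:1 pri:10
[2013-02-12T14:54:48+01:00]   row0: num_jobs 1: bitmap: 0-47
"""

print(parse_rows(sample))
```

Under these assumptions the devel row maps to t-cn1034 and the preemp row to t-cn1033, which matches the log reporting t-cn1034 as "in exclusive use" and job 238 being removed from part preemp to free t-cn1033.
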