.. or am I missing something? Our slurm.conf is attached to this mail.
I have 2 partitions in my slurm.conf, "devel" and "preemp". devel has no preemption and preemp has PreemptMode=CANCEL. The system has two nodes. If I submit a job (1) requesting 1 node in devel, it starts normally. If I then submit another job (2) requesting 2 nodes, it is queued with reason "Resources".
I then submit a job (3) to preemp, which backfill starts. Now the more interesting part: if I submit a job (4) to the devel partition that fits in the hole left by jobs 1 and 2, the preemptible job 3 is not cancelled.
If I cancel job 3, job 4 will start, _or_ if I cancel job 2, job 4 will preempt job 3 and start.
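The submission order above can be sketched as a small script. This is only an illustration: job1.sh to job4.sh are hypothetical file names for the four submit scripts, and by default the script only prints the commands instead of running them. On a real system you would also wait until backfill has started job 3 before submitting job 4.

```shell
#!/bin/bash
# Sketch of the submission order described above.
# job1.sh .. job4.sh are hypothetical names for the four submit scripts.
# Set DRY_RUN=0 to really submit; the default only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reproduce() {
    run sbatch job1.sh            # devel, 1 node, exclusive: starts
    run sbatch job2.sh            # devel, 2 nodes, exclusive: pending (Resources)
    run sbatch job3.sh            # preemp, 1 node: started by backfill
    run sbatch job4.sh            # devel, 1 node: expected to preempt job 3
    run squeue -o "%i %P %T %r"   # job id, partition, state, reason
}

reproduce
```

The squeue format string lists job id, partition, state and pending reason, which is enough to see whether job 3 is still running while job 4 waits.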
In the cons_res log I see that it seems to have found a place for job 4 by removing job 3, but nothing more happens (see log below).
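To follow only the decisions cons_res makes, the log can be filtered down to the "evaluating", "test N pass/fail" and "_rm_job_from_res" entries. A minimal sketch, assuming one log entry per line as slurmctld writes them; the function name cons_res_trace is made up for this example.

```shell
#!/bin/bash
# Print only the cons_res entries that show what was evaluated, which
# tests passed or failed, and which jobs were removed for the test.
# Assumes one log entry per line. cons_res_trace is a hypothetical helper.
cons_res_trace() {
    local logfile="$1"
    grep -E 'cons_res: (cr_job_test: (evaluating|test [0-9]+)|_rm_job_from_res:)' "$logfile"
}

# Usage (path taken from SlurmctldLogFile in the attached slurm.conf):
#   cons_res_trace /var/log/slurm/slurmctld.log
```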
The relevant parts of the submit scripts for each job (1-4) are also attached. Am I missing something important here, or is this a bug?

Best regards,
Magnus

----8<--- slurmctld.log -- Job 3 is here job 241 --->8---
[2013-02-12T14:54:48+01:00] debug2: backfill: entering _try_sched for job 241.
[2013-02-12T14:54:48+01:00] debug2: select_p_job_test for job 241
[2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: job 241 node_req 64000 mode 2
[2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: min_n 1 max_n 1 req_n 1 avail_n 2
[2013-02-12T14:54:48+01:00] node:t-cn1033 cpus:48 c:6 s:8 t:1 mem:129000 a_mem:120000 state:1
[2013-02-12T14:54:48+01:00] node:t-cn1034 cpus:48 c:6 s:8 t:1 mem:129000 a_mem:120000 state:64000
[2013-02-12T14:54:48+01:00] part:devel rows:1 pri:30
[2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 48-95
[2013-02-12T14:54:48+01:00] part:core rows:1 pri:20
[2013-02-12T14:54:48+01:00] part:preemp rows:1 pri:10
[2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 0-47
[2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1033 non-sharing
[2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in exclusive use
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job 241 on 0 nodes
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 fail: insufficient resources
[2013-02-12T14:54:48+01:00] debug3: cons_res: _rm_job_from_res: job 238 action 0
[2013-02-12T14:54:48+01:00] DEBUG: Dump job_resources: nhosts 1 cb 0-47
[2013-02-12T14:54:48+01:00] debug3: cons_res: removed job 238 from part preemp row 0
[2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in exclusive use
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job 241 on 1 nodes
[2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus on t-cn1033(0), mem 0/129000
[2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1 b=0 e=0 r=-1
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 pass - job fits on given resources
[2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus on t-cn1033(0), mem 0/129000
[2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1 b=0 e=0 r=-1
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass - idle resources found
[2013-02-12T14:54:48+01:00] no job_resources info for job 241
[2013-02-12T14:54:48+01:00] debug2: select_p_job_test for job 241
[2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: job 241 node_req 1 mode 2
[2013-02-12T14:54:48+01:00] cons_res: select_p_job_test: min_n 1 max_n 1 req_n 1 avail_n 2
[2013-02-12T14:54:48+01:00] node:t-cn1033 cpus:48 c:6 s:8 t:1 mem:129000 a_mem:120000 state:1
[2013-02-12T14:54:48+01:00] node:t-cn1034 cpus:48 c:6 s:8 t:1 mem:129000 a_mem:120000 state:64000
[2013-02-12T14:54:48+01:00] part:devel rows:1 pri:30
[2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 48-95
[2013-02-12T14:54:48+01:00] part:core rows:1 pri:20
[2013-02-12T14:54:48+01:00] part:preemp rows:1 pri:10
[2013-02-12T14:54:48+01:00] row0: num_jobs 1: bitmap: 0-47
[2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in exclusive use
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job 241 on 1 nodes
[2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 0 cpus on t-cn1033(1), mem 120000/129000
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 fail: insufficient resources
[2013-02-12T14:54:48+01:00] debug3: cons_res: _rm_job_from_res: job 238 action 0
[2013-02-12T14:54:48+01:00] DEBUG: Dump job_resources: nhosts 1 cb 0-47
[2013-02-12T14:54:48+01:00] debug3: cons_res: removed job 238 from part preemp row 0
[2013-02-12T14:54:48+01:00] debug3: cons_res: _vns: node t-cn1034 in exclusive use
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: evaluating job 241 on 1 nodes
[2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus on t-cn1033(0), mem 0/129000
[2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1 b=0 e=0 r=-1
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 0 pass - job fits on given resources
[2013-02-12T14:54:48+01:00] cons_res: _can_job_run_on_node: 48 cpus on t-cn1033(0), mem 0/129000
[2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1 b=0 e=0 r=-1
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass - idle resources found
[2013-02-12T14:54:48+01:00] no job_resources info for job 241
[2013-02-12T14:54:48+01:00] debug2: Testing job time limits and checkpoints
----8<---

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
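The "bitmap:" ranges in the log can be read against the node definition in the attached slurm.conf (Sockets=8, CoresPerSocket=6, so 48 cores per node). Below is a small sketch of that arithmetic. It assumes the bitmap is a global core index with nodes counted in configuration order, which the log does not state outright; node_index_for_core is a made-up helper name.

```shell
#!/bin/bash
# Values from the NodeName line in the attached slurm.conf.
SOCKETS=8
CORES_PER_SOCKET=6
CORES_PER_NODE=$((SOCKETS * CORES_PER_SOCKET))   # 48

# Map a core index from a "bitmap:" range to a node index.
# Assumption: cores are numbered globally, node after node.
node_index_for_core() {
    echo $(( $1 / CORES_PER_NODE ))
}

node_index_for_core 0     # prints 0
node_index_for_core 47    # prints 0
node_index_for_core 48    # prints 1
node_index_for_core 95    # prints 1
```

Under that assumption, 0-47 (part:preemp) is the first node, t-cn1033, and 48-95 (part:devel) is the second, t-cn1034, which agrees with the "48 cpus on t-cn1033(0)" and "t-cn1034 in exclusive use" entries.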
#
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-kvm
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=YES
MailProg=/usr/bin/mail
MpiDefault=openmpi
MpiParams=ports=12000-12999
ProctrackType=proctrack/cgroup
PropagateResourceLimitsExcept=CPU,MEMLOCK
ReturnToService=1
SlurmctldPort=6817
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup,task/affinity
TmpFs=/scratch
UsePAM=1
HealthCheckInterval=3600
HealthCheckProgram=/var/conf/slurm/hpc2n-healthcheck
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=60
# SCHEDULING
DefMemPerCPU=2500
FastSchedule=2
MaxMemPerCPU=2500
SchedulerType=sched/backfill
SchedulerParameters=max_job_bf=2000,bf_window=20160,default_queue_depth=2000
#
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK
# JOB PRIORITY
PriorityType=priority/multifactor
PriorityDecayHalfLife=50-0
PriorityWeightFairshare=1000000
PriorityWeightPartition=10000
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=slurm-kvm
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=slurmtestcluster
DebugFlags=CPU_Bind
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=7
SlurmdDebug=7
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
# COMPUTE NODES
# DEVEL
NodeName=t-cn[1033-1034] RealMemory=129000 Sockets=8 CoresPerSocket=6
# Partition Configurations
PartitionName=devel Nodes=t-cn103[3,4] Default=YES DefaultTime=30:00 MaxTime=5-0 Priority=30 PreemptMode=OFF
PartitionName=core Nodes=t-cn103[3,4] DefaultTime=30:00 MaxTime=5-0 Priority=20 PreemptMode=OFF
PartitionName=preemp Nodes=t-cn103[3,4] Priority=10 PreemptMode=CANCEL GraceTime=15
PreemptType=preempt/partition_prio
PreemptMode=CANCEL
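Since preempt/partition_prio decides by partition priority, it helps to see Priority and PreemptMode side by side for each partition. A minimal sketch that reads them from a slurm.conf file; it assumes each PartitionName definition sits on a single line, and partition_summary is a made-up helper name, not a Slurm tool.

```shell
#!/bin/bash
# Print "name priority preemptmode" for every partition in a slurm.conf.
# Assumes each PartitionName definition is on one line; "-" marks a
# key that is not set on that line. partition_summary is hypothetical.
partition_summary() {
    awk '
    /^PartitionName=/ {
        name = ""; prio = "-"; mode = "-"
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "PartitionName")    name = kv[2]
            else if (kv[1] == "Priority")    prio = kv[2]
            else if (kv[1] == "PreemptMode") mode = kv[2]
        }
        print name, prio, mode
    }' "$1"
}

# Usage: partition_summary /path/to/slurm.conf
```

On a running system, `scontrol show partition` reports the values the controller is actually using.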
#!/bin/bash
#SBATCH -p devel
#SBATCH --time=05:00:00
#SBATCH -N1
#SBATCH --exclusive
srun -n1 ./job.pl

#!/bin/bash
#SBATCH -p devel
#SBATCH --time=01:00:00
#SBATCH -N2
#SBATCH --exclusive
srun -n1 ./job.pl

#!/bin/bash
#SBATCH -p preemp
#SBATCH --time=01:00:00
#SBATCH -N1
#SBATCH -n48
srun -n1 ./job.pl

#!/bin/bash
#SBATCH -p devel
#SBATCH --time=04:00:00
#SBATCH --signal USR1@60
#SBATCH -N1
#SBATCH -n48
# #SBATCH --exclusive
# Not running under Slurm yet: submit this script with sbatch and exit.
if [ -z "$SLURM_JOBID" ]; then
    echo "Using sbatch to submit job"
    sbatch "$0"
    exit 0
fi
srun -n1 ./job.pl
