Dear List,

I have a problem with job preemption: Slurm kills low-priority jobs instead
of suspending them. I have read that jobs started via salloc/srun get killed
rather than suspended, but as far as I can tell, the users submit their jobs
via sbatch.

The cluster is currently running Slurm 14.11.4.

On the head node, I get the following log entry:

Jul 06 12:16:01 snowden slurmctld[1157]: email msg to [email protected]: SLURM Job_id=1017179 Name=AcPro-blebb_F_c-2 Ended, Run time 01:03:08, PREEMPTED, ExitCode 0
Jul 06 12:16:01 snowden slurmctld[1157]: _job_signal: 9 of running JobID=1017179 State=0x8008 NodeCnt=1 successful 0x8008
Jul 06 12:16:01 snowden slurmctld[1157]: preempted job 1017179 had to be killed


On the compute node:

Jul 06 11:12:54 leak4 slurmd[342]: task_p_slurmd_batch_request: 1017179
Jul 06 11:12:54 leak4 slurmd[342]: task/affinity: job 1017179 CPU input mask for node: 0x0002
Jul 06 11:12:54 leak4 slurmd[342]: task/affinity: job 1017179 CPU final HW mask for node: 0x0002
Jul 06 11:12:55 leak4 slurmd[342]: _run_prolog: prolog with lock for job 1017179 ran for 1 seconds
Jul 06 11:12:56 leak4 slurmd[342]: Launching batch job 1017179 for UID 1611
Jul 06 11:12:56 leak4 [539]: [1017179]: pam_unix(slurm:session):session opened for user xxx by (uid=0)
Jul 06 11:13:00 leak4 slurmd[342]: launch task 1017179.0 request from [email protected] (port 48084)
Jul 06 11:13:00 leak4 slurmd[342]: lllp_distribution jobid [1017179] implicit auto binding: cores,one_thread, dist 1
Jul 06 11:13:01 leak4 [940]: [1017179.0]: pam_unix(slurm:session):session opened for user xxx by (uid=0)
[...]
Jul 06 12:16:01 leak4 slurmstepd[940]: error: *** STEP 1017179.0 CANCELLED AT 2015-07-06T12:16:01 DUE TO PREEMPTION on leak4 ***
Jul 06 12:16:01 leak4 slurmstepd[539]: error: *** JOB 1017179 CANCELLED AT 2015-07-06T12:16:01 DUE TO PREEMPTION on leak4 ***

What could cause the jobs to be killed instead of suspended?

Best,

Olaf

ControlMachine=snowden
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MailProg=/usr/bin/mail
MaxJobCount=5000
MpiDefault=none
MpiParams=ports=12000-12999

ProctrackType=proctrack/linuxproc

RebootProgram=/sbin/reboot
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/tmp/slurm/slurmd
SwitchType=switch/none

TaskPlugin=task/affinity

UsePAM=1
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=4000
FastSchedule=0
SchedulerTimeSlice=60
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
SchedulerParameters=max_job_bf=100,bf_interval=60

PreemptType=preempt/partition_prio
PreemptMode=Suspend,Gang

PriorityType=priority/multifactor
PriorityFlags=TICKET_BASED
PriorityDecayHalfLife=14-0
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000

AccountingStorageEnforce=limits
AccountingStorageHost=localhost
AccountingStorageLoc=/var/log/slurm/accounting.txt
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES

ClusterName=cluster
JobCompType=jobcomp_none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=verbose
SlurmdDebug=3

NodeName=leak[1-40]   Sockets=2 CoresPerSocket=8   ThreadsPerCore=1 State=UNKNOWN
NodeName=leak[41-48]  Sockets=2 CoresPerSocket=10  ThreadsPerCore=2 State=UNKNOWN
NodeName=leak[49-56]  Sockets=2 CoresPerSocket=8   ThreadsPerCore=2 State=UNKNOWN
NodeName=leak[57-64]  Sockets=2 CoresPerSocket=14  ThreadsPerCore=1 State=UNKNOWN

PartitionName=DEFAULT Shared=FORCE:1 Nodes=leak[1-64] DefaultTime=1:00:00 MaxTime=INFINITE
PartitionName=onenode Priority=1   Default=YES PreemptMode=Suspend,Gang
PartitionName=mpi     Priority=100 Default=NO  PreemptMode=off  Nodes=leak[1-16,41-64]
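
To make the preemption-related settings above easier to compare at a glance, here is a minimal Python sketch that pulls them out of slurm.conf-style text. It is not part of Slurm; the function name preempt_settings and the SAMPLE constant are made up for illustration. It assumes each entry sits on a single unwrapped line, that PartitionName=DEFAULT values carry over to the partition lines that follow, and that a partition without its own PreemptMode falls back to the global one.

```python
# Hypothetical helper, not a Slurm tool: summarize preemption settings
# found in slurm.conf-style text (one entry per line, Key=Value pairs).

SAMPLE = """\
PreemptType=preempt/partition_prio
PreemptMode=Suspend,Gang
PartitionName=DEFAULT Shared=FORCE:1 Nodes=leak[1-64] DefaultTime=1:00:00 MaxTime=INFINITE
PartitionName=onenode Priority=1   Default=YES PreemptMode=Suspend,Gang
PartitionName=mpi     Priority=100 Default=NO  PreemptMode=off  Nodes=leak[1-16,41-64]
"""


def preempt_settings(conf_text):
    """Return (PreemptType, {partition: {priority, preemptmode, shared}})."""
    global_opts = {}    # cluster-wide Key=Value settings
    part_defaults = {}  # values taken from PartitionName=DEFAULT lines
    partitions = {}     # per-partition settings, defaults already merged in

    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Keys are compared case-insensitively, so lowercase them.
        opts = {}
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                opts[key.lower()] = value
        if "partitionname" in opts:
            name = opts.pop("partitionname")
            if name.upper() == "DEFAULT":
                part_defaults.update(opts)
            else:
                partitions[name] = {**part_defaults, **opts}
        elif "nodename" not in opts:
            global_opts.update(opts)

    report = {}
    for name, opts in partitions.items():
        report[name] = {
            "priority": opts.get("priority"),
            # Assumed fallback: the global PreemptMode applies if unset here.
            "preemptmode": opts.get("preemptmode",
                                    global_opts.get("preemptmode")),
            "shared": opts.get("shared"),
        }
    return global_opts.get("preempttype"), report


if __name__ == "__main__":
    preempt_type, report = preempt_settings(SAMPLE)
    print("PreemptType:", preempt_type)
    for name, info in report.items():
        print(name, info)
```

This only restates what is in the file; the values slurmctld actually uses would still need to be checked on the running system.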



-- 
/************************************************************
 * Olaf Leidinger <[email protected]>
 * Theoretische Physik - Universität des Saarlandes
 * Geb. E2.6 - Raum 4.01
 * Tel. (0/+49) 681 302-57416
 ************************************************************/
