Dear List,

I have a problem with job preemption: Slurm kills low-priority jobs instead of suspending them. I have read that jobs started via salloc/srun get killed rather than suspended, but as far as I can tell, the users submit their jobs via sbatch.
The cluster is using Slurm 14.11.4 at the moment.

On the head node, I get the following log entries:

Jul 06 12:16:01 snowden slurmctld[1157]: email msg to [email protected]: SLURM Job_id=1017179 Name=AcPro-blebb_F_c-2 Ended, Run time 01:03:08, PREEMPTED, ExitCode 0
Jul 06 12:16:01 snowden slurmctld[1157]: _job_signal: 9 of running JobID=1017179 State=0x8008 NodeCnt=1 successful 0x8008
Jul 06 12:16:01 snowden slurmctld[1157]: preempted job 1017179 had to be killed

On the compute node:

Jul 06 11:12:54 leak4 slurmd[342]: task_p_slurmd_batch_request: 1017179
Jul 06 11:12:54 leak4 slurmd[342]: task/affinity: job 1017179 CPU input mask for node: 0x0002
Jul 06 11:12:54 leak4 slurmd[342]: task/affinity: job 1017179 CPU final HW mask for node: 0x0002
Jul 06 11:12:55 leak4 slurmd[342]: _run_prolog: prolog with lock for job 1017179 ran for 1 seconds
Jul 06 11:12:56 leak4 slurmd[342]: Launching batch job 1017179 for UID 1611
Jul 06 11:12:56 leak4 [539]: [1017179]: pam_unix(slurm:session):session opened for user xxx by (uid=0)
Jul 06 11:13:00 leak4 slurmd[342]: launch task 1017179.0 request from [email protected] (port 48084)
Jul 06 11:13:00 leak4 slurmd[342]: lllp_distribution jobid [1017179] implicit auto binding: cores,one_thread, dist 1
Jul 06 11:13:01 leak4 [940]: [1017179.0]: pam_unix(slurm:session):session opened for user xxx by (uid=0)
[...]
Jul 06 12:16:01 leak4 slurmstepd[940]: error: *** STEP 1017179.0 CANCELLED AT 2015-07-06T12:16:01 DUE TO PREEMPTION on leak4 ***
Jul 06 12:16:01 leak4 slurmstepd[539]: error: *** JOB 1017179 CANCELLED AT 2015-07-06T12:16:01 DUE TO PREEMPTION on leak4 ***

What could be the reason for killing instead of suspending?
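To double-check which Priority and PreemptMode each partition ends up with once the PartitionName=DEFAULT line is taken into account, a small sketch like the following could be used. It is only an illustration: effective_partitions is a hypothetical helper (not part of Slurm), it assumes one partition definition per line with the key spelled exactly "PartitionName", and it only merges the DEFAULT values into the partition lines that follow, without modelling how slurmctld resolves the cluster-wide PreemptMode.

```python
def effective_partitions(conf_text):
    """Return {partition: settings} with PartitionName=DEFAULT values merged in.

    Hypothetical helper for inspecting a slurm.conf; assumes one partition
    definition per line and the key spelling used in the config shown here.
    """
    defaults = {}
    partitions = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line.startswith("PartitionName="):
            continue
        # Split "Key=Value" tokens; values themselves contain no spaces here.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        name = fields.pop("PartitionName")
        if name == "DEFAULT":
            # DEFAULT values apply to the partition lines that follow.
            defaults.update(fields)
        else:
            partitions[name] = {**defaults, **fields}
    return partitions


# Partition lines copied from the slurm.conf in this message.
conf = """\
PartitionName=DEFAULT Shared=FORCE:1 Nodes=leak[1-64] DefaultTime=1:00:00 MaxTime=INFINITE
PartitionName=onenode Priority=1 Default=YES PreemptMode=Suspend,Gang
PartitionName=mpi Priority=100 Default=NO PreemptMode=off Nodes=leak[1-16,41-64]
"""

for name, settings in effective_partitions(conf).items():
    print(name, settings.get("Priority"), settings.get("PreemptMode"), settings.get("Nodes"))
```

The same text could be read from the actual slurm.conf file instead of the inline string; the path to that file depends on the installation.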
Best,
Olaf

ControlMachine=snowden
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MailProg=/usr/bin/mail
MaxJobCount=5000
MpiDefault=none
MpiParams=ports=12000-12999
ProctrackType=proctrack/linuxproc
RebootProgram=/sbin/reboot
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/tmp/slurm/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity
UsePAM=1
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=4000
FastSchedule=0
SchedulerTimeSlice=60
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
SchedulerParameters=max_job_bf=100,bf_interval=60
PreemptType=preempt/partition_prio
PreemptMode=Suspend,Gang
PriorityType=priority/multifactor
PriorityFlags=TICKET_BASED
PriorityDecayHalfLife=14-0
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
AccountingStorageEnforce=limits
AccountingStorageHost=localhost
AccountingStorageLoc=/var/log/slurm/accounting.txt
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp_none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=verbose
SlurmdDebug=3
NodeName=leak[1-40] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=leak[41-48] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
NodeName=leak[49-56] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=leak[57-64] Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN
PartitionName=DEFAULT Shared=FORCE:1 Nodes=leak[1-64] DefaultTime=1:00:00 MaxTime=INFINITE
PartitionName=onenode Priority=1 Default=YES PreemptMode=Suspend,Gang
PartitionName=mpi Priority=100 Default=NO PreemptMode=off Nodes=leak[1-16,41-64]

-- 
/************************************************************
 * Olaf Leidinger <[email protected]>
 * Theoretische Physik - Universität des Saarlandes
 * Geb. E2.6 - Raum 4.01
 * Tel. (0/+49) 681 302-57416
 ************************************************************/
