Hi dev list,

I’m having trouble getting preemption to work in suspend mode. I have three partitions with priorities of 100, 3, and 1. When users submit jobs to the priority-100 partition, jobs on the lower-priority partitions are not suspended. I have SHARED=FORCE:1 set for all partitions. Any ideas what could be happening?

The default memory per CPU is 2GB, so if users do not explicitly specify their memory usage, will preemption fail if the high-priority job needs more memory than is free on the node?

Thanks!
Ryan

My configuration is:

AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost = berkelium.berkelium
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = YES
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2014-03-10T18:06:50
CacheGroups = 0
CheckpointType = checkpoint/none
ClusterName = cluster
CompleteWait = 0 sec
ControlAddr = berkelium
ControlMachine = berkelium
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerCPU = 2
DisableRootJobs = NO
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/none
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/filetxt
JobCompUser = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 10000
MaxJobId = 4294901760
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 11897
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
PriorityDecayHalfLife = 00:07:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = 0
PriorityFlags = 0
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 1
PriorityWeightFairShare = 5
PriorityWeightJobSize = 0
PriorityWeightPartition = 100
PriorityWeightQOS = 5
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 1
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = slurm(202)
SlurmctldDebug = info
SlurmctldLogFile = (null)
SlurmSchedLogFile = (null)
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.4.3
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /var/spool/slurmsave
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
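
The configuration listing above covers only global settings, so the partition definitions themselves are not shown. As a rough sketch of the setup described in the question, the relevant slurm.conf lines might look something like the following. The partition names (high, medium, low) and the node range are placeholders, not taken from the actual cluster; only the three priorities, the Shared setting, and the two Preempt parameters come from the text above.

```
# Global preemption settings, as reported in the listing above
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Hypothetical partition definitions: names and node range are placeholders
PartitionName=high   Nodes=node[01-04] Priority=100 Shared=FORCE:1
PartitionName=medium Nodes=node[01-04] Priority=3   Shared=FORCE:1
PartitionName=low    Nodes=node[01-04] Priority=1   Shared=FORCE:1
```

This sketch assumes all three partitions overlap on the same nodes; if the real partitions use different node sets, the Nodes= values would differ accordingly.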
