Hi Devs,

Sorry to bother you with this small problem, but I have not been able to figure out what is happening here. We are running a small cluster at HKUST, and I tried to enable preemption so that high-priority jobs suspend low-priority ones. I had previously set up this configuration with an older version of SLURM, but in 2.1.14 I see the following annoying behavior:
When I submit the higher-priority job, it does stop the lower-priority one. But later, after about 60 seconds (the time slice setting), both jobs start running: the suspended low-priority job alternates between running and suspended at 60-second intervals, while the high-priority job runs the whole time. So in alternating 60-second intervals the two jobs overlap.

It looks like it is still doing some kind of gang scheduling, even though I have the max share set to 1 (Shared=FORCE:1), which in theory should disable gang time-slicing and simply preempt low-priority jobs. What is worse, this is a broken gang, since the jobs overlap in a running state. More interesting still, squeue shows the low-priority job as suspended the whole time, but on the compute node it alternates between running and suspended as described.

The behavior I want is for the low-priority job to stay suspended until the high-priority job finishes. Could you help me?

Thanks,
Daniel Silva

AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost = compbio0
AccountingStorageLoc = N/A
AccountingStoragePass = (null)
AccountingStoragePort = 7031
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2010-11-08T20:28:55
CacheGroups = 0
CheckpointType = checkpoint/none
ClusterName = bbrc
CompleteWait = 0 sec
ControlAddr = 10.1.1.1
ControlMachine = compbio
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerCPU = 500
DisableRootJobs = NO
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/none
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPass = (null)
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobCredentialPrivateKey = /etc/slurm/slurm.key
JobCredentialPublicCertificate = /etc/slurm/slurm.cert
JobFileAppend = 0
JobRequeue = 1
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 5000
MaxMemPerCPU = 2000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 18406
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 2
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 60 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = slurm(504)
SlurmctldDebug = 3
SlurmctldLogFile = (null)
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPort = 6817
SlurmctldTimeout = 300 sec
SlurmdDebug = 3
SlurmdLogFile = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.1.14
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /tmp
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
WaitTime = 0 sec

Slurmctld(primary/backup) at compbio/(NULL) are UP/DOWN

PartitionName=ib-high Priority=5 Nodes=node-ib-[1-46] Default=NO Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=chemHuang
PartitionName=ib-default Priority=3 Nodes=node-ib-[1-46] Default=YES Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=ALL
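
The mismatch described above (squeue reports the low-priority job as suspended while its processes keep running on the node) can be cross-checked on the compute node with a small script. This is only a rough sketch, not part of SLURM. It assumes that a suspended job's processes are stopped and therefore show state T in ps, and it filters by user rather than by job, so it will also list unrelated processes owned by that user (for example a login shell). The function names are made up for illustration.

```python
import subprocess
import sys


def parse_ps_states(ps_output):
    """Map PID -> one-letter process state from `ps -o pid= -o stat=` output."""
    states = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pid, stat = fields[0], fields[1]
        # The first character of STAT is the main state (R, S, T, ...);
        # any following characters are modifiers such as '+' or 'l'.
        states[int(pid)] = stat[0]
    return states


def not_stopped(states):
    """Return the sorted PIDs whose state is anything other than 'T' (stopped)."""
    return sorted(pid for pid, state in states.items() if state != "T")


def user_process_states(user):
    """Run ps for one user on the local node and parse the result."""
    # ps exits non-zero when no process matches, so do not raise on that.
    result = subprocess.run(
        ["ps", "-u", user, "-o", "pid=", "-o", "stat="],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_ps_states(result.stdout)


if __name__ == "__main__":
    # Usage (assumed): python check_stopped.py <user-owning-the-job>
    running = not_stopped(user_process_states(sys.argv[1]))
    print("PIDs not stopped:", running)
```

Running this once per time slice on the node, while squeue still reports the job as suspended, would show whether the job's processes really leave the stopped state.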
