Moe,

Thanks, I will compile the new version and let you know whether this fixes the problem. BTW, I have a second question: if I change the SLURM version, will it be necessary to recompile the programs that use srun as their MPI launcher?
Daniel

2011/3/8 Jette, Moe <[email protected]>
> Hi Daniel,
>
> SLURM v2.2 contains quite a few bug fixes related to job preemption and
> gang scheduling. It will almost certainly fix this problem, but if it
> does not then we'll work with you on a fix.
>
> Moe
>
> ________________________________________
> From: [email protected] [[email protected]] On
> Behalf Of Daniel Adriano Silva M [[email protected]]
> Sent: Monday, March 07, 2011 10:56 PM
> To: slurm-dev
> Subject: [slurm-dev] Fwd: Problem with suspend (Preemption mode),
> requesting help with the configuration.
>
> Hi Devs,
>
> Sorry for bothering you with this small problem, but I have not been able
> to figure out what is happening here. We are running a small cluster at
> HKUST, and I tried to enable preemption in order to suspend low-priority
> jobs in favor of high-priority ones. I had previously set up this
> configuration with an older version of the program, but in 2.1.14 I see
> the following annoying behavior:
>
> When I submit the higher-priority job, it does stop the lower-priority
> one. But later, after roughly 60 seconds (the configured time slice),
> both jobs start running: the low-priority job alternates between running
> and suspended at 60-second intervals, while the high-priority job runs
> the whole time. So at 60-second intervals both jobs overlap. It looks
> like some kind of gang scheduling is still in effect, even though I have
> the max share set to 1, which in theory should disable gang scheduling
> and simply preempt low-priority jobs. What is worse, this is a broken
> gang, since the jobs overlap in a running state. More interesting still,
> squeue shows the low-priority job as suspended the whole time, while on
> the compute node it alternates as described. The behavior I want is to
> suspend the low-priority job until the high-priority one finishes. Can
> you help me?
>
> Thanks,
> Daniel Silva
>
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits
> AccountingStorageHost = compbio0
> AccountingStorageLoc = N/A
> AccountingStoragePass = (null)
> AccountingStoragePort = 7031
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AuthType = auth/munge
> BackupAddr = (null)
> BackupController = (null)
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2010-11-08T20:28:55
> CacheGroups = 0
> CheckpointType = checkpoint/none
> ClusterName = bbrc
> CompleteWait = 0 sec
> ControlAddr = 10.1.1.1
> ControlMachine = compbio
> CryptoType = crypto/munge
> DebugFlags = (null)
> DefMemPerCPU = 500
> DisableRootJobs = NO
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> FastSchedule = 1
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> HealthCheckInterval = 0 sec
> HealthCheckProgram = (null)
> InactiveLimit = 0 sec
> JobAcctGatherFrequency = 30 sec
> JobAcctGatherType = jobacct_gather/none
> JobCheckpointDir = /var/slurm/checkpoint
> JobCompHost = localhost
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPass = (null)
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobCredentialPrivateKey = /etc/slurm/slurm.key
> JobCredentialPublicCertificate = /etc/slurm/slurm.cert
> JobFileAppend = 0
> JobRequeue = 1
> KillOnBadExit = 0
> KillWait = 30 sec
> Licenses = (null)
> MailProg = /bin/mail
> MaxJobCount = 5000
> MaxMemPerCPU = 2000
> MaxTasksPerNode = 128
> MessageTimeout = 10 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> NEXT_JOB_ID = 18406
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = /etc/slurm/plugstack.conf
> PreemptMode = GANG,SUSPEND
> PreemptType = preempt/partition_prio
> PriorityType = priority/basic
> PrivateData = none
> ProctrackType = proctrack/pgid
> Prolog = (null)
> PrologSlurmctld = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> ResumeProgram = (null)
> ResumeRate = 300 nodes/min
> ResumeTimeout = 60 sec
> ResvOverRun = 0 min
> ReturnToService = 2
> SallocDefaultCommand = (null)
> SchedulerParameters = (null)
> SchedulerPort = 7321
> SchedulerRootFilter = 1
> SchedulerTimeSlice = 60 sec
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CORE_MEMORY
> SlurmUser = slurm(504)
> SlurmctldDebug = 3
> SlurmctldLogFile = (null)
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPort = 6817
> SlurmctldTimeout = 300 sec
> SlurmdDebug = 3
> SlurmdLogFile = (null)
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /tmp/slurmd
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 2.1.14
> SrunEpilog = (null)
> SrunProlog = (null)
> StateSaveLocation = /tmp
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = (null)
> SuspendRate = 60 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 30 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/affinity
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TmpFS = /tmp
> TopologyPlugin = topology/none
> TrackWCKey = 0
> TreeWidth = 50
> UsePam = 0
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> WaitTime = 0 sec
>
> Slurmctld(primary/backup) at compbio/(NULL) are UP/DOWN
>
> PartitionName=ib-high Priority=5 Nodes=node-ib-[1-46]
>    Default=NO Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=chemHuang
> PartitionName=ib-default Priority=3 Nodes=node-ib-[1-46]
>    Default=YES Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=ALL
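[For readers skimming the long dump above, a sketch of just the settings that govern the behavior Daniel describes, rewritten as a slurm.conf fragment. The comments are editorial annotations, not part of the original configuration, and the interpretation is an assumption based on the values reported in the thread.]

```
# Preemption-related settings extracted from the config dump above.
PreemptType=preempt/partition_prio   # preemption decided by partition Priority
PreemptMode=GANG,SUSPEND             # SUSPEND stops preempted jobs; GANG enables
                                     # the gang scheduler that time-slices jobs
SchedulerTimeSlice=60                # 60 s slice, matching the ~60 s alternation
                                     # Daniel observes on the compute node
# Reported SLURM version: 2.1.14 (Moe's reply says 2.2 fixes preemption bugs).

# Partition pair: ib-high (Priority=5) can preempt ib-default (Priority=3);
# both share the same nodes with Shared=FORCE:1.
PartitionName=ib-high Priority=5 Nodes=node-ib-[1-46] Default=NO Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=chemHuang
PartitionName=ib-default Priority=3 Nodes=node-ib-[1-46] Default=YES Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=ALL
```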
