Hi Daniel,

SLURM v2.2 contains quite a few bug fixes related to job preemption and gang 
scheduling.
It will almost certainly fix this problem, but if it does not then we'll work 
with you on a fix.

Moe

________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Daniel Adriano Silva M [[email protected]]
Sent: Monday, March 07, 2011 10:56 PM
To: slurm-dev
Subject: [slurm-dev] Fwd: Problem with suspend (Preemption mode), requesting 
help with the configuration.

Hi Devs,

Sorry to bother you with this small problem, but I have not been able to
figure out what is happening here. We are running a small cluster at HKUST,
and I tried to enable preemption so that high-priority jobs suspend
low-priority ones. I had previously set up this configuration with an older
version of the program, but in 2.1.14 I see the following annoying behavior:

 When I submit the higher-priority job, it does stop the lower-priority one.
But later, after roughly 60 seconds (the SchedulerTimeSlice), both jobs start
running: the suspended low-priority job alternates between running and
suspended in 60-second intervals, while the high-priority job runs the whole
time. So every 60 seconds the two jobs overlap. It looks as if some kind of
gang scheduling is still active, even though I have Shared=FORCE:1, which in
theory should disable gang time-slicing and simply preempt low-priority jobs.
Worse, this is a broken kind of gang scheduling, since the jobs overlap in a
running state. More interesting still, squeue shows the low-priority job as
suspended the whole time, yet on the compute node it is alternating between
running and suspended as described. The behavior I want is for the
low-priority job to stay suspended until the high-priority job finishes. Can
you help me?


Thanks,
Daniel Silva

AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost   = compbio0
AccountingStorageLoc    = N/A
AccountingStoragePass   = (null)
AccountingStoragePort   = 7031
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AuthType                = auth/munge
BackupAddr              = (null)
BackupController        = (null)
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2010-11-08T20:28:55
CacheGroups             = 0
CheckpointType          = checkpoint/none
ClusterName             = bbrc
CompleteWait            = 0 sec
ControlAddr             = 10.1.1.1
ControlMachine          = compbio
CryptoType              = crypto/munge
DebugFlags              = (null)
DefMemPerCPU            = 500
DisableRootJobs         = NO
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
HealthCheckInterval     = 0 sec
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30 sec
JobAcctGatherType       = jobacct_gather/none
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPass             = (null)
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobCredentialPrivateKey = /etc/slurm/slurm.key
JobCredentialPublicCertificate = /etc/slurm/slurm.cert
JobFileAppend           = 0
JobRequeue              = 1
KillOnBadExit           = 0
KillWait                = 30 sec
Licenses                = (null)
MailProg                = /bin/mail
MaxJobCount             = 5000
MaxMemPerCPU            = 2000
MaxTasksPerNode         = 128
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 18406
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/pgid
Prolog                  = (null)
PrologSlurmctld         = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvOverRun             = 0 min
ReturnToService         = 2
SallocDefaultCommand    = (null)
SchedulerParameters     = (null)
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 60 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(504)
SlurmctldDebug          = 3
SlurmctldLogFile        = (null)
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPort           = 6817
SlurmctldTimeout        = 300 sec
SlurmdDebug             = 3
SlurmdLogFile           = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /tmp/slurmd
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 2.1.14
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /tmp
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyPlugin          = topology/none
TrackWCKey              = 0
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
WaitTime                = 0 sec

Slurmctld(primary/backup) at compbio/(NULL) are UP/DOWN

PartitionName=ib-high           Priority=5 Nodes=node-ib-[1-46]   Default=NO  
Shared=FORCE:1  MaxTime=INFINITE State=UP AllowGroups=chemHuang
PartitionName=ib-default        Priority=3 Nodes=node-ib-[1-46]   Default=YES 
Shared=FORCE:1  MaxTime=INFINITE State=UP AllowGroups=ALL
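For quick reference, the preemption-relevant lines from the dump above boil
down to the following slurm.conf fragment (a sketch only, restating the
settings already shown; the intent is suspend-only preemption by partition
priority):

```
# Preemption by partition priority: jobs in ib-high (Priority=5)
# should suspend jobs in ib-default (Priority=3).
PreemptType=preempt/partition_prio
PreemptMode=GANG,SUSPEND        # gang module performs the suspend/resume
SchedulerTimeSlice=60           # seconds; matches the observed 60 s alternation

# Shared=FORCE:1 allows at most one job per resource within a partition,
# which is intended to rule out time-slicing between equal-priority jobs.
PartitionName=ib-high    Priority=5 Nodes=node-ib-[1-46] Default=NO  Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=chemHuang
PartitionName=ib-default Priority=3 Nodes=node-ib-[1-46] Default=YES Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=ALL
```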
