All of the SLURM commands support multiple versions of the RPCs
and state files. That was added so the commands can operate across
clusters, since it is not practical to update all of them at the
same time.
You can add new partitions or change partition settings either by
changing slurm.conf and reconfiguring, or by making the changes directly
with the scontrol command's update options (see "man scontrol").
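As a sketch of the second approach (the partition name and settings below are hypothetical examples, and these commands require a live cluster with administrative privileges):

```shell
# Create a new partition on the fly (name and node list are examples only).
scontrol create PartitionName=debug Nodes=node-ib-[1-4] MaxTime=60 State=UP

# Change a setting of an existing partition, e.g. raise its priority.
scontrol update PartitionName=debug Priority=10

# Alternatively, after editing slurm.conf, push the changes out:
scontrol reconfigure
```

Note that, per the scontrol man page, changes made this way are generally not preserved across a slurmctld restart unless they are also written into slurm.conf.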
Quoting Daniel Adriano Silva M <[email protected]>:
Hi,
Thanks for the answer, but it raises another question. Usually when I
modify something in SLURM I run reconfigure for the changes to take
effect; however, you point out that I should not even lose running jobs. How
does that work? More generally, is it possible to change the configuration, e.g.
to create new partitions or change partition settings, without destroying
currently running jobs?
Thanks,
Daniel
2011/3/9 Jette, Moe <[email protected]>
Daniel,
You should not even lose any running jobs from the upgrade.
The only possible problem is any of your applications that link
directly to libslurm (i.e. directly use the SLURM APIs).
Moe
From: [email protected] [[email protected]] On
Behalf Of Daniel Adriano Silva M [[email protected]]
Sent: Tuesday, March 08, 2011 8:48 PM
To: [email protected]
Subject: Re: [slurm-dev] Fwd: Problem with suspend (Preemption mode),
requesting help with the configuration.
Moe,
Thanks, I will compile the new version and let you know if this has been
fixed. BTW, I have a second question: if I change the version of SLURM, will
it be necessary to recompile the programs that use srun for MPI?
Daniel
2011/3/8 Jette, Moe <[email protected]>
Hi Daniel,
SLURM v2.2 contains quite a few bug fixes related to job preemption and
gang scheduling.
It will almost certainly fix this problem, but if it does not then we'll
work with you on a fix.
Moe
________________________________________
From: [email protected]<mailto:[email protected]>
[[email protected]<mailto:[email protected]>] On
Behalf Of Daniel Adriano Silva M [[email protected]<mailto:
[email protected]>]
Sent: Monday, March 07, 2011 10:56 PM
To: slurm-dev
Subject: [slurm-dev] Fwd: Problem with suspend (Preemption mode),
requesting help with the configuration.
Hi Devs,
Sorry to bother you with this small problem, but I have not been able
to figure out what is happening here. We are running a small cluster at
HKUST and I tried to enable preemption so that high-priority jobs suspend
low-priority ones. I had previously set up this configuration
with an older version of the program, but in 2.1.14 I see the following
annoying behavior:
When I submit the higher-priority job, it does in fact stop the lower-priority
one. But later, after roughly 60 sec (matching the configured time slice),
both jobs end up running: the suspended low-priority job alternates between
running and suspended in 60-second intervals, while the high-priority job runs
the whole time. So in 60-second intervals I have both jobs overlapping. It
looks like some kind of gang scheduling is still in effect, even though I have
max share set to 1, which in theory should disable gang scheduling and just preempt
low-priority jobs. What is worse, this is a broken gang, since the
jobs overlap in a running state. More interestingly, squeue
shows the low-priority job as suspended the whole time,
yet on the compute node it alternates between running and suspended as described. The
behavior I want is for the low-priority job to stay suspended until the high-priority
job finishes. Can you help me?
Thanks,
Daniel Silva
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost = compbio0
AccountingStorageLoc = N/A
AccountingStoragePass = (null)
AccountingStoragePort = 7031
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2010-11-08T20:28:55
CacheGroups = 0
CheckpointType = checkpoint/none
ClusterName = bbrc
CompleteWait = 0 sec
ControlAddr = 10.1.1.1
ControlMachine = compbio
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerCPU = 500
DisableRootJobs = NO
EnforcePartLimits = NO
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/none
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPass = (null)
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobCredentialPrivateKey = /etc/slurm/slurm.key
JobCredentialPublicCertificate = /etc/slurm/slurm.cert
JobFileAppend = 0
JobRequeue = 1
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 5000
MaxMemPerCPU = 2000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 18406
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/pgid
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 2
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 60 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY
SlurmUser = slurm(504)
SlurmctldDebug = 3
SlurmctldLogFile = (null)
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPort = 6817
SlurmctldTimeout = 300 sec
SlurmdDebug = 3
SlurmdLogFile = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.1.14
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /tmp
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
WaitTime = 0 sec
Slurmctld(primary/backup) at compbio/(NULL) are UP/DOWN
PartitionName=ib-high Priority=5 Nodes=node-ib-[1-46] Default=NO Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=chemHuang
PartitionName=ib-default Priority=3 Nodes=node-ib-[1-46] Default=YES Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=ALL
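For reference, the preemption-related pieces of the configuration above reduce to the following minimal slurm.conf sketch (values copied from the output; this is illustrative of the setup being discussed, not a verified working configuration):

```
# Preempt based on partition priority; preempted jobs are suspended.
# With GANG also set, suspended jobs may be time-sliced back in,
# which is the behavior reported in this thread under 2.1.14.
PreemptType=preempt/partition_prio
PreemptMode=GANG,SUSPEND

# Two partitions on the same nodes; the higher-priority one preempts.
PartitionName=ib-high    Priority=5 Nodes=node-ib-[1-46] Default=NO  Shared=FORCE:1 MaxTime=INFINITE State=UP
PartitionName=ib-default Priority=3 Nodes=node-ib-[1-46] Default=YES Shared=FORCE:1 MaxTime=INFINITE State=UP
```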