SLURM's QOS and resource limits web pages describe most of this:
http://www.schedmd.com/slurmdocs/qos.html
http://www.schedmd.com/slurmdocs/resource_limits.html
Quoting Lyn Gerner <[email protected]>:
PS: Moe, is there a related document? Couldn't find anything obvious.
Thanks,
Lyn
On Mon, Oct 31, 2011 at 12:59 PM, Lyn Gerner <[email protected]> wrote:
Great, thanks Moe.
On Mon, Oct 31, 2011 at 10:39 AM, Moe Jette <[email protected]> wrote:
This works for me.
What version of SLURM are you running?
You might want to look at your SlurmctldLogFile.
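One way to act on that suggestion (a sketch; the log path shown is an assumption, not taken from this cluster's config):

```shell
# Ask the running controller where its log file actually is:
scontrol show config | grep -i SlurmctldLogFile

# Then watch it while resubmitting the failing job.
# (Path below is a placeholder; use whatever the command above printed.)
tail -f /var/log/slurm/slurmctld.log
```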
Lyn,
You can use the QOS mechanism as Matt is, with flags (e.g.
"Flags=PartitionTimeLimit") to override partition time and/or size limits.
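A minimal sketch of that approach; the QOS name "longrun", the user name, and the 7-day limit here are placeholders, not values from a real cluster:

```shell
# Create a QOS whose MaxWall exceeds the partition's MaxTime, with the
# flag that lets it override the partition time limit:
sacctmgr add qos Name=longrun MaxWall=7-0 Flags=PartitionTimeLimit

# Grant the QOS to a user's association (requires slurmdbd accounting):
sacctmgr modify user name=someuser set qos+=longrun

# The user must then request the QOS explicitly on submission:
srun --qos=longrun -t 7-0 hostname
```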
Quoting Matteo Guglielmi <[email protected]>:
Dear All,
I'm trying to create a simple qos called 1week which
I would like to associate to those users who do need
to run for one week instead of 2 days at maximum:
### slurm.conf ###
EnforcePartLimits=YES
TaskPlugin=task/affinity
TaskPluginParam=Sched
TopologyPlugin=topology/none
TrackWCKey=no
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityFavorSmall=YES
PriorityMaxAge=7-0
PriorityUsageResetPeriod=NONE
PriorityWeightAge=1000
PriorityWeightFairshare=1000
PriorityWeightJobSize=10000
PriorityWeightPartition=10000
PriorityWeightQOS=10000
AccountingStorageEnforce=limits,qos
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/linux
PreemptMode=suspend,gang
PreemptType=preempt/partition_prio
NodeName=DEFAULT TmpDisk=16384 State=UNKNOWN
NodeName=foff[01-08] Procs=8 CoresPerSocket=4 Sockets=2 ThreadsPerCore=1 RealMemory=7000 Weight=1 Feature=X5482,foff,fofflm
NodeName=foff[09-13] Procs=48 CoresPerSocket=12 Sockets=4 ThreadsPerCore=1 RealMemory=127000 Weight=1 Feature=6176,foff,foffhm
PartitionName=DEFAULT DefaultTime=60 MinNodes=1 MaxNodes=UNLIMITED MaxTime=2-0 PreemptMode=SUSPEND Shared=FORCE:1 State=UP Default=NO
PartitionName=batch Nodes=foff[01-13] Default=YES
PartitionName=foff1 Nodes=foff[01-08] Priority=1000
PartitionName=foff2 Nodes=foff[09-13] Priority=1000
#################
sacctmgr list associations format=Account,Cluster,User,Fairshare,Partition,defaultqos,qos tree withd

   Account    Cluster       User  Share  Partition  Def QOS           QOS
---------- ---------- ---------- ------ ---------- -------- -------------
      root     superb                 1                            normal
      root     superb       root      1                            normal
        sb     superb                 1                            normal
        sb     superb   belushki      1      batch                 normal
        sb     superb     fiocco      1      batch                 normal
     gr-fo     superb                 1                            normal
     gr-fo     superb   belushki      1      foff1                 normal
     gr-fo     superb   belushki      1      foff2                 normal
     gr-fo     superb     fiocco      1      foff1                 normal
     gr-fo     superb     fiocco      1      foff2                 normal
sacctmgr add qos Name=1week MaxWall=7-0 Priority=100 PreemptMode=Cluster Flags=PartitionTimeLimit
sacctmgr modify user name=belushki Account=gr-fo set qos+=1week
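A quick way to confirm the new QOS and the updated association were stored as intended (a sketch; the format fields are standard sacctmgr options, the names are taken from the commands above):

```shell
# Check that the QOS exists with its MaxWall and flag:
sacctmgr show qos where name=1week format=Name,Priority,MaxWall,Flags

# Check that belushki's associations now list the QOS:
sacctmgr show assoc where user=belushki format=User,Account,Partition,QOS
```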
sacctmgr list associations format=Account,Cluster,User,Fairshare,Partition,defaultqos,qos tree withd

   Account    Cluster       User  Share  Partition  Def QOS           QOS
---------- ---------- ---------- ------ ---------- -------- -------------
      root     superb                 1                            normal
      root     superb       root      1                            normal
        sb     superb                 1                            normal
        sb     superb   belushki      1      batch                 normal
        sb     superb     fiocco      1      batch                 normal
     gr-fo     superb                 1                            normal
     gr-fo     superb   belushki      1      foff1          1week,normal
     gr-fo     superb   belushki      1      foff2          1week,normal
     gr-fo     superb     fiocco      1      foff1                 normal
     gr-fo     superb     fiocco      1      foff2                 normal
/etc/init.d/slurmd restart (same command was issued on all nodes too)
su - belushki
srun -p foff2 -A gr-fo --qos=1week -t 7-0 hostname
srun: error: Unable to allocate resources: Requested time limit is invalid (exceeds some limit)
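One way to narrow down which limit is rejecting the request (a diagnostic sketch only; partition, account, and QOS names are taken from the thread above):

```shell
# What does the partition itself allow?
scontrol show partition foff2 | grep -i MaxTime

# Does the QOS really carry the override flag?
sacctmgr show qos where name=1week format=Name,MaxWall,Flags

# A request just under the partition's 2-day MaxTime isolates whether
# the partition limit or something else is the blocker:
srun -p foff2 -A gr-fo --qos=1week -t 2-0 hostname
```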
Could you tell me what I'm still missing to make this work for user
"belushki"?
Thanks,
--matt