Hi all,
I've configured Slurm 2.6.3 on a GPU cluster with accounting support
via SlurmDBD; please find my configuration (slurm.conf) attached.
It all works fine for me, but every other user hits a wall-time limit of
10 minutes per job step. See:
       JobID  Timelimit    Elapsed        NodeList
------------ ---------- ---------- ---------------
2751         1-00:30:00   00:10:56   k20n[001-002]
2751.batch                00:10:56         k20n001
2751.0                    00:00:53   k20n[001-002]
2751.1                    00:10:03   k20n[001-002]
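For reference, output like the above can be produced with an sacct query
along these lines (the column names match sacct's --format fields; the
exact invocation is reconstructed):

    sacct -j 2751 --format=JobID,Timelimit,Elapsed,NodeList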
Any idea how to remove this limit?
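The only wall-time limit in the attached slurm.conf is the partition
MaxTime of 120 hours, so I suspect the 10-minute cap comes from the
accounting side (an association or QOS limit in slurmdbd). If anyone
wants to reproduce the check, those can be inspected with something like
the following, where <user> is a placeholder for an affected user:

    # show any MaxWall set on the user's association
    sacctmgr show association user=<user> format=User,Account,Partition,MaxWall
    # show any MaxWall set at the QOS level
    sacctmgr show qos format=Name,MaxWall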
Thank you,
Albert
--
---------------------------------
Dr. Albert Solernou
Research Associate
Oxford Supercomputing Centre,
University of Oxford
Tel: +44 (0)1865 610631
---------------------------------
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=arcusgpu
ControlMachine=login1
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/slurm.state
SlurmdSpoolDir=/var/spool/slurm/slurmd
SwitchType=switch/none
MpiDefault=none
# MpiParams=ports=12000-12999
SlurmctldPidFile=/var/spool/slurm/run/slurmctld.pid
SlurmdPidFile=/var/spool/slurm/slurmd.pid
# ProctrackType=proctrack/pgid
ProctrackType=proctrack/linuxproc
#PluginDir=
# CacheGroups=0
CacheGroups=1
#FirstJobId=
ReturnToService=0
RebootProgram=/usr/bin/reboot
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_Core
GresTypes=gpu
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/spool/slurm/ctld.log
SlurmdDebug=3
SlurmdLogFile=/var/spool/slurm/%n.d.log
JobCompType=jobcomp/filetxt
JobCompLoc=/var/spool/slurm/log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=login1
AccountingStoragePort=6819
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
NodeName=k20n00[1-8] Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:2 State=UNKNOWN
PartitionName=k20 Nodes=k20n00[1-8] Default=NO MaxTime=120:00:00 State=UP