Take a look at your slurmctld and slurmd log files. My _guess_
is that the clock on one or more of your nodes is out of sync
and that is preventing message authentication from succeeding.
As I recall, Munge credentials are valid for a five-minute
window. If any of your nodes have a clock more than that far
out of sync, messages will get discarded. Although SLURM does
have some recovery mechanisms, clock skew like that will
produce long delays like the ones you are seeing.
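If you want to check that theory, a couple of quick commands will show any clock skew and whether a Munge credential minted on one host decodes on another. (The node names below are just examples from your config; substitute whatever nodes are misbehaving.)

```shell
# Compare each node's epoch clock against the submit host's.
# A difference of more than ~300 s exceeds Munge's default credential lifetime.
for n in z23 z24 z25; do
    remote=$(ssh "$n" date +%s)
    local_ts=$(date +%s)
    echo "$n skew: $((remote - local_ts)) s"
done

# Encode a credential locally and decode it on a remote node.
# "Expired credential" or "Rewound credential" from unmunge points at clock skew.
munge -n | ssh z23 unmunge
```

If the skew is real, syncing all nodes against a common NTP source should clear it up.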

Quoting Paul Thirumalai <[email protected]>:

Hi All,
So I am trying to launch a script using sbatch, but it just seems to be
taking too long to complete (sbatch takes 2-3 seconds to complete).

The command I am using is
/usr/bin/sbatch --output=/dev/null --error=/dev/null  --begin=now
<script_name>

This command takes around 3.5 seconds to complete. I am not sure why it's
taking so long. Earlier I had changed the config to use select/linear
instead of select/cons_res, and after that all the issues started. I
reverted the config, but to no avail.

Please help! This issue is really degrading the performance of my
Slurm install. Each job submission just takes too long.

My config is below
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=z21
#ControlAddr=
BackupController=z18
#BackupAddr=
#
AuthType=auth/none
#AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6820-6823
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=327
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
MessageTimeout=100
#ResvOverRun=0
#MinJobAge=20
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=0
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SchedulerParameters=max_job_bf=50,interval=60
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=z21
AccountingStorageLoc=slurm_job_acc
AccountingStoragePass=slurm
AccountingStoragePort=3306
AccountingStorageType=accounting_storage/mysql
AccountingStorageUser=slurm
ClusterName=cluster
#DebugFlags=
JobCompHost=z21
JobCompLoc=slurm_job_comp
JobCompPass=slurm
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=
SlurmdLogFile=/var/log/slurm/slurmd.log.%h
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=z[23,24,26,28-39] Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
NodeName=z[25,27] Procs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1
NodeName=obi[23-40] Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
NodeName=w[001-108] Procs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1
NodeName=y[002-010,012-025,027-038] Procs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1
NodeName=y[040-062,064-108,111-119,121-186] Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
PartitionName=z_part Nodes=z[23-39] Default=NO Shared=NO MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=obi_part Nodes=obi[23-40] Default=NO Shared=NO MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=w_part Nodes=w[001-108] Default=NO Shared=NO MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=y_part Nodes=y[002-010,012-025,027-038,040-062,064-108,111-119,121-186] Default=NO Shared=NO MaxTime=INFINITE State=UP
PartitionName=all_part Nodes=z[25-39],obi[23-40],w[001-108],y[002-010,012-025,027-038,041-062,064-108,111-119,121-186] Shared=NO Default=YES MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=gov_part Nodes=z[23,24] Shared=NO Default=NO MaxNodes=1 MaxTime=INFINITE State=UP