Take a look at your slurmctld and slurmd log files. My _guess_ is that the clock on one or more of your nodes is out of sync, and that is preventing message authentication from succeeding. As I recall, Munge credentials are valid for a five-minute window; if any of your nodes has a clock more than that far out of sync, messages from it will get discarded. Although SLURM does have some recovery mechanisms for this, they involve retries, so long delays like the ones you are seeing will occur.
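One quick way to check is to compare each node's clock against the submit host and flag anything outside the credential window. The sketch below assumes Munge's default ~300-second TTL and passwordless ssh to the nodes; the node names in the comments are just examples from your config.

```shell
#!/bin/sh
# Flag clock skew larger than the Munge credential validity window.
# MAX_SKEW=300 assumes Munge's default ~5-minute TTL; adjust if yours differs.
MAX_SKEW=300

# skew_ok LOCAL_EPOCH REMOTE_EPOCH
# Prints the absolute skew and returns non-zero if it exceeds MAX_SKEW.
skew_ok() {
    local_t=$1
    remote_t=$2
    diff=$((local_t - remote_t))
    [ "$diff" -lt 0 ] && diff=$((-diff))
    if [ "$diff" -ge "$MAX_SKEW" ]; then
        echo "skew ${diff}s: OUT OF RANGE"
        return 1
    fi
    echo "skew ${diff}s: ok"
    return 0
}

# On the cluster itself (assuming passwordless ssh), something like:
#   for n in z23 z24 z25; do
#       skew_ok "$(date +%s)" "$(ssh "$n" date +%s)" || echo "$n: clock bad"
#   done
# You can also test credential decoding end to end with the munge tools:
#   munge -n | ssh z23 unmunge    # fails if keys or clocks disagree
```

For example, `skew_ok 1000 1200` prints `skew 200s: ok`, while a 1000-second offset is reported as out of range.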
Quoting Paul Thirumalai <[email protected]>:
Hi All,

I am trying to launch a script using sbatch, but it seems to be taking too long to complete (sbatch takes 2-3 seconds to complete). The command I am using is:

/usr/bin/sbatch --output=/dev/null --error=/dev/null --begin=now <script_name>

This command takes around 3.5 seconds to complete, and I am not sure why. Earlier I had changed the config to use select/linear instead of select/cons_res, and after that all the issues started. I reverted the config, but to no avail. Please help! This issue is really degrading the performance of my SLURM install; each job submission just takes too long.

My config is below:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=z21
#ControlAddr=
BackupController=z18
#BackupAddr=
#
AuthType=auth/none
#AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6820-6823
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=327
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
MessageTimeout=100
#ResvOverRun=0
#MinJobAge=20
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=0
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SchedulerParameters=max_job_bf=50,interval=60
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=z21
AccountingStorageLoc=slurm_job_acc
AccountingStoragePass=slurm
AccountingStoragePort=3306
AccountingStorageType=accounting_storage/mysql
AccountingStorageUser=slurm
ClusterName=cluster
#DebugFlags=
JobCompHost=z21
JobCompLoc=slurm_job_comp
JobCompPass=slurm
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=
SlurmdLogFile=/var/log/slurm/slurmd.log.%h
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=z[23,24,26,28-39] Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
NodeName=z[25,27] Procs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1
NodeName=obi[23-40] Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
NodeName=w[001-108] Procs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1
NodeName=y[002-010,012-025,027-038] Procs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1
NodeName=y[040-062,064-108,111-119,121-186] Procs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
PartitionName=z_part Nodes=z[23-39] Default=NO Shared=NO MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=obi_part Nodes=obi[23-40] Default=NO Shared=NO MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=w_part Nodes=w[001-108] Default=NO Shared=NO MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=y_part Nodes=y[002-010,012-025,027-038,040-062,064-108,111-119,121-186] Default=NO Shared=NO MaxTime=INFINITE State=UP
PartitionName=all_part Nodes=z[25-39],obi[23-40],w[001-108],y[002-010,012-025,027-038,041-062,064-108,111-119,121-186] Shared=NO Default=YES MaxNodes=1 MaxTime=INFINITE State=UP
PartitionName=gov_part Nodes=z[23,24] Shared=NO Default=NO MaxNodes=1 MaxTime=INFINITE State=UP
