Sefa,

"When I checked with sinfo, I see all the nodes are up, but tens of "error: slurm_receive_msg: Zero Bytes were transmitted or received" lines are printed into /var/log/slurm/slurmctld.log file every second."

I've experienced this message before. If I recall correctly, it has to do with a misconfiguration. I'd double-check that all of your nodes have the same slurm.conf file. I've also seen this previously when I've made a "significant" change to our configuration and simply ran "scontrol reconfigure".

If I were in your shoes, I'd disable the "NO_CONF_HASH" DebugFlags value, turn up debugging verbosity, and double-check that all nodes in your cluster have the same slurm.conf. I would then restart all of the slurmd daemons on the nodes and then restart the slurmctld daemon on your controller(s).

John DeSantis

2015-02-13 2:25 GMT-05:00 Sefa Arslan <[email protected]>:
> Hello..
>
> I could solve the problem related to job scheduling on the cuda partition.
> But I still see the "error: slurm_receive_msg: Zero Bytes were transmitted
> or received" messages even if all the nodes seem to be up, maybe due to
> some small network problems. Is there any way to see for which nodes these
> messages are printed?
>
> thanks..
>
> Disclaimer <http://www.tubitak.gov.tr/sorumlulukreddi>
>
> On 02/13/2015 02:19 AM, Bruce Roberts wrote:
>
> This might be related to 13.12 not being a production release. You
> probably want 14.03 or 14.11 instead.
>
> I also noticed you have DebugFlags = NO_CONF_HASH
>
> Quoting Danny...
>
> "The NO_CONF_HASH is very dangerous in most systems. It should be avoided
> at all cost."
>
> On 02/12/2015 05:47 AM, Sefa Arslan wrote:
>
> Hi,
>
> we are using slurm-13.12.0
> When I checked with sinfo, I see all the nodes are up, but tens of "error:
> slurm_receive_msg: Zero Bytes were transmitted or received" lines are
> printed into the /var/log/slurm/slurmctld.log file every second. Is there a
> way to see for which node these error lines are printed?
>
> Another problem: although there are lots of idle nodes and no other
> pending jobs in our cuda-supported queue, starting a new job takes too
> much time, sometimes 10-20 minutes.
> for example
> "srun -n 10 -N1 -p cuda hostname" starts in seconds, but "srun -n 10
> -N1 --gres=gpu:2 -p cuda hostname" takes more than 20 minutes.
>
> the gres.conf:
> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia1
>
> slurm.conf:
>
> NodeName=levrek[129-144] Procs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000 Gres=gpu:2
> PartitionName=cuda Nodes=levrek[129-144] Default=no MaxTime=15-00:00:00 defaulttime=00:02:00 State=UP DefMemPerCPU=10000 MaxMemPerNode=250000 Shared=NO Priority=1000
>
> Our config is:
> scontrol show config
> Configuration data as of 2015-02-12T15:08:39
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits
> AccountingStorageHost = slurmcontroller3
> AccountingStorageLoc = N/A
> AccountingStoragePort = 6819
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = YES
> AcctGatherEnergyType = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInfinibandType = acct_gather_infiniband/none
> AcctGatherNodeFreq = 0 sec
> AcctGatherProfileType = acct_gather_profile/none
> AuthType = auth/munge
> BackupAddr = (null)
> BackupController = (null)
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2015-02-12T15:08:27
> CacheGroups = 1
> CheckpointType = checkpoint/blcr
> ClusterName = truba
> CompleteWait = 0 sec
> ControlAddr = slurmcontroller3
> ControlMachine = slurmcontroller3
> CryptoType = crypto/munge
> DebugFlags = NO_CONF_HASH
> DefMemPerNode = UNLIMITED
> DisableRootJobs = NO
> DynAllocPort = 0
> EnforcePartLimits = YES
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType = ext_sensors/none
> ExtSensorsFreq = 0 sec
> FairShareDampeningFactor = 1
> FastSchedule = 1
> FirstJobId = 100000
> GetEnvTimeout = 2 sec
> GresTypes = gpu
> GroupUpdateForce = 0
> GroupUpdateTime = 600 sec
> HASH_VAL = Match
> HealthCheckInterval = 0 sec
> HealthCheckNodeState = ANY
> HealthCheckProgram = (null)
> InactiveLimit = 0 sec
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/linux
> JobAcctGatherParams = (null)
> JobCheckpointDir = /tmp/slurmcheckpoint
> JobCompHost = localhost
> JobCompLoc = /var/log/slurm/job_completions
> JobCompPort = 0
> JobCompType = jobcomp/filetxt
> JobCompUser = root
> JobContainerPlugin = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = lua
> KeepAliveTime = SYSTEM_DEFAULT
> KillOnBadExit = 0
> KillWait = 30 sec
> LaunchType = launch/slurm
> Licenses = (null)
> LicensesUsed = (null)
> MailProg = /etc/slurm/mail.slurm
> MaxArraySize = 65000
> MaxJobCount = 1000000
> MaxJobId = 4294901760
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 128
> MessageTimeout = 10 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> NEXT_JOB_ID = 732449
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = /etc/slurm/plugstack.conf
> PreemptMode = GANG,SUSPEND
> PreemptType = preempt/partition_prio
> PriorityDecayHalfLife = 00:00:00
> PriorityCalcPeriod = 00:05:00
> PriorityFavorSmall = 1
> PriorityFlags = 0
> PriorityMaxAge = 14-00:00:00
> PriorityUsageResetPeriod = NONE
> PriorityType = priority/multifactor
> PriorityWeightAge = 1000
> PriorityWeightFairShare = 0
> PriorityWeightJobSize = 1000
> PriorityWeightPartition = 1000
> PriorityWeightQOS = 1000000
> PrivateData = jobs
> ProctrackType = proctrack/cgroup
> Prolog = (null)
> PrologSlurmctld = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = (null)
> PropagateResourceLimitsExcept = MEMLOCK
> RebootProgram = (null)
> ReconfigFlags = (null)
> ResumeProgram = (null)
> ResumeRate = 300 nodes/min
> ResumeTimeout = 300 sec
> ResvEpilog = (null)
> ResvOverRun = 0 min
> ResvProlog = (null)
> ReturnToService = 1
> SallocDefaultCommand = (null)
> SchedulerParameters = bf_max_job_test=100,partition_job_depth=100,bf_window=7200,bf_resolution=180,bf_continue,max_sched_time=4,preempt_strict_order
> SchedulerPort = 7321
> SchedulerRootFilter = 1
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU_MEMORY
> SlurmUser = root(0)
> SlurmctldDebug = info
> SlurmctldLogFile = /var/log/slurm/slurmctld.log
> SlurmSchedLogFile = (null)
> SlurmctldPort = 6816-6817
> SlurmctldTimeout = 300 sec
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurm/slurmd.log
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPlugstack = (null)
> SlurmdPort = 6818
> SlurmdSpoolDir = /tmp/slurmd
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPlugstack = (null)
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 13.12.0-0pre4
> SrunEpilog = (null)
> SrunProlog = (null)
> StateSaveLocation = /slurm.state
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = (null)
> SuspendRate = 60 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 30 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/cgroup
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TmpFS = /tmp
> TopologyPlugin = topology/none
> TrackWCKey = 0
> TreeWidth = 50
> UsePam = 0
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 0 percent
> WaitTime = 0 sec
>
> Thanks..
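
The two checks suggested in this thread (look for NO_CONF_HASH in the running configuration, and confirm that every node carries the same slurm.conf) could be scripted roughly as follows. This is only a minimal sketch under some assumptions: the function names are made up for illustration, the `scontrol show config` output is assumed to be "Key = Value" lines as in the dump above, and the slurm.conf contents are assumed to have already been collected from each node by whatever means you prefer. The SHA-256 comparison is a standalone check, not Slurm's own configuration hash.

```python
import hashlib


def parse_scontrol_config(text: str) -> dict[str, str]:
    """Parse 'Key = Value' lines as printed by `scontrol show config`."""
    settings = {}
    for raw in text.splitlines():
        # Drop mail-quote markers ("> ") in case the dump was pasted from an email.
        line = raw.lstrip("> ").strip()
        key, sep, value = line.partition("=")
        if sep:  # lines without "=" (header, blank lines) are skipped
            settings[key.strip()] = value.strip()
    return settings


def config_warnings(settings: dict[str, str]) -> list[str]:
    """Flag the two settings discussed in this thread."""
    warnings = []
    if "NO_CONF_HASH" in settings.get("DebugFlags", "").split(","):
        warnings.append("DebugFlags contains NO_CONF_HASH")
    # Assumption: any HASH_VAL other than "Match" deserves a closer look.
    if settings.get("HASH_VAL", "Match") != "Match":
        warnings.append("HASH_VAL is not 'Match'")
    return warnings


def nodes_differing_from(reference: str, conf_by_node: dict[str, str]) -> list[str]:
    """Return the nodes whose slurm.conf text differs from the reference node's."""
    def digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    ref = digest(conf_by_node[reference])
    return sorted(n for n, text in conf_by_node.items() if digest(text) != ref)


# Hypothetical data, for illustration only.
dump = """Configuration data as of 2015-02-12T15:08:39
DebugFlags = NO_CONF_HASH
HASH_VAL = Match
"""
confs = {
    "slurmcontroller3": "NodeName=levrek[129-144] Gres=gpu:2\n",
    "levrek129": "NodeName=levrek[129-144] Gres=gpu:2\n",
    "levrek130": "NodeName=levrek[129-144]\n",  # stale copy without Gres
}
print(config_warnings(parse_scontrol_config(dump)))
print(nodes_differing_from("slurmcontroller3", confs))
```

With the sample data, the first call reports the NO_CONF_HASH flag and the second lists levrek130 as the node with a differing file. Note that with NO_CONF_HASH set, HASH_VAL reporting "Match" may not tell you much, which is why comparing the files directly seems worthwhile.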
