Hi,
I may be wrong, but the "Zero Bytes were transmitted or received" error makes me think of MUNGE. Check that the MUNGE daemon is running on your nodes and on your controller.
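
A minimal sketch of that check, assuming the daemon process is named munged and that the compute nodes are reachable over ssh. The function name check_munged and the REMOTE_SH variable are made up for illustration; they are not part of Slurm or MUNGE.

```shell
#!/bin/sh
# Report, for every host given as an argument, whether a process named
# "munged" is running there.
# REMOTE_SH is a hypothetical knob: the command used to reach a host.
# It defaults to ssh and can be replaced by a stub for a dry run.
check_munged() {
    remote=${REMOTE_SH:-ssh}
    for host in "$@"; do
        # pgrep -x matches the exact process name and exits 0 on a match
        if $remote "$host" pgrep -x munged >/dev/null 2>&1; then
            echo "$host: munged running"
        else
            echo "$host: munged NOT running"
        fi
    done
}
```

For example, "check_munged slurmcontroller3 levrek129 levrek130" prints one line per host. A running daemon does not prove that the keys match, so it may also be worth running "munge -n | unmunge" on the controller and "munge -n | ssh levrek129 unmunge" towards a node, to see whether a credential created on one side decodes on the other.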

Sefa Arslan wrote on 2015-02-12 16:48:
Hi,

we are using slurm-13.12.0. When I check with sinfo, all the nodes are up, but tens of lines like

    error: slurm_receive_msg: Zero Bytes were transmitted or received

are written to /var/log/slurm/slurmctld.log every second. Is there a way to see which node these error lines refer to?

Another problem: although there are lots of idle nodes and no other pending jobs in our CUDA-capable queue, starting a new job takes too long, sometimes 10-20 minutes. For example,

    srun -n 10 -N1 -p cuda hostname

starts within seconds, but

    srun -n 10 -N1 --gres=gpu:2 -p cuda hostname

takes more than 20 minutes.

gres.conf:

    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

slurm.conf:

    NodeName=levrek[129-144] Procs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000 Gres=gpu:2
    PartitionName=cuda Nodes=levrek[129-144] Default=no MaxTime=15-00:00:00 defaulttime=00:02:00 State=UP DefMemPerCPU=10000 MaxMemPerNode=250000 Shared=NO Priority=1000

Our config is:

    scontrol show config
    Configuration data as of 2015-02-12T15:08:39
    AccountingStorageBackupHost = (null)
    AccountingStorageEnforce = associations,limits
    AccountingStorageHost = slurmcontroller3
    AccountingStorageLoc = N/A
    AccountingStoragePort = 6819
    AccountingStorageType = accounting_storage/slurmdbd
    AccountingStorageUser = N/A
    AccountingStoreJobComment = YES
    AcctGatherEnergyType = acct_gather_energy/none
    AcctGatherFilesystemType = acct_gather_filesystem/none
    AcctGatherInfinibandType = acct_gather_infiniband/none
    AcctGatherNodeFreq = 0 sec
    AcctGatherProfileType = acct_gather_profile/none
    AuthType = auth/munge
    BackupAddr = (null)
    BackupController = (null)
    BatchStartTimeout = 10 sec
    BOOT_TIME = 2015-02-12T15:08:27
    CacheGroups = 1
    CheckpointType = checkpoint/blcr
    ClusterName = truba
    CompleteWait = 0 sec
    ControlAddr = slurmcontroller3
    ControlMachine = slurmcontroller3
    CryptoType = crypto/munge
    DebugFlags = NO_CONF_HASH
    DefMemPerNode = UNLIMITED
    DisableRootJobs = NO
    DynAllocPort = 0
    EnforcePartLimits = YES
    Epilog = (null)
    EpilogMsgTime = 2000 usec
    EpilogSlurmctld = (null)
    ExtSensorsType = ext_sensors/none
    ExtSensorsFreq = 0 sec
    FairShareDampeningFactor = 1
    FastSchedule = 1
    FirstJobId = 100000
    GetEnvTimeout = 2 sec
    GresTypes = gpu
    GroupUpdateForce = 0
    GroupUpdateTime = 600 sec
    HASH_VAL = Match
    HealthCheckInterval = 0 sec
    HealthCheckNodeState = ANY
    HealthCheckProgram = (null)
    InactiveLimit = 0 sec
    JobAcctGatherFrequency = 30
    JobAcctGatherType = jobacct_gather/linux
    JobAcctGatherParams = (null)
    JobCheckpointDir = /tmp/slurmcheckpoint
    JobCompHost = localhost
    JobCompLoc = /var/log/slurm/job_completions
    JobCompPort = 0
    JobCompType = jobcomp/filetxt
    JobCompUser = root
    JobContainerPlugin = job_container/none
    JobCredentialPrivateKey = (null)
    JobCredentialPublicCertificate = (null)
    JobFileAppend = 0
    JobRequeue = 1
    JobSubmitPlugins = lua
    KeepAliveTime = SYSTEM_DEFAULT
    KillOnBadExit = 0
    KillWait = 30 sec
    LaunchType = launch/slurm
    Licenses = (null)
    LicensesUsed = (null)
    MailProg = /etc/slurm/mail.slurm
    MaxArraySize = 65000
    MaxJobCount = 1000000
    MaxJobId = 4294901760
    MaxMemPerNode = UNLIMITED
    MaxStepCount = 40000
    MaxTasksPerNode = 128
    MessageTimeout = 10 sec
    MinJobAge = 300 sec
    MpiDefault = none
    MpiParams = (null)
    NEXT_JOB_ID = 732449
    OverTimeLimit = 0 min
    PluginDir = /usr/lib64/slurm
    PlugStackConfig = /etc/slurm/plugstack.conf
    PreemptMode = GANG,SUSPEND
    PreemptType = preempt/partition_prio
    PriorityDecayHalfLife = 00:00:00
    PriorityCalcPeriod = 00:05:00
    PriorityFavorSmall = 1
    PriorityFlags = 0
    PriorityMaxAge = 14-00:00:00
    PriorityUsageResetPeriod = NONE
    PriorityType = priority/multifactor
    PriorityWeightAge = 1000
    PriorityWeightFairShare = 0
    PriorityWeightJobSize = 1000
    PriorityWeightPartition = 1000
    PriorityWeightQOS = 1000000
    PrivateData = jobs
    ProctrackType = proctrack/cgroup
    Prolog = (null)
    PrologSlurmctld = (null)
    PropagatePrioProcess = 0
    PropagateResourceLimits = (null)
    PropagateResourceLimitsExcept = MEMLOCK
    RebootProgram = (null)
    ReconfigFlags = (null)
    ResumeProgram = (null)
    ResumeRate = 300 nodes/min
    ResumeTimeout = 300 sec
    ResvEpilog = (null)
    ResvOverRun = 0 min
    ResvProlog = (null)
    ReturnToService = 1
    SallocDefaultCommand = (null)
    SchedulerParameters = bf_max_job_test=100,partition_job_depth=100,bf_window=7200,bf_resolution=180,bf_continue,max_sched_time=4,preempt_strict_order
    SchedulerPort = 7321
    SchedulerRootFilter = 1
    SchedulerTimeSlice = 30 sec
    SchedulerType = sched/backfill
    SelectType = select/cons_res
    SelectTypeParameters = CR_CPU_MEMORY
    SlurmUser = root(0)
    SlurmctldDebug = info
    SlurmctldLogFile = /var/log/slurm/slurmctld.log
    SlurmSchedLogFile = (null)
    SlurmctldPort = 6816-6817
    SlurmctldTimeout = 300 sec
    SlurmdDebug = info
    SlurmdLogFile = /var/log/slurm/slurmd.log
    SlurmdPidFile = /var/run/slurmd.pid
    SlurmdPlugstack = (null)
    SlurmdPort = 6818
    SlurmdSpoolDir = /tmp/slurmd
    SlurmdTimeout = 300 sec
    SlurmdUser = root(0)
    SlurmSchedLogLevel = 0
    SlurmctldPidFile = /var/run/slurmctld.pid
    SlurmctldPlugstack = (null)
    SLURM_CONF = /etc/slurm/slurm.conf
    SLURM_VERSION = 13.12.0-0pre4
    SrunEpilog = (null)
    SrunProlog = (null)
    StateSaveLocation = /slurm.state
    SuspendExcNodes = (null)
    SuspendExcParts = (null)
    SuspendProgram = (null)
    SuspendRate = 60 nodes/min
    SuspendTime = NONE
    SuspendTimeout = 30 sec
    SwitchType = switch/none
    TaskEpilog = (null)
    TaskPlugin = task/cgroup
    TaskPluginParam = (null type)
    TaskProlog = (null)
    TmpFS = /tmp
    TopologyPlugin = topology/none
    TrackWCKey = 0
    TreeWidth = 50
    UsePam = 0
    UnkillableStepProgram = (null)
    UnkillableStepTimeout = 60 sec
    VSizeFactor = 0 percent
    WaitTime = 0 sec

Thanks.

-- 
Никоноров Всеволод Дмитриевич, ОИТТиС, НИКИЭТ
Vsevolod D. Nikonorov, JSC NIKIET
