[slurm-dev] Re: slurmctld.log

John Desantis Fri, 13 Feb 2015 05:51:52 -0800

Sefa,

"When I checked by sinfo, I see all the nodes are up, but teens of "error:
slurm_receive_msg: Zero Bytes were transmitted or received" lines are
printed into /var/log/slurm/slurmctld.log file every second."


I've seen this message before.  If I recall correctly, it points to a
misconfiguration.  I'd double-check that all of your nodes have the
same slurm.conf file.  I've also seen it after making a "significant"
change to our configuration and simply running "scontrol reconfigure".
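
A rough way to see how often the message is logged, for instance before
and after a configuration fix, is to tally matching lines per second.
The sketch below is Python; the function name errors_per_second and the
bracketed ISO timestamp at the start of each log line are assumptions,
so adjust the pattern to what your slurmctld.log actually contains.  It
does not identify the sending node, since the message text quoted above
carries no node name.

```python
import re
from collections import Counter

# Message text taken from the report above.
ERROR_TEXT = "slurm_receive_msg: Zero Bytes were transmitted or received"

# Assumed line layout: "[2015-02-12T15:08:39.123] error: ..."
# The real timestamp format depends on the Slurm version and settings.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def errors_per_second(lines):
    """Map each second (as a timestamp string) to the number of matching error lines."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            # Group by the timestamp truncated to whole seconds.
            counts[match.group(1)] += 1
    return counts
```

Passing an open file works as well as a list of strings, e.g.
errors_per_second(open("/var/log/slurm/slurmctld.log", errors="replace")).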

If I were in your shoes, I'd remove "NO_CONF_HASH" from DebugFlags,
turn up debugging verbosity, and double-check that all nodes in your
cluster have the same slurm.conf.  I would then restart all of the slurmd
daemons and then restart the slurmctld daemon on your controller(s).
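
One way to do the slurm.conf comparison is to hash each node's copy and
group the nodes by digest.  A minimal Python sketch, assuming the file
contents have already been collected into a dict keyed by node name (how
you fetch them is up to you); group_by_digest and odd_nodes_out are
hypothetical helper names, and SHA-256 is used here only as a convenient
checksum, not as the hash Slurm computes internally.

```python
import hashlib
from collections import defaultdict


def group_by_digest(configs):
    """Group node names by the SHA-256 digest of their slurm.conf contents.

    `configs` maps a node name to the raw bytes of that node's slurm.conf.
    """
    groups = defaultdict(list)
    for node, content in configs.items():
        groups[hashlib.sha256(content).hexdigest()].append(node)
    return {digest: sorted(nodes) for digest, nodes in groups.items()}


def odd_nodes_out(configs):
    """Return the nodes whose slurm.conf differs from the most common version."""
    groups = group_by_digest(configs)
    if len(groups) <= 1:
        return []  # every node has the same file (or no nodes were given)
    majority = max(groups.values(), key=len)
    return sorted(
        node
        for nodes in groups.values()
        if nodes is not majority
        for node in nodes
    )
```

An empty result from odd_nodes_out means every collected copy is
byte-identical; any names it returns are the nodes to look at first.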

John DeSantis


2015-02-13 2:25 GMT-05:00 Sefa Arslan:

>  Hello..
>
> I was able to solve the problem with job scheduling on the cuda partition.
> But I still see the "error: slurm_receive_msg: Zero Bytes were transmitted
> or received" messages even though all the nodes seem to be up, maybe due to
> some minor network problems. Is there any way to see which nodes these
> messages are printed for?
>
> thanks..
>
>
>  On 02/13/2015 02:19 AM, Bruce Roberts wrote:
>
>
> This might be related to 13.12 not being a production release.  You
> probably want 14.03 or 14.11 instead.
>
> I also noticed you have DebugFlags              = NO_CONF_HASH
>
> Quoting Danny...
>
> "The NO_CONF_HASH is very dangerous in most systems.  It should be avoided
> at all cost."
>
>
>
> On 02/12/2015 05:47 AM, Sefa Arslan wrote:
>
>
> Hi,
>
> We are using slurm-13.12.0.
> When I check with sinfo, all the nodes are up, but tens of "error:
> slurm_receive_msg: Zero Bytes were transmitted or received" lines are
> printed to /var/log/slurm/slurmctld.log every second. Is there a way
> to see which nodes these error lines are printed for?
>
> Another problem: although there are lots of idle nodes and no other
> pending jobs in our CUDA queue, starting a new job takes too much
> time, sometimes 10-20 minutes.
> For example,
> "srun -n 10 -N1 -p cuda hostname" starts in seconds, but
> "srun -n 10 -N1 --gres=gpu:2 -p cuda hostname" takes more than 20 minutes.
>
> gres.conf:
> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia1
>
> slurm.conf:
>
> NodeName=levrek[129-144] Procs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000 Gres=gpu:2
> PartitionName=cuda Nodes=levrek[129-144] Default=no MaxTime=15-00:00:00 defaulttime=00:02:00 State=UP DefMemPerCPU=10000 MaxMemPerNode=250000 Shared=NO Priority=1000
>
> Our config is:
> scontrol show config
> Configuration data as of 2015-02-12T15:08:39
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce= associations,limits
> AccountingStorageHost   = slurmcontroller3
> AccountingStorageLoc    = N/A
> AccountingStoragePort   = 6819
> AccountingStorageType   = accounting_storage/slurmdbd
> AccountingStorageUser   = N/A
> AccountingStoreJobComment = YES
> AcctGatherEnergyType    = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInfinibandType = acct_gather_infiniband/none
> AcctGatherNodeFreq      = 0 sec
> AcctGatherProfileType   = acct_gather_profile/none
> AuthType                = auth/munge
> BackupAddr              = (null)
> BackupController        = (null)
> BatchStartTimeout       = 10 sec
> BOOT_TIME               = 2015-02-12T15:08:27
> CacheGroups             = 1
> CheckpointType          = checkpoint/blcr
> ClusterName             = truba
> CompleteWait            = 0 sec
> ControlAddr             = slurmcontroller3
> ControlMachine          = slurmcontroller3
> CryptoType              = crypto/munge
> DebugFlags              = NO_CONF_HASH
> DefMemPerNode           = UNLIMITED
> DisableRootJobs         = NO
> DynAllocPort            = 0
> EnforcePartLimits       = YES
> Epilog                  = (null)
> EpilogMsgTime           = 2000 usec
> EpilogSlurmctld         = (null)
> ExtSensorsType          = ext_sensors/none
> ExtSensorsFreq          = 0 sec
> FairShareDampeningFactor = 1
> FastSchedule            = 1
> FirstJobId              = 100000
> GetEnvTimeout           = 2 sec
> GresTypes               = gpu
> GroupUpdateForce        = 0
> GroupUpdateTime         = 600 sec
> HASH_VAL                = Match
> HealthCheckInterval     = 0 sec
> HealthCheckNodeState    = ANY
> HealthCheckProgram      = (null)
> InactiveLimit           = 0 sec
> JobAcctGatherFrequency  = 30
> JobAcctGatherType       = jobacct_gather/linux
> JobAcctGatherParams     = (null)
> JobCheckpointDir        = /tmp/slurmcheckpoint
> JobCompHost             = localhost
> JobCompLoc              = /var/log/slurm/job_completions
> JobCompPort             = 0
> JobCompType             = jobcomp/filetxt
> JobCompUser             = root
> JobContainerPlugin      = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobFileAppend           = 0
> JobRequeue              = 1
> JobSubmitPlugins        = lua
> KeepAliveTime           = SYSTEM_DEFAULT
> KillOnBadExit           = 0
> KillWait                = 30 sec
> LaunchType              = launch/slurm
> Licenses                = (null)
> LicensesUsed            = (null)
> MailProg                = /etc/slurm/mail.slurm
> MaxArraySize            = 65000
> MaxJobCount             = 1000000
> MaxJobId                = 4294901760
> MaxMemPerNode           = UNLIMITED
> MaxStepCount            = 40000
> MaxTasksPerNode         = 128
> MessageTimeout          = 10 sec
> MinJobAge               = 300 sec
> MpiDefault              = none
> MpiParams               = (null)
> NEXT_JOB_ID             = 732449
> OverTimeLimit           = 0 min
> PluginDir               = /usr/lib64/slurm
> PlugStackConfig         = /etc/slurm/plugstack.conf
> PreemptMode             = GANG,SUSPEND
> PreemptType             = preempt/partition_prio
> PriorityDecayHalfLife   = 00:00:00
> PriorityCalcPeriod      = 00:05:00
> PriorityFavorSmall      = 1
> PriorityFlags           = 0
> PriorityMaxAge          = 14-00:00:00
> PriorityUsageResetPeriod = NONE
> PriorityType            = priority/multifactor
> PriorityWeightAge       = 1000
> PriorityWeightFairShare = 0
> PriorityWeightJobSize   = 1000
> PriorityWeightPartition = 1000
> PriorityWeightQOS       = 1000000
> PrivateData             = jobs
> ProctrackType           = proctrack/cgroup
> Prolog                  = (null)
> PrologSlurmctld         = (null)
> PropagatePrioProcess    = 0
> PropagateResourceLimits = (null)
> PropagateResourceLimitsExcept = MEMLOCK
> RebootProgram           = (null)
> ReconfigFlags           = (null)
> ResumeProgram           = (null)
> ResumeRate              = 300 nodes/min
> ResumeTimeout           = 300 sec
> ResvEpilog              = (null)
> ResvOverRun             = 0 min
> ResvProlog              = (null)
> ReturnToService         = 1
> SallocDefaultCommand    = (null)
> SchedulerParameters     = bf_max_job_test=100,partition_job_depth=100,bf_window=7200,bf_resolution=180,bf_continue,max_sched_time=4,preempt_strict_order
> SchedulerPort           = 7321
> SchedulerRootFilter     = 1
> SchedulerTimeSlice      = 30 sec
> SchedulerType           = sched/backfill
> SelectType              = select/cons_res
> SelectTypeParameters    = CR_CPU_MEMORY
> SlurmUser               = root(0)
> SlurmctldDebug          = info
> SlurmctldLogFile        = /var/log/slurm/slurmctld.log
> SlurmSchedLogFile       = (null)
> SlurmctldPort           = 6816-6817
> SlurmctldTimeout        = 300 sec
> SlurmdDebug             = info
> SlurmdLogFile           = /var/log/slurm/slurmd.log
> SlurmdPidFile           = /var/run/slurmd.pid
> SlurmdPlugstack         = (null)
> SlurmdPort              = 6818
> SlurmdSpoolDir          = /tmp/slurmd
> SlurmdTimeout           = 300 sec
> SlurmdUser              = root(0)
> SlurmSchedLogLevel      = 0
> SlurmctldPidFile        = /var/run/slurmctld.pid
> SlurmctldPlugstack      = (null)
> SLURM_CONF              = /etc/slurm/slurm.conf
> SLURM_VERSION           = 13.12.0-0pre4
> SrunEpilog              = (null)
> SrunProlog              = (null)
> StateSaveLocation       = /slurm.state
> SuspendExcNodes         = (null)
> SuspendExcParts         = (null)
> SuspendProgram          = (null)
> SuspendRate             = 60 nodes/min
> SuspendTime             = NONE
> SuspendTimeout          = 30 sec
> SwitchType              = switch/none
> TaskEpilog              = (null)
> TaskPlugin              = task/cgroup
> TaskPluginParam         = (null type)
> TaskProlog              = (null)
> TmpFS                   = /tmp
> TopologyPlugin          = topology/none
> TrackWCKey              = 0
> TreeWidth               = 50
> UsePam                  = 0
> UnkillableStepProgram   = (null)
> UnkillableStepTimeout   = 60 sec
> VSizeFactor             = 0 percent
> WaitTime                = 0 sec
>
>
>
> Thanks..
>
>
>
