Hello,
I was able to solve the job scheduling problem on the cuda partition.
But I still see the "error: slurm_receive_msg: Zero Bytes were
transmitted or received" messages even though all the nodes seem to be up,
maybe due to some minor network problems. Is there any way to see
which nodes these messages are printed for?
Thanks.
On 02/13/2015 02:19 AM, Bruce Roberts wrote:
This might be related to 13.12 not being a production release. You
probably want 14.03 or 14.11 instead.
I also noticed you have DebugFlags = NO_CONF_HASH
Quoting Danny...
"The NO_CONF_HASH is very dangerous in most systems. It should be
avoided at all cost."
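Both points above can be checked mechanically from the "scontrol show config" output quoted below. What follows is only a minimal sketch in Python, assuming the output keeps its "Key = Value" form; parse_config and config_warnings are hypothetical helper names, not part of Slurm, and treating "pre" in the version string as a pre-release marker is an assumption based on the "13.12.0-0pre4" value in this thread.

```python
def parse_config(text):
    """Parse 'Key = Value' lines (as printed by `scontrol show config`) into a dict.

    Assumption: one setting per line. A value wrapped onto the next line by a
    mail client is not rejoined by this sketch.
    """
    config = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip headers such as "Configuration data as of ..."
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def config_warnings(config):
    """Return warnings for the two issues raised in this thread."""
    warnings = []
    # DebugFlags is a comma-separated list of flag names.
    flags = [f.strip() for f in config.get("DebugFlags", "").split(",")]
    if "NO_CONF_HASH" in flags:
        warnings.append("DebugFlags contains NO_CONF_HASH")
    # Assumption: pre-release builds carry "pre" in SLURM_VERSION.
    version = config.get("SLURM_VERSION", "")
    if "pre" in version:
        warnings.append("SLURM_VERSION %s looks like a pre-release" % version)
    return warnings


if __name__ == "__main__":
    sample = "DebugFlags = NO_CONF_HASH\nSLURM_VERSION = 13.12.0-0pre4\n"
    for warning in config_warnings(parse_config(sample)):
        print(warning)
```

To try it against a live controller, the output of "scontrol show config" could be saved to a file and passed to parse_config as a string.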
On 02/12/2015 05:47 AM, Sefa Arslan wrote:
Hi,
We are using slurm-13.12.0.
When I check with sinfo, all the nodes appear to be up, but tens of
"error: slurm_receive_msg: Zero Bytes were transmitted or received"
lines are written to /var/log/slurm/slurmctld.log every
second. Is there a way to see which node these error lines are
printed for?
Another problem: although there are lots of idle nodes and no other
pending jobs in our CUDA-enabled queue, starting a new job takes
too much time, sometimes 10-20 minutes.
For example,
"srun -n 10 -N1 -p cuda hostname" starts within seconds, but
"srun -n 10 -N1 --gres=gpu:2 -p cuda hostname" takes more than 20 minutes.
gres.conf:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
slurm.conf:
NodeName=levrek[129-144] Procs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=256000 Gres=gpu:2
PartitionName=cuda Nodes=levrek[129-144] Default=no MaxTime=15-00:00:00 defaulttime=00:02:00 State=UP DefMemPerCPU=10000 MaxMemPerNode=250000 Shared=NO Priority=1000
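As a sanity check on the two snippets above, the number of gpu device files listed in gres.conf can be compared with the Gres=gpu:N count declared on the NodeName line. The following is only a minimal sketch in Python, assuming one device file per gres.conf line and exact-case keys as written above; count_gres_devices and declared_gres_count are hypothetical helper names, not Slurm functions. It only checks that the two files agree and does not by itself explain the scheduling delay.

```python
import re


def count_gres_devices(gres_conf_text, name="gpu"):
    """Count gres.conf lines that define one device file for the given GRES name."""
    count = 0
    for line in gres_conf_text.splitlines():
        # Split "Name=gpu File=/dev/nvidia0" into {"Name": "gpu", "File": "/dev/nvidia0"}.
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        if fields.get("Name") == name and "File" in fields:
            count += 1
    return count


def declared_gres_count(node_line, name="gpu"):
    """Return N from 'Gres=<name>:N' on a slurm.conf NodeName line, or 0 if absent."""
    match = re.search(r"Gres=%s:(\d+)" % re.escape(name), node_line)
    return int(match.group(1)) if match else 0


if __name__ == "__main__":
    gres_conf = "Name=gpu File=/dev/nvidia0\nName=gpu File=/dev/nvidia1\n"
    node_line = ("NodeName=levrek[129-144] Procs=24 Sockets=2 CoresPerSocket=12 "
                 "ThreadsPerCore=2 RealMemory=256000 Gres=gpu:2")
    devices = count_gres_devices(gres_conf)
    declared = declared_gres_count(node_line)
    print("gres.conf devices:", devices, "slurm.conf declared:", declared)
```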
Our config is:
scontrol show config
Configuration data as of 2015-02-12T15:08:39
AccountingStorageBackupHost = (null)
AccountingStorageEnforce= associations,limits
AccountingStorageHost = slurmcontroller3
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = YES
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2015-02-12T15:08:27
CacheGroups = 1
CheckpointType = checkpoint/blcr
ClusterName = truba
CompleteWait = 0 sec
ControlAddr = slurmcontroller3
ControlMachine = slurmcontroller3
CryptoType = crypto/munge
DebugFlags = NO_CONF_HASH
DefMemPerNode = UNLIMITED
DisableRootJobs = NO
DynAllocPort = 0
EnforcePartLimits = YES
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FirstJobId = 100000
GetEnvTimeout = 2 sec
GresTypes = gpu
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /tmp/slurmcheckpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm/job_completions
JobCompPort = 0
JobCompType = jobcomp/filetxt
JobCompUser = root
JobContainerPlugin = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 30 sec
LaunchType = launch/slurm
Licenses = (null)
LicensesUsed = (null)
MailProg = /etc/slurm/mail.slurm
MaxArraySize = 65000
MaxJobCount = 1000000
MaxJobId = 4294901760
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 732449
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
PriorityDecayHalfLife = 00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = 1
PriorityFlags = 0
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 1000
PriorityWeightFairShare = 0
PriorityWeightJobSize = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS = 1000000
PrivateData = jobs
ProctrackType = proctrack/cgroup
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = MEMLOCK
RebootProgram = (null)
ReconfigFlags = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 300 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
SallocDefaultCommand = (null)
SchedulerParameters = bf_max_job_test=100,partition_job_depth=100,bf_window=7200,bf_resolution=180,bf_continue,max_sched_time=4,preempt_strict_order
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
SlurmUser = root(0)
SlurmctldDebug = info
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmSchedLogFile = (null)
SlurmctldPort = 6816-6817
SlurmctldTimeout = 300 sec
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 13.12.0-0pre4
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /slurm.state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
Thanks.