Hi Said,
Could it be that you have enabled the firewall on the compute nodes?
The firewall must be turned off (this requirement isn't documented
anywhere).
You may want to go through my Slurm deployment Wiki at
https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see if
anything obvious is missing in your configuration.
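As a quick check, assuming a CentOS/RHEL-style compute node running firewalld (adjust for your distribution), you could either stop the firewall on the internal cluster network or open the Slurm ports:

```shell
# Check whether firewalld is active on the compute node:
systemctl status firewalld

# Option 1: disable the firewall entirely (common on a private cluster network):
systemctl stop firewalld
systemctl disable firewalld

# Option 2: keep the firewall but open the default Slurm ports
# (6817 for slurmctld, 6818 for slurmd):
firewall-cmd --permanent --add-port=6817-6818/tcp
firewall-cmd --reload
```

If srun jobs later hang, note that srun also uses dynamic ports, so on compute nodes disabling the firewall is the simpler option.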
Best regards,
Ole
On 07/05/2017 11:17 AM, Said Mohamed Said wrote:
Dear Sir/Madam,
I am configuring Slurm for academic use at my university, but I have
encountered the following problem, for which I could not find a
solution on the Internet.
I followed all the troubleshooting suggestions from your website with no luck.
Whenever I start the slurmd daemon on one of the compute nodes, the
node starts in the IDLE state but goes DOWN after about 4 minutes with
Reason=Node not responding.
I am using slurm version 17.02 on both nodes.
tail /var/log/slurmd.log on the faulty node gives:
**************************************************************************
[2017-07-05T16:56:55.118] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:56:55.120] slurmd version 17.02.2 started
[2017-07-05T16:56:55.121] slurmd started on Wed, 05 Jul 2017 16:56:55 +0900
[2017-07-05T16:56:55.121] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169125 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2017-07-05T16:59:20.513] Slurmd shutdown completing
[2017-07-05T16:59:20.548] Message aggregation disabled
[2017-07-05T16:59:20.549] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:59:20.552] slurmd version 17.02.2 started
[2017-07-05T16:59:20.552] slurmd started on Wed, 05 Jul 2017 16:59:20 +0900
[2017-07-05T16:59:20.553] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169270 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
**************************************************************************
tail /var/log/slurmctld.log on the controller node gives:
************************************************************************************
[2017-07-05T17:54:56.422] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-07-05T17:55:09.004] Node OBU-N6 now responding
[2017-07-05T17:55:09.004] node OBU-N6 returned to service
[2017-07-05T17:59:52.677] error: Nodes OBU-N6 not responding
[2017-07-05T18:03:15.857] error: Nodes OBU-N6 not responding, setting DOWN
************************************************************************************
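The "not responding" errors suggest the controller cannot reach slurmd on the node (or the reply path is blocked). One way to probe connectivity is a small bash helper using /dev/tcp; this is a sketch assuming the default ports (6817 for slurmctld, 6818 for slurmd, since SlurmctldPort/SlurmdPort are commented out in the configuration), with the node address taken from the NodeAddr entries:

```shell
# Probe a TCP port using bash's built-in /dev/tcp; prints "open" or "closed".
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
  fi
}

# From the controller, test slurmd's default port on the compute node:
check_port 10.251.17.171 6818
```

The same check should also be run from the compute node against the controller's port 6817, since slurmd must be able to reach slurmctld as well.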
The following is the content of my slurm.conf file:
**************************************************************************
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/linear
TreeWidth=50
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/mysql
#AccountingStorageType=accounting_storage/filetxt
#JobCompType=jobcomp/filetxt
#AccountingStorageLoc=/var/log/slurm/accounting
#JobCompLoc=/var/log/slurm/job_completions
ClusterName=obu
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=OBU-N5 NodeAddr=10.251.17.170 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=OBU-N6 NodeAddr=10.251.17.171 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=slurm-partition Nodes=OBU-N[5-6] Default=YES MaxTime=INFINITE State=UP
**************************************************************************
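For reference, once the underlying cause is fixed, the node state can be inspected and reset with standard Slurm commands (a sketch; the node name is taken from the configuration above):

```shell
# Show the recorded reason for nodes being DOWN or DRAINED:
sinfo -R

# Inspect one node's full state as seen by slurmctld:
scontrol show node OBU-N6

# After fixing the cause (e.g. the firewall), return the node to service:
scontrol update NodeName=OBU-N6 State=RESUME
```

With ReturnToService=1 set as above, a DOWN node that starts responding again with a valid configuration is also returned to service automatically.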
I can ssh successfully from each node, and the munge daemon is running on each machine.
Your help will be greatly appreciated,
Sincerely,
Said.
--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: [email protected]
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620