Hi Said,

Could it be that you have enabled the firewall on the compute nodes? Slurm's communication ports must be reachable between the controller and the compute nodes, so the firewall must be turned off on the compute nodes (this requirement isn't documented anywhere).
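On a CentOS/RHEL-style node this can be checked as follows (a sketch, assuming firewalld is the firewall in use; other distributions use ufw or raw iptables instead):

```shell
# Check whether firewalld is running on the compute node
systemctl status firewalld

# Stop and disable it so slurmctld can reach slurmd
# (by default slurmd listens on port 6818)
systemctl stop firewalld
systemctl disable firewalld
```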

You may want to go through my Slurm deployment Wiki at https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see if anything obvious is missing in your configuration.

Best regards,
Ole

On 07/05/2017 11:17 AM, Said Mohamed Said wrote:
Dear Sir/Madam


I am configuring Slurm for academic use at my university, but I have encountered the following problem, for which I could not find a solution on the Internet.


I followed all the troubleshooting suggestions from your website, with no luck.


Whenever I start the slurmd daemon on one of the compute nodes, the node comes up in the IDLE state but goes DOWN after about 4 minutes with Reason=Node not responding.

I am using Slurm version 17.02 on both nodes.
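The node state and the recorded reason can be inspected from the controller (a sketch using standard Slurm commands; OBU-N6 is the node name taken from the logs below):

```shell
# Show the full state of the node, including the Reason field
scontrol show node OBU-N6

# After fixing the underlying issue, return the node to service
scontrol update NodeName=OBU-N6 State=RESUME
```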


tail /var/log/slurmd.log on the faulty node gives:


***********************************************************************************************************************************************************

[2017-07-05T16:56:55.118] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:56:55.120] slurmd version 17.02.2 started
[2017-07-05T16:56:55.121] slurmd started on Wed, 05 Jul 2017 16:56:55 +0900
[2017-07-05T16:56:55.121] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169125 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2017-07-05T16:59:20.513] Slurmd shutdown completing
[2017-07-05T16:59:20.548] Message aggregation disabled
[2017-07-05T16:59:20.549] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:59:20.552] slurmd version 17.02.2 started
[2017-07-05T16:59:20.552] slurmd started on Wed, 05 Jul 2017 16:59:20 +0900
[2017-07-05T16:59:20.553] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169270 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

***********************************************************************************************************************************************************



tail /var/log/slurmctld.log on the controller node gives:

************************************************************************************

[2017-07-05T17:54:56.422] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-07-05T17:55:09.004] Node OBU-N6 now responding
[2017-07-05T17:55:09.004] node OBU-N6 returned to service
[2017-07-05T17:59:52.677] error: Nodes OBU-N6 not responding
[2017-07-05T18:03:15.857] error: Nodes OBU-N6 not responding, setting DOWN

************************************************************************************


The following is the content of my slurm.conf file:


**************************************************************************

#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#

# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/linear
TreeWidth=50
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/mysql
#AccountingStorageType=accounting_storage/filetxt
#JobCompType=jobcomp/filetxt
#AccountingStorageLoc=/var/log/slurm/accounting
#JobCompLoc=/var/log/slurm/job_completions
ClusterName=obu
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=OBU-N5 NodeAddr=10.251.17.170 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=OBU-N6 NodeAddr=10.251.17.171 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=slurm-partition Nodes=OBU-N[5-6] Default=YES MaxTime=INFINITE State=UP

**************************************************************************
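The NodeName definitions above can be cross-checked against the hardware that slurmd actually detects on each node (a sketch; `slurmd -C` prints the configuration line as Slurm sees it):

```shell
# Print the NodeName line that slurmd would report for this host;
# compare it field by field with the entry in slurm.conf
slurmd -C
```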



I can ssh successfully between the nodes, and the munge daemon runs on each machine.
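Munge authentication across the nodes can be verified end-to-end (a sketch using the standard munge/unmunge round trip, assuming passwordless ssh between the machines):

```shell
# Create a credential locally and decode it on the other node;
# a "STATUS: Success" line means the munge keys match
munge -n | ssh OBU-N6 unmunge
```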


Your help would be greatly appreciated.


Sincerely,


Said.



--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: [email protected]
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620
