Hi Said,
Could it be that you have enabled the firewall on the compute nodes?
The firewall must be turned off (this requirement isn't documented
anywhere).
You may want to go through my Slurm deployment Wiki at
https://wiki.fysik.dtu.dk/niflheim/Niflheim7_Getting_started to see if
anything obvious is missing in your configuration.
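As a quick check, assuming a CentOS/RHEL-style compute node running firewalld (adjust for your distribution), you could either stop the firewall on the internal cluster network or open the Slurm ports:

```shell
# Check whether firewalld is active on the compute node:
systemctl status firewalld

# Option 1: disable the firewall entirely (common on a private cluster network):
systemctl stop firewalld
systemctl disable firewalld

# Option 2: keep the firewall but open the default Slurm ports
# (6817 for slurmctld, 6818 for slurmd):
firewall-cmd --permanent --add-port=6817-6818/tcp
firewall-cmd --reload
```

If srun jobs later hang, note that srun also uses dynamic ports, so on compute nodes disabling the firewall is the simpler option.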
Best regards,
Ole
On 07/05/2017 11:17 AM, Said Mohamed Said wrote:
Dear Sir/Madam,
I am configuring Slurm for academic use at my university, but I have
encountered the following problem, for which I could not find a
solution on the Internet.
I followed all the troubleshooting suggestions from your website with no luck.
Whenever I start the slurmd daemon on one of the compute nodes, the
node starts in the IDLE state but goes DOWN after about 4 minutes with
Reason=Node not responding.
I am using slurm version 17.02 on both nodes.
tail /var/log/slurmd.log on the faulty node gives:
**************************************************************************
[2017-07-05T16:56:55.118] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:56:55.120] slurmd version 17.02.2 started
[2017-07-05T16:56:55.121] slurmd started on Wed, 05 Jul 2017 16:56:55 +0900
[2017-07-05T16:56:55.121] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169125 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2017-07-05T16:59:20.513] Slurmd shutdown completing
[2017-07-05T16:59:20.548] Message aggregation disabled
[2017-07-05T16:59:20.549] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:59:20.552] slurmd version 17.02.2 started
[2017-07-05T16:59:20.552] slurmd started on Wed, 05 Jul 2017 16:59:20 +0900
[2017-07-05T16:59:20.553] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169270 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
**************************************************************************
tail /var/log/slurmctld.log on the controller node gives:
************************************************************************************
[2017-07-05T17:54:56.422] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-07-05T17:55:09.004] Node OBU-N6 now responding
[2017-07-05T17:55:09.004] node OBU-N6 returned to service
[2017-07-05T17:59:52.677] error: Nodes OBU-N6 not responding
[2017-07-05T18:03:15.857] error: Nodes OBU-N6 not responding, setting DOWN
************************************************************************************
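The "not responding" errors suggest the controller cannot reach slurmd on the node (or the reply path is blocked). One way to probe connectivity is a small bash helper using /dev/tcp; this is a sketch assuming the default ports (6817 for slurmctld, 6818 for slurmd, since SlurmctldPort/SlurmdPort are commented out in the configuration), with the node address taken from the NodeAddr entries:

```shell
# Probe a TCP port using bash's built-in /dev/tcp; prints "open" or "closed".
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
  fi
}

# From the controller, test slurmd's default port on the compute node:
check_port 10.251.17.171 6818
```

The same check should also be run from the compute node against the controller's port 6817, since slurmd must be able to reach slurmctld as well.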
The following is the content of my slurm.conf file:
**************************************************************************
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/linear
TreeWidth=50
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/mysql
#AccountingStorageType=accounting_storage/filetxt
#JobCompType=jobcomp/filetxt
#AccountingStorageLoc=/var/log/slurm/accounting
#JobCompLoc=/var/log/slurm/job_completions
ClusterName=obu
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=OBU-N5 NodeAddr=10.251.17.170 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=OBU-N6 NodeAddr=10.251.17.171 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=slurm-partition Nodes=OBU-N[5-6] Default=YES MaxTime=INFINITE State=UP
**************************************************************************
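For reference, once the underlying cause is fixed, the node state can be inspected and reset with standard Slurm commands (a sketch; the node name is taken from the configuration above):

```shell
# Show the recorded reason for nodes being DOWN or DRAINED:
sinfo -R

# Inspect one node's full state as seen by slurmctld:
scontrol show node OBU-N6

# After fixing the cause (e.g. the firewall), return the node to service:
scontrol update NodeName=OBU-N6 State=RESUME
```

With ReturnToService=1 set as above, a DOWN node that starts responding again with a valid configuration is also returned to service automatically.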
I can ssh successfully from each node, and the munge daemon is running on each machine.
Your help will be greatly appreciated,
Sincerely,
Said.
--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: [email protected]
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620