Dear Sir/Madam,
I am configuring Slurm for academic use at my university, but I have run into the following problem, for which I could not find a solution on the Internet. I followed all of the troubleshooting suggestions on your website with no luck. Whenever I start the slurmd daemon on one of the compute nodes, the node comes up in the IDLE state but goes DOWN after about 4 minutes with Reason=Node not responding. I am using Slurm version 17.02 on both nodes.

tail /var/log/slurmd.log on the faulty node gives:

**************************************************************************
[2017-07-05T16:56:55.118] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:56:55.120] slurmd version 17.02.2 started
[2017-07-05T16:56:55.121] slurmd started on Wed, 05 Jul 2017 16:56:55 +0900
[2017-07-05T16:56:55.121] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169125 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2017-07-05T16:59:20.513] Slurmd shutdown completing
[2017-07-05T16:59:20.548] Message aggregation disabled
[2017-07-05T16:59:20.549] Resource spec: Reserved system memory limit not configured for this node
[2017-07-05T16:59:20.552] slurmd version 17.02.2 started
[2017-07-05T16:59:20.552] slurmd started on Wed, 05 Jul 2017 16:59:20 +0900
[2017-07-05T16:59:20.553] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=128661 TmpDisk=262012 Uptime=169270 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
**************************************************************************

tail /var/log/slurmctld.log on the controller node gives:

**************************************************************************
[2017-07-05T17:54:56.422] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=0
[2017-07-05T17:55:09.004] Node OBU-N6 now responding
[2017-07-05T17:55:09.004] node OBU-N6 returned to service
[2017-07-05T17:59:52.677] error: Nodes OBU-N6 not responding
[2017-07-05T18:03:15.857] error: Nodes OBU-N6 not responding, setting DOWN
**************************************************************************

The following is the content of my slurm.conf file:

**************************************************************************
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
SelectType=select/linear
TreeWidth=50
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageType=accounting_storage/mysql
#AccountingStorageType=accounting_storage/filetxt
#JobCompType=jobcomp/filetxt
#AccountingStorageLoc=/var/log/slurm/accounting
#JobCompLoc=/var/log/slurm/job_completions
ClusterName=obu
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=OBU-N5 NodeAddr=10.251.17.170 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=OBU-N6 NodeAddr=10.251.17.171 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=slurm-partition Nodes=OBU-N[5-6] Default=YES MaxTime=INFINITE State=UP
**************************************************************************

I can ssh successfully between the nodes, and the munge daemon is running on each machine. Your help would be greatly appreciated.

Sincerely,
Said
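P.S. In case it helps, these are roughly the checks I have been running from the controller. The host names and address follow the node definitions above, and the test against port 6818 assumes the default SlurmdPort (mine is commented out in slurm.conf):

```shell
# "Node not responding" usually means slurmctld cannot open a connection
# back to slurmd, so first test whether the slurmd port on the faulty
# node is reachable (6818 is the default SlurmdPort):
nc -zv 10.251.17.171 6818

# Confirm that munge credentials decode across machines:
munge -n | ssh OBU-N6 unmunge

# Munge rejects credentials when the clocks drift too far apart,
# so compare the clocks as well:
date && ssh OBU-N6 date

# Once the cause is fixed, return the node to service:
scontrol update NodeName=OBU-N6 State=RESUME
scontrol show node OBU-N6
```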
