Dear All, I have seen solutions to similar problems in the mailing list archive, but obviously I have not been successful in fixing this myself.
Here is my configuration file; the same file is used on all my nodes. As a side note, all my compute nodes are diskless. I do not think this contributes to the problem, and I cannot find any logs that point to it being an issue.

# grep -v -e "^$" -e "^#" /etc/slurm/slurm.conf
ControlMachine=mfadmin01
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
Epilog=/etc/slurm/job_compScript.sh
MpiDefault=none
ProctrackType=proctrack/pgid
Prolog=/etc/slurm/job_compScript.sh
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
SrunEpilog=/etc/slurm/job_compScript.sh
SrunProlog=/etc/slurm/job_compScript.sh
StateSaveLocation=/tmp
SwitchType=switch/none
TaskEpilog=/etc/slurm/job_compScript.sh
TaskPlugin=task/none
TaskProlog=/etc/slurm/job_compScript.sh
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=300
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=NO
ClusterName=ManeFrame
JobCompLoc=/etc/slurm/job_compScript.sh
JobCompType=jobcomp/script
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=5
SlurmctldLogFile=/tmp/mySlurmctldLogFile
SlurmdDebug=5
SlurmdLogFile=/tmp/mySlurmdLogFile
SlurmSchedLogFile=/tmp/SchedLogFile
NodeName=mfadmin01 Sockets=2 CoresPerSocket=4 State=UNKNOWN
NodeName=mfc[0001,0003,0005] Sockets=2 CoresPerSocket=4 State=UNKNOWN
PartitionName=all Nodes=mfc[0001,0003,0005] Default=YES MaxTime=INFINITE State=UP

Here are the logs.

slurmctld log on the head node:

# tail -f mySlurmctldLogFile
[2014-05-19T15:53:35.552] debug: Priority BASIC plugin loaded
[2014-05-19T15:53:35.552] debug: power_save module disabled, SuspendTime < 0
[2014-05-19T15:53:35.555] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-05-19T15:53:35.555] debug: validate_node_specs: node mfadmin01 registered with 0 jobs
[2014-05-19T15:53:38.552] debug: Spawning registration agent for mfadmin01,mfc[0001,0003,0005] 4 hosts
[2014-05-19T15:53:38.553] SchedulingParameters: default_queue_depth=100 max_rpc_cnt=0 max_sched_time=4 partition_job_depth=0
[2014-05-19T15:53:38.553] debug: sched: Running job scheduler
[2014-05-19T15:53:43.561] agent/is_node_resp: node:mfc0003 rpc:1001 : Communication connection failure
[2014-05-19T15:53:43.561] agent/is_node_resp: node:mfc0005 rpc:1001 : Communication connection failure
[2014-05-19T15:53:43.561] agent/is_node_resp: node:mfc0001 rpc:1001 : Communication connection failure
[2014-05-19T15:53:44.554] error: Nodes mfc[0001,0003,0005] not responding
[2014-05-19T15:54:05.000] debug: backfill: beginning
[2014-05-19T15:54:05.000] debug: backfill: no jobs to backfill
[2014-05-19T15:54:35.564] debug: sched: Running job scheduler
[2014-05-19T15:55:17.786] debug: validate_node_specs: node mfc0001 registered with 0 jobs
[2014-05-19T15:55:18.572] debug: Spawning registration agent for mfc[0003,0005] 2 hosts
[2014-05-19T15:55:18.572] debug: sched: Running job scheduler
[2014-05-19T15:55:22.573] error: Nodes mfc[0003,0005] not responding
[2014-05-19T15:55:35.000] debug: backfill: beginning
[2014-05-19T15:55:35.000] debug: backfill: no jobs to backfill
[2014-05-19T15:55:35.575] debug: sched: Running job scheduler
[2014-05-19T15:56:35.587] debug: sched: Running job scheduler
[2014-05-19T15:56:58.591] debug: Spawning ping agent for mfadmin01,mfc0001
[2014-05-19T15:56:58.591] debug: Spawning registration agent for mfc[0003,0005] 2 hosts
[2014-05-19T15:56:58.597] agent/is_node_resp: node:mfc0001 rpc:1008 : Communication connection failure
[2014-05-19T15:56:59.591] error: Nodes mfc0001 not responding
[2014-05-19T15:57:02.592] error: Nodes mfc[0003,0005] not responding
[2014-05-19T15:57:05.000] debug: backfill: beginning
[2014-05-19T15:57:05.000] debug: backfill: no jobs to backfill
[2014-05-19T15:57:35.597] debug: sched: Running job scheduler
[2014-05-19T15:58:35.608] debug: sched: Running job scheduler
[2014-05-19T15:58:38.620] debug: Spawning ping agent for mfc0001
[2014-05-19T15:58:38.620] debug: Spawning registration agent for mfc[0003,0005] 2 hosts
[2014-05-19T15:58:39.620] error: Nodes mfc0001 not responding
[2014-05-19T15:58:42.620] error: Nodes mfc[0003,0005] not responding
[2014-05-19T15:59:35.628] debug: sched: Running job scheduler
[2014-05-19T16:00:18.636] debug: Spawning ping agent for mfadmin01,mfc0001
[2014-05-19T16:00:18.637] error: Nodes mfc[0003,0005] not responding, setting DOWN
[2014-05-19T16:00:19.637] error: Nodes mfc0001 not responding
[2014-05-19T16:00:19.638] node mfc0003 returned to service
[2014-05-19T16:00:20.637] debug: sched: Running job scheduler
[2014-05-19T16:00:35.000] debug: backfill: beginning
[2014-05-19T16:00:35.000] debug: backfill: no jobs to backfill
[2014-05-19T16:00:35.640] debug: sched: Running job scheduler
[2014-05-19T16:01:35.651] debug: sched: Running job scheduler
[2014-05-19T16:01:58.656] error: Nodes mfc0001 not responding, setting DOWN
[2014-05-19T16:02:05.000] debug: backfill: beginning
[2014-05-19T16:02:05.000] debug: backfill: no jobs to backfill
[2014-05-19T16:02:35.663] debug: sched: Running job scheduler
[2014-05-19T16:03:35.674] debug: sched: Running job scheduler
[2014-05-19T16:03:38.678] debug: Spawning ping agent for mfadmin01,mfc0003
[2014-05-19T16:03:38.681] agent/is_node_resp: node:mfc0003 rpc:1008 : Communication connection failure
[2014-05-19T16:03:39.678] error: Nodes mfc0003 not responding
[2014-05-19T16:04:05.000] debug: backfill: beginning
[2014-05-19T16:04:05.000] debug: backfill: no jobs to backfill
[2014-05-19T16:04:35.685] debug: sched: Running job scheduler
[2014-05-19T16:05:18.691] debug: Spawning ping agent for mfc0003
[2014-05-19T16:05:19.692] error: Nodes mfc0003 not responding
[2014-05-19T16:05:35.695] debug: sched: Running job scheduler
[2014-05-19T16:06:35.703] debug: sched: Running job scheduler
[2014-05-19T16:06:58.706] debug: Spawning ping agent for mfadmin01,mfc0003
[2014-05-19T16:06:59.706] error: Nodes mfc0003 not responding
[2014-05-19T16:07:05.000] debug: backfill: beginning
[2014-05-19T16:07:05.000] debug: backfill: no jobs to backfill
[2014-05-19T16:07:35.697] debug: sched: Running job scheduler
[2014-05-19T16:08:35.704] debug: sched: Running job scheduler
[2014-05-19T16:08:38.707] error: Nodes mfc0003 not responding, setting DOWN
[2014-05-19T16:09:05.000] debug: backfill: beginning
[2014-05-19T16:09:05.000] debug: backfill: no jobs to backfill
[2014-05-19T16:09:35.717] debug: sched: Running job scheduler
[2014-05-19T16:10:18.725] debug: Spawning ping agent for mfadmin01
[2014-05-19T16:10:35.000] debug: backfill: beginning
[2014-05-19T16:10:35.000] debug: backfill: no jobs to backfill
[2014-05-19T16:10:35.728] debug: sched: Running job scheduler
[2014-05-19T16:11:35.736] debug: sched: Running job scheduler
[2014-05-19T16:12:35.747] debug: sched: Running job scheduler
[2014-05-19T16:13:35.758] debug: sched: Running job scheduler
[2014-05-19T16:13:38.762] debug: Spawning ping agent for mfadmin01
[2014-05-19T16:14:05.000] debug: backfill: beginning
[2014-05-19T16:14:05.000] debug: backfill: no jobs to backfill
[2014-05-19T16:14:35.772] debug: sched: Running job scheduler
[2014-05-19T16:15:35.784] debug: sched: Running job scheduler
[2014-05-19T16:16:35.796] debug: sched: Running job scheduler
[2014-05-19T16:16:58.800] debug: Spawning ping agent for mfadmin01
[2014-05-19T16:17:05.000] debug: backfill: beginning
[2014-05-19T16:17:05.000] debug: backfill: no jobs to backfill
[2014-05-19T16:17:35.807] debug: sched: Running job scheduler
[2014-05-19T16:18:35.818] debug: sched: Running job scheduler
[2014-05-19T16:19:35.829] debug: sched: Running job scheduler
[2014-05-19T16:20:18.838] debug: Spawning ping agent for mfadmin01
[2014-05-19T16:20:35.000] debug: backfill: beginning
[2014-05-19T16:20:35.000] debug: backfill: no jobs to backfill
[2014-05-19T16:20:35.841] debug: sched: Running job scheduler
[2014-05-19T16:21:35.852] debug: sched: Running job scheduler
[2014-05-19T16:22:35.861] debug: sched: Running job scheduler
[2014-05-19T16:23:35.869] debug: sched: Running job scheduler
[2014-05-19T16:23:38.873] debug: Spawning ping agent for mfadmin01
[2014-05-19T16:24:05.000] debug: backfill: beginning
[2014-05-19T16:24:05.000] debug: backfill: no jobs to backfill
[2014-05-19T16:24:35.884] debug: sched: Running job scheduler
[2014-05-19T16:25:18.892] debug: Spawning registration agent for mfadmin01,mfc[0001,0003,0005] 4 hosts
[2014-05-19T16:25:35.000] debug: backfill: beginning
[2014-05-19T16:25:35.000] debug: backfill: no jobs to backfill
[2014-05-19T16:25:35.895] debug: sched: Running job scheduler
[2014-05-19T16:26:35.906] debug: sched: Running job scheduler
[2014-05-19T16:27:35.917] debug: sched: Running job scheduler

slurmd log on the head node (mfadmin01):

# cat /tmp/mySlurmdLogFile
[2014-05-19T15:53:35.546] topology NONE plugin loaded
[2014-05-19T15:53:35.546] CPU frequency setting not configured for this node
[2014-05-19T15:53:35.546] task NONE plugin loaded
[2014-05-19T15:53:35.546] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-05-19T15:53:35.546] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2014-05-19T15:53:35.546] Munge cryptographic signature plugin loaded
[2014-05-19T15:53:35.547] Warning: Core limit is only 0 KB
[2014-05-19T15:53:35.547] slurmd version 14.03.3-2 started
[2014-05-19T15:53:35.548] Job accounting gather NOT_INVOKED plugin loaded
[2014-05-19T15:53:35.548] debug: job_container none plugin loaded
[2014-05-19T15:53:35.548] switch NONE plugin loaded
[2014-05-19T15:53:35.552] slurmd started on Mon, 19 May 2014 15:53:35 -0500
[2014-05-19T15:53:35.552] CPUs=8 Boards=1 Sockets=2 Cores=4 Threads=1 Memory=24018 TmpDisk=458356 Uptime=1210532
[2014-05-19T15:53:35.552] AcctGatherEnergy NONE plugin loaded
[2014-05-19T15:53:35.552] AcctGatherProfile NONE plugin loaded
[2014-05-19T15:53:35.552] AcctGatherInfiniband NONE plugin loaded
[2014-05-19T15:53:35.552] AcctGatherFilesystem NONE plugin loaded

Here is the munge/unmunge output from the head node to a compute node:

# munge -n | ssh mfc0001 unmunge
root@mfc0001's password:
STATUS: Success (0)
ENCODE_HOST: mfadmin01.xxx.xxx (xxx.xxx.xx.11)
ENCODE_TIME: 2014-05-19 16:14:40 -0500 (1400534080)
DECODE_TIME: 2014-05-19 16:14:47 -0500 (1400534087)
TTL: 300
CIPHER: aes128 (4)
MAC: sha1 (3)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0

My compute nodes use the head node for DNS lookups. Is this a problem? DNS works fine from the compute node; I have verified that. The /etc/hosts file on my compute nodes contains only the localhost and localdomain entries.

Here is the slurmd log from compute node mfc0001:

[root@mfc0001 ~]# cat /tmp/mySlurmdLogFile
[2014-05-19T15:55:17.800] topology NONE plugin loaded
[2014-05-19T15:55:17.801] CPU frequency setting not configured for this node
[2014-05-19T15:55:17.808] task NONE plugin loaded
[2014-05-19T15:55:17.813] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-05-19T15:55:17.814] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2014-05-19T15:55:17.817] Munge cryptographic signature plugin loaded
[2014-05-19T15:55:17.823] Warning: Core limit is only 0 KB
[2014-05-19T15:55:17.823] slurmd version 14.03.3-2 started
[2014-05-19T15:55:17.829] Job accounting gather NOT_INVOKED plugin loaded
[2014-05-19T15:55:17.832] debug: job_container none plugin loaded
[2014-05-19T15:55:17.840] switch NONE plugin loaded
[2014-05-19T15:55:17.841] slurmd started on Mon, 19 May 2014 15:55:17 -0500
[2014-05-19T15:55:17.842] CPUs=8 Boards=1 Sockets=2 Cores=4 Threads=1 Memory=24145 TmpDisk=12072 Uptime=51
[2014-05-19T15:55:17.848] AcctGatherEnergy NONE plugin loaded
[2014-05-19T15:55:17.852] AcctGatherProfile NONE plugin loaded
[2014-05-19T15:55:17.855] AcctGatherInfiniband NONE plugin loaded
[2014-05-19T15:55:17.858] AcctGatherFilesystem NONE plugin loaded

Any insight into fixing this would be a great help in moving my effort forward. Please let me know if additional information would help.

Best Regards,
Amit
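
P.S. For anyone who wants to repeat the reachability test from the head node, the following is a minimal sketch, not something taken from the logs above. It only checks that each compute node name resolves and that something accepts a TCP connection on the slurmd port. The port (6818) and the node names come from the slurm.conf shown earlier; the function name check_node and the 3 second timeout are illustrative assumptions.

```python
import socket

# Values copied from the slurm.conf in this post.
SLURMD_PORT = 6818
NODES = ["mfc0001", "mfc0003", "mfc0005"]


def check_node(host, port, timeout=3.0):
    """Resolve `host` and try a TCP connect to `port`.

    Returns (ok, detail). This is a hypothetical helper for illustration;
    it does not speak the Slurm protocol, it only tests reachability.
    """
    try:
        addr = socket.gethostbyname(host)
    except OSError as exc:
        return False, "name lookup failed: %s" % exc
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((addr, port), timeout=timeout):
            return True, "%s:%d reachable" % (addr, port)
    except OSError as exc:
        return False, "%s:%d not reachable: %s" % (addr, port, exc)


if __name__ == "__main__":
    for node in NODES:
        ok, detail = check_node(node, SLURMD_PORT)
        print("%s: %s (%s)" % (node, "OK" if ok else "FAIL", detail))
```

A FAIL here would at least be consistent with the "Communication connection failure" lines in the slurmctld log, though it would not by itself say whether the cause is name resolution, a firewall, or slurmd not listening.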
