Dear All,

I have seen solutions to similar problems in the mailing list archive, but I 
have not been able to fix this myself.

Here is my configuration file; the same file is used on all my nodes. As a side 
note, all my compute nodes are diskless. I do not think this contributes to the 
problem, and I cannot find any logs that point to it being an issue.

# grep -v -e "^$" -e "^#"  /etc/slurm/slurm.conf
ControlMachine=mfadmin01
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
Epilog=/etc/slurm/job_compScript.sh
MpiDefault=none
ProctrackType=proctrack/pgid
Prolog=/etc/slurm/job_compScript.sh
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
SrunEpilog=/etc/slurm/job_compScript.sh
SrunProlog=/etc/slurm/job_compScript.sh
StateSaveLocation=/tmp
SwitchType=switch/none
TaskEpilog=/etc/slurm/job_compScript.sh
TaskPlugin=task/none
TaskProlog=/etc/slurm/job_compScript.sh
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=300
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=NO
ClusterName=ManeFrame
JobCompLoc=/etc/slurm/job_compScript.sh
JobCompType=jobcomp/script
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=5
SlurmctldLogFile=/tmp/mySlurmctldLogFile
SlurmdDebug=5
SlurmdLogFile=/tmp/mySlurmdLogFile
SlurmSchedLogFile=/tmp/SchedLogFile
NodeName=mfadmin01 Sockets=2 CoresPerSocket=4 State=UNKNOWN 
NodeName=mfc[0001,0003,0005] Sockets=2 CoresPerSocket=4 State=UNKNOWN 
PartitionName=all Nodes=mfc[0001,0003,0005] Default=YES MaxTime=INFINITE State=UP
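
To confirm that every node really has an identical slurm.conf, a check along 
these lines could be run from the head node. This is only a sketch: it assumes 
root ssh access to the compute nodes and md5sum on each of them, and the helper 
names conf_hash and compare_nodes are made up for illustration.

```shell
# conf_hash FILE: print only the md5 hash of FILE
conf_hash() {
    md5sum < "$1" | cut -d' ' -f1
}

# compare_nodes FILE NODE...: compare each node's copy of FILE with the local copy
compare_nodes() {
    local file=$1 ref remote node
    shift
    ref=$(conf_hash "$file")
    for node in "$@"; do
        # assumption: ssh to the node works and md5sum is installed there
        remote=$(ssh "$node" "md5sum < '$file'" | cut -d' ' -f1)
        if [ "$remote" = "$ref" ]; then
            echo "$node: matches"
        else
            echo "$node: DIFFERS"
        fi
    done
}
```

It would be called as: compare_nodes /etc/slurm/slurm.conf mfc0001 mfc0003 mfc0005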


Here are the logs:
SLURMCTLD on headnode

# tail -f mySlurmctldLogFile
[2014-05-19T15:53:35.552] debug:  Priority BASIC plugin loaded
[2014-05-19T15:53:35.552] debug:  power_save module disabled, SuspendTime < 0
[2014-05-19T15:53:35.555] auth plugin for Munge 
(http://code.google.com/p/munge/) loaded
[2014-05-19T15:53:35.555] debug:  validate_node_specs: node mfadmin01 
registered with 0 jobs
[2014-05-19T15:53:38.552] debug:  Spawning registration agent for 
mfadmin01,mfc[0001,0003,0005] 4 hosts
[2014-05-19T15:53:38.553] SchedulingParameters: default_queue_depth=100 
max_rpc_cnt=0 max_sched_time=4 partition_job_depth=0
[2014-05-19T15:53:38.553] debug:  sched: Running job scheduler
[2014-05-19T15:53:43.561] agent/is_node_resp: node:mfc0003 rpc:1001 : 
Communication connection failure
[2014-05-19T15:53:43.561] agent/is_node_resp: node:mfc0005 rpc:1001 : 
Communication connection failure
[2014-05-19T15:53:43.561] agent/is_node_resp: node:mfc0001 rpc:1001 : 
Communication connection failure
[2014-05-19T15:53:44.554] error: Nodes mfc[0001,0003,0005] not responding
[2014-05-19T15:54:05.000] debug:  backfill: beginning
[2014-05-19T15:54:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T15:54:35.564] debug:  sched: Running job scheduler
[2014-05-19T15:55:17.786] debug:  validate_node_specs: node mfc0001 registered 
with 0 jobs
[2014-05-19T15:55:18.572] debug:  Spawning registration agent for 
mfc[0003,0005] 2 hosts
[2014-05-19T15:55:18.572] debug:  sched: Running job scheduler
[2014-05-19T15:55:22.573] error: Nodes mfc[0003,0005] not responding
[2014-05-19T15:55:35.000] debug:  backfill: beginning
[2014-05-19T15:55:35.000] debug:  backfill: no jobs to backfill
[2014-05-19T15:55:35.575] debug:  sched: Running job scheduler
[2014-05-19T15:56:35.587] debug:  sched: Running job scheduler
[2014-05-19T15:56:58.591] debug:  Spawning ping agent for mfadmin01,mfc0001
[2014-05-19T15:56:58.591] debug:  Spawning registration agent for 
mfc[0003,0005] 2 hosts
[2014-05-19T15:56:58.597] agent/is_node_resp: node:mfc0001 rpc:1008 : 
Communication connection failure
[2014-05-19T15:56:59.591] error: Nodes mfc0001 not responding
[2014-05-19T15:57:02.592] error: Nodes mfc[0003,0005] not responding
[2014-05-19T15:57:05.000] debug:  backfill: beginning
[2014-05-19T15:57:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T15:57:35.597] debug:  sched: Running job scheduler
[2014-05-19T15:58:35.608] debug:  sched: Running job scheduler
[2014-05-19T15:58:38.620] debug:  Spawning ping agent for mfc0001
[2014-05-19T15:58:38.620] debug:  Spawning registration agent for 
mfc[0003,0005] 2 hosts
[2014-05-19T15:58:39.620] error: Nodes mfc0001 not responding
[2014-05-19T15:58:42.620] error: Nodes mfc[0003,0005] not responding
[2014-05-19T15:59:35.628] debug:  sched: Running job scheduler
[2014-05-19T16:00:18.636] debug:  Spawning ping agent for mfadmin01,mfc0001
[2014-05-19T16:00:18.637] error: Nodes mfc[0003,0005] not responding, setting 
DOWN
[2014-05-19T16:00:19.637] error: Nodes mfc0001 not responding
[2014-05-19T16:00:19.638] node mfc0003 returned to service
[2014-05-19T16:00:20.637] debug:  sched: Running job scheduler
[2014-05-19T16:00:35.000] debug:  backfill: beginning
[2014-05-19T16:00:35.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:00:35.640] debug:  sched: Running job scheduler
[2014-05-19T16:01:35.651] debug:  sched: Running job scheduler
[2014-05-19T16:01:58.656] error: Nodes mfc0001 not responding, setting DOWN
[2014-05-19T16:02:05.000] debug:  backfill: beginning
[2014-05-19T16:02:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:02:35.663] debug:  sched: Running job scheduler
[2014-05-19T16:03:35.674] debug:  sched: Running job scheduler
[2014-05-19T16:03:38.678] debug:  Spawning ping agent for mfadmin01,mfc0003
[2014-05-19T16:03:38.681] agent/is_node_resp: node:mfc0003 rpc:1008 : 
Communication connection failure
[2014-05-19T16:03:39.678] error: Nodes mfc0003 not responding
[2014-05-19T16:04:05.000] debug:  backfill: beginning
[2014-05-19T16:04:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:04:35.685] debug:  sched: Running job scheduler
[2014-05-19T16:05:18.691] debug:  Spawning ping agent for mfc0003
[2014-05-19T16:05:19.692] error: Nodes mfc0003 not responding
[2014-05-19T16:05:35.695] debug:  sched: Running job scheduler
[2014-05-19T16:06:35.703] debug:  sched: Running job scheduler
[2014-05-19T16:06:58.706] debug:  Spawning ping agent for mfadmin01,mfc0003
[2014-05-19T16:06:59.706] error: Nodes mfc0003 not responding
[2014-05-19T16:07:05.000] debug:  backfill: beginning
[2014-05-19T16:07:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:07:35.697] debug:  sched: Running job scheduler
[2014-05-19T16:08:35.704] debug:  sched: Running job scheduler
[2014-05-19T16:08:38.707] error: Nodes mfc0003 not responding, setting DOWN
[2014-05-19T16:09:05.000] debug:  backfill: beginning
[2014-05-19T16:09:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:09:35.717] debug:  sched: Running job scheduler
[2014-05-19T16:10:18.725] debug:  Spawning ping agent for mfadmin01
[2014-05-19T16:10:35.000] debug:  backfill: beginning
[2014-05-19T16:10:35.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:10:35.728] debug:  sched: Running job scheduler
[2014-05-19T16:11:35.736] debug:  sched: Running job scheduler
[2014-05-19T16:12:35.747] debug:  sched: Running job scheduler
[2014-05-19T16:13:35.758] debug:  sched: Running job scheduler
[2014-05-19T16:13:38.762] debug:  Spawning ping agent for mfadmin01
[2014-05-19T16:14:05.000] debug:  backfill: beginning
[2014-05-19T16:14:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:14:35.772] debug:  sched: Running job scheduler
[2014-05-19T16:15:35.784] debug:  sched: Running job scheduler
[2014-05-19T16:16:35.796] debug:  sched: Running job scheduler
[2014-05-19T16:16:58.800] debug:  Spawning ping agent for mfadmin01
[2014-05-19T16:17:05.000] debug:  backfill: beginning
[2014-05-19T16:17:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:17:35.807] debug:  sched: Running job scheduler
[2014-05-19T16:18:35.818] debug:  sched: Running job scheduler
[2014-05-19T16:19:35.829] debug:  sched: Running job scheduler
[2014-05-19T16:20:18.838] debug:  Spawning ping agent for mfadmin01
[2014-05-19T16:20:35.000] debug:  backfill: beginning
[2014-05-19T16:20:35.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:20:35.841] debug:  sched: Running job scheduler
[2014-05-19T16:21:35.852] debug:  sched: Running job scheduler
[2014-05-19T16:22:35.861] debug:  sched: Running job scheduler
[2014-05-19T16:23:35.869] debug:  sched: Running job scheduler
[2014-05-19T16:23:38.873] debug:  Spawning ping agent for mfadmin01
[2014-05-19T16:24:05.000] debug:  backfill: beginning
[2014-05-19T16:24:05.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:24:35.884] debug:  sched: Running job scheduler
[2014-05-19T16:25:18.892] debug:  Spawning registration agent for 
mfadmin01,mfc[0001,0003,0005] 4 hosts
[2014-05-19T16:25:35.000] debug:  backfill: beginning
[2014-05-19T16:25:35.000] debug:  backfill: no jobs to backfill
[2014-05-19T16:25:35.895] debug:  sched: Running job scheduler
[2014-05-19T16:26:35.906] debug:  sched: Running job scheduler
[2014-05-19T16:27:35.917] debug:  sched: Running job scheduler

slurmd log on the head node (mfadmin01):

# cat /tmp/mySlurmdLogFile
[2014-05-19T15:53:35.546] topology NONE plugin loaded
[2014-05-19T15:53:35.546] CPU frequency setting not configured for this node
[2014-05-19T15:53:35.546] task NONE plugin loaded
[2014-05-19T15:53:35.546] auth plugin for Munge 
(http://code.google.com/p/munge/) loaded
[2014-05-19T15:53:35.546] debug:  spank: opening plugin stack 
/etc/slurm/plugstack.conf
[2014-05-19T15:53:35.546] Munge cryptographic signature plugin loaded
[2014-05-19T15:53:35.547] Warning: Core limit is only 0 KB
[2014-05-19T15:53:35.547] slurmd version 14.03.3-2 started
[2014-05-19T15:53:35.548] Job accounting gather NOT_INVOKED plugin loaded
[2014-05-19T15:53:35.548] debug:  job_container none plugin loaded
[2014-05-19T15:53:35.548] switch NONE plugin loaded
[2014-05-19T15:53:35.552] slurmd started on Mon, 19 May 2014 15:53:35 -0500
[2014-05-19T15:53:35.552] CPUs=8 Boards=1 Sockets=2 Cores=4 Threads=1 
Memory=24018 TmpDisk=458356 Uptime=1210532
[2014-05-19T15:53:35.552] AcctGatherEnergy NONE plugin loaded
[2014-05-19T15:53:35.552] AcctGatherProfile NONE plugin loaded
[2014-05-19T15:53:35.552] AcctGatherInfiniband NONE plugin loaded
[2014-05-19T15:53:35.552] AcctGatherFilesystem NONE plugin loaded



Here is the munge/unmunge output from the head node to a compute node:
# munge -n | ssh mfc0001 unmunge
root@mfc0001's password:
STATUS:           Success (0)
ENCODE_HOST:      mfadmin01.xxx.xxx (xxx.xxx.xx.11)
ENCODE_TIME:      2014-05-19 16:14:40 -0500 (1400534080)
DECODE_TIME:      2014-05-19 16:14:47 -0500 (1400534087)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0


My compute nodes use the head node for DNS lookups. Is this a problem? I have 
verified that DNS works fine from the compute node. The /etc/hosts file on my 
compute node contains only the localhost and localdomain entries.
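
Since the slurmctld log reports "Communication connection failure", name 
resolution and TCP reachability could be checked in both directions with 
something like the sketch below. The ports are the SlurmctldPort (6817) and 
SlurmdPort (6818) from my slurm.conf; the function names resolves, port_open 
and check are illustrative only, and the port test assumes bash with /dev/tcp 
support plus the coreutils timeout command.

```shell
# resolves HOST: succeed if HOST resolves through the system resolver (DNS or /etc/hosts)
resolves() {
    getent hosts "$1" > /dev/null
}

# port_open HOST PORT: succeed if a TCP connection opens within 3 seconds
# assumption: bash is built with /dev/tcp redirection support
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2> /dev/null
}

# check HOST PORT: print one status line for HOST:PORT
check() {
    local host=$1 port=$2
    if ! resolves "$host"; then
        echo "$host: does not resolve"
    elif port_open "$host" "$port"; then
        echo "$host:$port reachable"
    else
        echo "$host:$port NOT reachable"
    fi
}
```

On the head node this would be run as check mfc0001 6818 (and likewise for 
mfc0003 and mfc0005), and on each compute node as check mfadmin01 6817. A 
"NOT reachable" result with working name resolution would point toward a 
firewall or a slurmd that is not listening rather than DNS.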

Here is the slurmd log from the compute node:

[root@mfc0001 ~]# cat /tmp/mySlurmdLogFile
[2014-05-19T15:55:17.800] topology NONE plugin loaded
[2014-05-19T15:55:17.801] CPU frequency setting not configured for this node
[2014-05-19T15:55:17.808] task NONE plugin loaded
[2014-05-19T15:55:17.813] auth plugin for Munge 
(http://code.google.com/p/munge/) loaded
[2014-05-19T15:55:17.814] debug:  spank: opening plugin stack 
/etc/slurm/plugstack.conf
[2014-05-19T15:55:17.817] Munge cryptographic signature plugin loaded
[2014-05-19T15:55:17.823] Warning: Core limit is only 0 KB
[2014-05-19T15:55:17.823] slurmd version 14.03.3-2 started
[2014-05-19T15:55:17.829] Job accounting gather NOT_INVOKED plugin loaded
[2014-05-19T15:55:17.832] debug:  job_container none plugin loaded
[2014-05-19T15:55:17.840] switch NONE plugin loaded
[2014-05-19T15:55:17.841] slurmd started on Mon, 19 May 2014 15:55:17 -0500
[2014-05-19T15:55:17.842] CPUs=8 Boards=1 Sockets=2 Cores=4 Threads=1 
Memory=24145 TmpDisk=12072 Uptime=51
[2014-05-19T15:55:17.848] AcctGatherEnergy NONE plugin loaded
[2014-05-19T15:55:17.852] AcctGatherProfile NONE plugin loaded
[2014-05-19T15:55:17.855] AcctGatherInfiniband NONE plugin loaded
[2014-05-19T15:55:17.858] AcctGatherFilesystem NONE plugin loaded

Any insight into fixing this will be a great help for me to progress in my 
effort here. Please let me know if additional information would help.

Best Regards,
Amit



