Hi, thanks for your replies.
As you suggested, I set ReturnToService to 2. After restarting Slurm,
one node came back, but the other two are still down. Their status is still
Reason=Node unexpectedly rebooted [slurm@2012-11-04T22:05:38].
I ran slurmd -Dvvvv on one of the problematic nodes, and it gives errors like this:
----------------------
slurmd: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received
slurmd: error: Unable to register: Zero Bytes were transmitted or received
slurmd: debug: Unable to register with slurm controller, retrying
----------------------
It looks like a communication problem. But on that node, munge works (munge
-n | unmunge returns success), and ssh also works.
I really don't know where the problem is.
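Since munge and ssh both work, one further check is whether the Slurm ports themselves are reachable. The following is only a minimal sketch in Python: the helper name can_connect is hypothetical, and the address and ports in the comments are taken from the slurm.conf in this message (ControlAddr, SlurmctldPort, SlurmdPort). It tests plain TCP reachability only; it does not tell you whether slurmctld accepts the registration message.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it does not speak the Slurm
    protocol, it only checks that something accepts connections there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example calls, using the values from the slurm.conf in this message:
#   on a compute node:  can_connect("192.168.1.100", 6817)  # slurmctld
#   on the controller:  can_connect("192.168.1.101", 6818)  # slurmd on node-1
```

If the call returns False from a down node but True from the node that came back, that would point toward a network or firewall difference between the nodes rather than toward munge.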
The version of Slurm is 2.4.3.
The config file is as follows (I removed most of the comment lines):
-----------------------
ControlMachine=sirius
ControlAddr=192.168.1.100
AuthType=auth/munge
CacheGroups=1
CryptoType=crypto/munge
Epilog=/home/Software/etc/slurm.epilog.clean
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
NodeName=node-[1-3] NodeAddr=192.168.1.[101-103] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=datasys Nodes=node-[1-3] Default=YES MaxTime=INFINITE State=UP
----------------------
On 11/05/2012 01:33 PM, Andy Riebs wrote:
Tony,
For slurm problems, it's generally very important to list
* SLURM version
* slurm.conf
In this case, it looks like you probably want to set "ReturnToService"
in slurm.conf.
Happy Slurming!
Andy
On 11/05/2012 01:38 PM, Tony wrote:
Hello,
I'm a PhD student in the CS department at Illinois Institute of Technology.
Recently I've been trying to use Slurm on my virtual cluster, which has 92
nodes. I successfully installed Munge and Slurm on all nodes, and
everything seemed fine. But after a system reboot Slurm stopped working.
sinfo shows all nodes are down. scontrol show nodes gives info like this:
NodeName=node-1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
Gres=(null)
NodeAddr=192.168.1.101 NodeHostName=node-1
OS=Linux RealMemory=1 Sockets=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
BootTime=2012-11-04T22:05:09 SlurmdStartTime=2012-11-05T06:49:45
Reason=Node unexpectedly rebooted [slurm@2012-11-04T21:17:27]
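With many nodes, it can help to condense output like the above into one line per down node. The following is a rough sketch in Python, not part of Slurm: the function names parse_nodes and down_reasons are hypothetical, and it assumes the KEY=VALUE layout shown above, with each record starting at NodeName=. A value that itself contains something like " word=" would be split incorrectly.

```python
import re

# KEY=VALUE, where VALUE runs until the next whitespace-separated KEY=
# or the end of the text. This lets Reason= keep its embedded spaces.
_FIELD = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|\s*\Z)", re.S)


def parse_nodes(scontrol_output: str) -> list:
    """Split 'scontrol show nodes' style text into one dict per node."""
    nodes = []
    for match in _FIELD.finditer(scontrol_output):
        key, value = match.group(1), match.group(2)
        if key == "NodeName":
            nodes.append({})  # a new record starts at every NodeName=
        if nodes:
            nodes[-1][key] = value
    return nodes


def down_reasons(scontrol_output: str) -> list:
    """Return (node name, reason) pairs for nodes whose State contains DOWN."""
    return [
        (node["NodeName"], node.get("Reason", ""))
        for node in parse_nodes(scontrol_output)
        if "DOWN" in node.get("State", "")
    ]
```

The text passed in would be the captured output of scontrol show nodes, for example read from a file or a pipe.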
I googled the reason but didn't find any useful info. I grepped the source
code and found that it happens when node_ptr->reason in
src/slurmctld/node_mgr.c is false, which means no reason was set?
Could you do me a favour and have a look at this problem?
Thanks a lot,
-Tony
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP