[slurm-dev] Re: Slurm not working: Reason=Node unexpectedly rebooted

Tony Mon, 05 Nov 2012 12:49:10 -0800

Hi, thanks for your replies.

As you suggested, I set ReturnToService to 2. After restarting Slurm, one node came back, but the other two are still down. Their status is still Reason=Node unexpectedly rebooted [slurm@2012-11-04T22:05:38].


I ran slurmd -Dvvvv on one of the problematic nodes, and it gives errors like this:
----------------------
slurmd: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received
slurmd: error: Unable to register: Zero Bytes were transmitted or received
slurmd: debug:  Unable to register with slurm controller, retrying
----------------------

It looks like a communication problem. But on that node munge works (munge -n | unmunge returns success), and ssh also works.

I really don't know where the problem is.

The version of Slurm is 2.4.3.
The config file follows (I removed most of the comment lines):
-----------------------
ControlMachine=sirius
ControlAddr=192.168.1.100
AuthType=auth/munge
CacheGroups=1
CryptoType=crypto/munge
Epilog=/home/Software/etc/slurm.epilog.clean
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log

NodeName=node-[1-3] NodeAddr=192.168.1.[101-103] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN

PartitionName=datasys Nodes=node-[1-3] Default=YES MaxTime=INFINITE State=UP
----------------------
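
Since the slurmd log above shows registration with the controller failing while munge and ssh succeed locally, one thing worth ruling out is whether the compute node can open a TCP connection to slurmctld at all. Below is a minimal sketch in Python, assuming Python 3 is available on the node. The helper name can_connect is hypothetical, and the address and port are the ControlAddr and SlurmctldPort values from the config above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed values: ControlAddr and SlurmctldPort from the slurm.conf above.
    controller = ("192.168.1.100", 6817)
    print("slurmctld reachable:", can_connect(*controller))
```

If this prints False, a firewall or routing issue between the node and the controller would be a likely suspect. If it prints True, the port is reachable and the cause probably lies elsewhere; this check says nothing about authentication between hosts.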




On 11/05/2012 01:33 PM, Andy Riebs wrote:

Tony,

For slurm problems, it's generally very important to list

  * SLURM version
  * slurm.conf

In this case, it looks like you probably want to set "ReturnToService" in slurm.conf.


Happy Slurming!
Andy

On 11/05/2012 01:38 PM, Tony wrote:

Hello,
I'm a PhD student in the CS department at Illinois Institute of Technology.
Recently I have been trying to use Slurm on my virtual cluster, which has 92
nodes. I successfully installed Munge and Slurm on all nodes, and
everything seemed fine. But after a system reboot, Slurm stopped working.
sinfo shows all nodes are down, and scontrol show nodes gives info like this:

NodeName=node-1 Arch=x86_64 CoresPerSocket=1
     CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
     Gres=(null)
     NodeAddr=192.168.1.101 NodeHostName=node-1
     OS=Linux RealMemory=1 Sockets=1
     State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
     BootTime=2012-11-04T22:05:09 SlurmdStartTime=2012-11-05T06:49:45
     Reason=Node unexpectedly rebooted [slurm@2012-11-04T21:17:27]
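
For scripting around this, the record above can be split into key/value pairs so that DOWN nodes and their reasons are easy to pick out. This is a minimal sketch in Python, assuming the output is laid out exactly like the sample quoted here (other Slurm versions may format it differently). The function name parse_node_record is hypothetical.

```python
def parse_node_record(text: str) -> dict:
    """Parse one node record from `scontrol show nodes` into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Reason is free text containing spaces, so keep the rest of the line.
        if line.startswith("Reason="):
            fields["Reason"] = line[len("Reason="):]
            continue
        # All other fields are whitespace-separated Key=Value tokens.
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
    return fields


# Sample record copied from the output quoted in this message.
SAMPLE = """\
NodeName=node-1 Arch=x86_64 CoresPerSocket=1
     CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
     Gres=(null)
     NodeAddr=192.168.1.101 NodeHostName=node-1
     OS=Linux RealMemory=1 Sockets=1
     State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
     BootTime=2012-11-04T22:05:09 SlurmdStartTime=2012-11-05T06:49:45
     Reason=Node unexpectedly rebooted [slurm@2012-11-04T21:17:27]
"""

record = parse_node_record(SAMPLE)
if record.get("State") == "DOWN":
    print(record["NodeName"], "is DOWN:", record["Reason"])
```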

I googled the reason but didn't find any useful info. I grepped the source
code and found that it happens in src/slurmctld/node_mgr.c when
node_ptr->reason is false, which means no reason?

Could you do me a favour and have a look at this problem?

Thanks a lot,
-Tony


--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP
