Tony,
When reporting Slurm problems, it's generally very important to list
* SLURM version
* slurm.conf
In this case, it looks like you probably want to set "ReturnToService"
in slurm.conf.
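As a minimal sketch, and assuming your cluster is currently running with the default of ReturnToService=0 (I have not seen your slurm.conf), the relevant line would look something like the fragment below. The value 2 is my assumption for your case: per the slurm.conf man page, 1 only returns nodes that were marked DOWN for being non-responsive, while 2 returns a DOWN node to service whenever it registers with a valid configuration, which should cover a node marked down after an unexpected reboot.

```
# slurm.conf (fragment; keep the file identical on the controller and all nodes)
# 0 = DOWN nodes stay DOWN until an administrator changes their state
# 1 = return to service only if the node was marked DOWN for being non-responsive
# 2 = return to service whenever the node registers with a valid configuration
ReturnToService=2
```

After editing slurm.conf the daemons need to pick up the change (for example with "scontrol reconfigure"). Nodes that are already DOWN can also be brought back by hand with something like "scontrol update NodeName=node-1 State=RESUME", using the node name from your output.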
Happy Slurming!
Andy
On 11/05/2012 01:38 PM, Tony wrote:
Hello,
I'm a PhD student in the CS department at Illinois Institute of Technology.
Recently I've been trying to use Slurm on my virtual cluster, which has 92
nodes. I successfully installed Munge and Slurm on all nodes, and everything
seemed fine. But after a system reboot Slurm stopped working.
sinfo shows all nodes are down. scontrol show nodes gives info like this:
NodeName=node-1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
Gres=(null)
NodeAddr=192.168.1.101 NodeHostName=node-1
OS=Linux RealMemory=1 Sockets=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
BootTime=2012-11-04T22:05:09 SlurmdStartTime=2012-11-05T06:49:45
Reason=Node unexpectedly rebooted [slurm@2012-11-04T21:17:27]
I googled the reason but didn't find any useful info. I grepped the source
code and found that it happens when node_ptr->reason in
src/slurmctld/node_mgr.c is false, which means no reason was set?
Could you do me a favour and have a look at this problem?
Thanks a lot,
-Tony
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP