Tony,

For Slurm problems, it's generally very important to list

 * SLURM version
 * slurm.conf

In this case, it looks like you probably want to set "ReturnToService" in slurm.conf.
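
As a rough sketch of what that could look like (the value 2 and the node range node-[1-92] are assumptions on my part, based on your report of 92 nodes named like node-1, so adjust them to your setup): per the slurm.conf man page, ReturnToService=2 lets a DOWN node return to service once it registers with a valid configuration, whatever the reason it was marked DOWN, while ReturnToService=1 only covers nodes marked DOWN for being non-responsive. Nodes that are already DOWN can be resumed by hand with scontrol.

```
# slurm.conf (keep the file identical on all nodes)
# Assumed value: 2 also covers "Node unexpectedly rebooted"
ReturnToService=2

# On the controller, after updating slurm.conf:
scontrol reconfigure

# Return the currently DOWN nodes to service
# (node-[1-92] is a hypothetical range; use your real node names)
scontrol update NodeName=node-[1-92] State=RESUME
```

Afterwards, sinfo should show whether the nodes have left the DOWN state.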

Happy Slurming!
Andy

On 11/05/2012 01:38 PM, Tony wrote:
Hello,
I'm a PhD student in the CS department at Illinois Institute of Technology.
I have recently been trying to use Slurm on my virtual cluster, which has 92
nodes. I successfully installed Munge and Slurm on all nodes, and everything
seemed fine. But after a system reboot, Slurm stopped working.
sinfo shows all nodes as down. scontrol show nodes gives info like this:

NodeName=node-1 Arch=x86_64 CoresPerSocket=1
     CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
     Gres=(null)
     NodeAddr=192.168.1.101 NodeHostName=node-1
     OS=Linux RealMemory=1 Sockets=1
     State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
     BootTime=2012-11-04T22:05:09 SlurmdStartTime=2012-11-05T06:49:45
     Reason=Node unexpectedly rebooted [slurm@2012-11-04T21:17:27]

I googled the reason but didn't find any useful info. I grepped the source
code and found that it happens when node_ptr->reason in
src/slurmctld/node_mgr.c is false, which means no reason?

Could you do me a favour and have a look at this problem?

Thanks a lot,
-Tony

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP