You are ignoring the "boot_time" in the if test. See "ReturnToService" in "man slurm.conf".
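
As a rough sketch of what that setting looks like in slurm.conf (the value 2 here is an assumption about the behaviour you want, not something taken from your configuration; check the man page for your Slurm version before relying on it):

```
# slurm.conf (fragment)
# ReturnToService controls when a DOWN node is returned to service:
#   0  node stays DOWN until an administrator changes its state (default)
#   1  node returns automatically only if it was set DOWN for being
#      non-responsive
#   2  node returns automatically once it registers with a valid
#      configuration, whatever the reason it was set DOWN
#      (this includes an unexpected reboot)
ReturnToService=2
```

Nodes that are already DOWN can probably be brought back by hand with something like "scontrol update NodeName=node-1 State=RESUME", and "scontrol reconfigure" should make slurmctld re-read the changed slurm.conf.
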
Quoting Tony <[email protected]>:

>
> Hello,
> I'm a PhD student in CS department, Illinois Institute of Technology.
> Recently I'm trying to use Slurm on my virtual cluster which has 92
> nodes. I successfully installed Munge and Slurm on all nodes. It seems
> everything's fine. But after a system reboot Slurm stops working.
> Sinfo shows all nodes are down. scontrol show nodes gives info like this:
>
> NodeName=node-1 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
>    Gres=(null)
>    NodeAddr=192.168.1.101 NodeHostName=node-1
>    OS=Linux RealMemory=1 Sockets=1
>    State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
>    BootTime=2012-11-04T22:05:09 SlurmdStartTime=2012-11-05T06:49:45
>    Reason=Node unexpectedly rebooted [slurm@2012-11-04T21:17:27]
>
> I googled the reason but didn't find any useful info. I grep the source
> code and find it happeds when the
> src/slurmctld/node_mgr.c:node_ptr->reason is false, which means no reason?
>
> Could you do me a favour and have a look on this problem?
>
> Thanks a lot,
> -Tony
>
