You are overlooking the "boot_time" check in that if test. See
"ReturnToService" in "man slurm.conf".
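
For reference, a minimal slurm.conf sketch of the setting mentioned above.
The value 2 is an assumption about what fits this case (nodes marked DOWN
with "Node unexpectedly rebooted"); check the man page for your Slurm
version before relying on it.

```
# slurm.conf (fragment); the value chosen here is illustrative
# 0: a DOWN node stays DOWN until an administrator changes its state
# 1: a DOWN node returns to service on registration only if it was
#    set DOWN for being non-responsive
# 2: a DOWN node returns to service on registration with a valid
#    configuration, whatever the reason it was set DOWN
ReturnToService=2
```

Nodes that are already DOWN can also be returned to service by hand, for
example with "scontrol update NodeName=node-1 State=RESUME".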

Quoting Tony <[email protected]>:

>
> Hello,
> I'm a PhD student in the CS department at Illinois Institute of Technology.
> Recently I've been trying to use Slurm on my virtual cluster, which has 92
> nodes. I successfully installed Munge and Slurm on all nodes, and everything
> seemed fine. But after a system reboot, Slurm stopped working.
> sinfo shows all nodes as down. scontrol show nodes gives info like this:
>
> NodeName=node-1 Arch=x86_64 CoresPerSocket=1
>     CPUAlloc=0 CPUErr=0 CPUTot=1 Features=(null)
>     Gres=(null)
>     NodeAddr=192.168.1.101 NodeHostName=node-1
>     OS=Linux RealMemory=1 Sockets=1
>     State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1
>     BootTime=2012-11-04T22:05:09 SlurmdStartTime=2012-11-05T06:49:45
>     Reason=Node unexpectedly rebooted [slurm@2012-11-04T21:17:27]
>
> I googled the reason but didn't find any useful info. I grepped the source
> code and found that it happens when
> src/slurmctld/node_mgr.c:node_ptr->reason is false, which means no reason?
>
> Could you do me a favour and have a look at this problem?
>
> Thanks a lot,
> -Tony
>
