Hi, I'm trying to set up a heterogeneous cluster, but I'm having trouble keeping some of the nodes up. All nodes have SLURM 2.3.2 installed, and they all use the same slurm.conf file (linked with ln from an NFS share).

Here's sinfo =>

[root@ORL-GASLDGEN1 ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
orl*      up    infinite      1 comp* ORLGAS1
orl*      up    infinite      3 idle* ORL-JMTR[2-4]
orl*      up    infinite      4 down* ORL-APP[3,5-6],ORL-STORE
orl*      up    infinite      9 idle  ORL-APP[2,4,01],ORL-GASLDGEN[1-4],ORL-JMTRLDGEN1-38,ORLGAS2

Four nodes are down, so I get their info:

[root@ORL-GASLDGEN1 ~]# scontrol show node ORL-APP[3,5-6],ORL-STORE
NodeName=ORL-APP3 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.45 NodeHostName=ORL-APP3
   OS=Linux RealMemory=461 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:22:58 SlurmdStartTime=2012-01-17T17:37:56
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:37:00]

NodeName=ORL-APP5 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.47 NodeHostName=ORL-APP5
   OS=Linux RealMemory=629 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:23:06 SlurmdStartTime=2012-01-17T17:42:55
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:38:27]

NodeName=ORL-APP6 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.48 NodeHostName=ORL-APP6
   OS=Linux RealMemory=629 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:23:06 SlurmdStartTime=2012-01-17T17:43:22
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:39:30]

NodeName=ORL-STORE Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.39 NodeHostName=ORL-STORE
   OS=Linux RealMemory=750 Sockets=2
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:22:40 SlurmdStartTime=2012-01-17T17:34:36
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:34:36]

They all say 'Low socket*core*thread count', I have checked and double-checked the configuration of these machines.

Here's the bottom of the slurm.conf. I don't see the problem with this configuration, but perhaps I'm just being a newbie -- hoping someone on the list might be able to point out the issue?

Thanks in advance,
Davis

# COMPUTE NODES
##############################
# 2 CPUs, 1 Core/CPU, 1 ThreadsPerCore
##############################
NodeName=ORL-STORE NodeAddr=192.168.206.39 Procs=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=500 TmpDisk=200 State=UNKNOWN
##############################
# 2 CPUs, 1 Core/CPU, 2 ThreadsPerCore
##############################
NodeName=ORLGAS[1-2] NodeAddr=192.168.206.[32-33] Procs=2 CoresPerSocket=1 ThreadsPerCore=2 Realmemory=500 TmpDisk=200 State=UNKNOWN
##############################
# 2 CPUs, 2 Cores/CPU
##############################
NodeName=ORL-GASLDGEN[1-4],ORL-APP[01,2] NodeAddr=192.168.206.[34-37,43-44] Procs=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN
##############################
# 4 CPUs, 1 Core/CPU
##############################
NodeName=ORL-APP[3,5-6] NodeAddr=192.168.206.[45,47-48] Procs=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN
##############################
# 4 CPUs, 4 Cores/CPU
##############################
NodeName=ORL-JMTRLDGEN1-38,ORL-JMTR[2-4],ORL-APP4 NodeAddr=192.168.206.[38,40-42,46] Procs=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN

PartitionName=orl Nodes=ORL-GASLDGEN[1-4],ORLGAS[1-2],ORL-JMTRLDGEN1-38,ORL-STORE,ORL-JMTR[2-4],ORL-APP[01,2-6] Default=YES MaxTime=INFINITE State=UP
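
For reference, the arithmetic behind my double-checking can be sketched as below. This is only a sanity check on the numbers posted in this message, not a reproduction of what slurmctld does: I am assuming the reason string refers to the product Sockets * CoresPerSocket * ThreadsPerCore, and that it is compared against the Procs value from slurm.conf. Both assumptions are guesses on my part, and the helper names are made up for illustration.

```python
# Values copied by hand from the scontrol output and slurm.conf in this post.
reported = {
    "ORL-APP3":  {"Sockets": 4, "CoresPerSocket": 1, "ThreadsPerCore": 1},
    "ORL-APP5":  {"Sockets": 4, "CoresPerSocket": 1, "ThreadsPerCore": 1},
    "ORL-APP6":  {"Sockets": 4, "CoresPerSocket": 1, "ThreadsPerCore": 1},
    "ORL-STORE": {"Sockets": 2, "CoresPerSocket": 1, "ThreadsPerCore": 1},
}

# Procs= values from the matching NodeName lines in slurm.conf.
configured_procs = {
    "ORL-APP3": 4,
    "ORL-APP5": 4,
    "ORL-APP6": 4,
    "ORL-STORE": 2,
}


def product(hw):
    """Sockets * CoresPerSocket * ThreadsPerCore for one node."""
    return hw["Sockets"] * hw["CoresPerSocket"] * hw["ThreadsPerCore"]


def low_count_nodes(reported, configured_procs):
    """Nodes whose reported product is below the configured Procs.

    ASSUMPTION: this comparison is my guess at what the reason string
    means; the real check inside SLURM may use different fields.
    """
    return [
        name
        for name, hw in reported.items()
        if product(hw) < configured_procs[name]
    ]


if __name__ == "__main__":
    for name, hw in reported.items():
        print(name, "reported:", product(hw), "configured:", configured_procs[name])
    print("nodes below configured count:", low_count_nodes(reported, configured_procs))
```

With the figures above, this naive comparison flags no nodes, which is why the reason string puzzles me.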
