Try starting slurmd manually with the -C option to print the resource counts detected on each compute node:
$ ./slurmd -C
NodeName=tux4 Procs=6 Sockets=1 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=8000 TmpDisk=930837
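slurmd -C prints the values in slurm.conf format, so its output can be pasted more or less directly into the node definition. The 'Low socket*core*thread count' reason is generally set by slurmctld when a node registers fewer sockets*cores*threads than its slurm.conf entry specifies, so comparing that output against your NodeName lines is usually the quickest way to spot the mismatch. A sketch using the tux4 values above (not your nodes):

NodeName=tux4 Procs=6 Sockets=1 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=8000 TmpDisk=930837 State=UNKNOWN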
You can also try starting slurmctld with more verbose logging:
$ ./slurmctld -vvv
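On one of the nodes that keeps going down, it can also help to run slurmd itself in the foreground with verbose logging and watch what it registers (-D keeps it in the foreground):

$ ./slurmd -D -vvv

The slurmctld log (the file named by SlurmctldLogFile in slurm.conf) should also record why each node was marked DOWN.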
Quoting Davis Ford <davisf...@gmail.com>:
Hi, I'm trying to set up a heterogeneous cluster, but I'm having some trouble keeping some of the nodes up. All nodes have slurm 2.3.2 installed, and they all use the same slurm.conf file (symlinked from an NFS share). Here's sinfo =>
[root@ORL-GASLDGEN1 ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
orl* up infinite 1 comp* ORLGAS1
orl* up infinite 3 idle* ORL-JMTR[2-4]
orl* up infinite 4 down* ORL-APP[3,5-6],ORL-STORE
orl* up infinite 9 idle ORL-APP[2,4,01],ORL-GASLDGEN[1-4],ORL-JMTRLDGEN1-38,ORLGAS2
Four nodes are down, so I get their info:
[root@ORL-GASLDGEN1 ~]# scontrol show node ORL-APP[3,5-6],ORL-STORE
NodeName=ORL-APP3 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.45 NodeHostName=ORL-APP3
   OS=Linux RealMemory=461 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:22:58 SlurmdStartTime=2012-01-17T17:37:56
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:37:00]
NodeName=ORL-APP5 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.47 NodeHostName=ORL-APP5
   OS=Linux RealMemory=629 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:23:06 SlurmdStartTime=2012-01-17T17:42:55
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:38:27]
NodeName=ORL-APP6 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.48 NodeHostName=ORL-APP6
   OS=Linux RealMemory=629 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:23:06 SlurmdStartTime=2012-01-17T17:43:22
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:39:30]
NodeName=ORL-STORE Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.39 NodeHostName=ORL-STORE
   OS=Linux RealMemory=750 Sockets=2
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:22:40 SlurmdStartTime=2012-01-17T17:34:36
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:34:36]
They all say 'Low socket*core*thread count'. I have checked and double-checked the configuration of these machines. Here's the bottom of the slurm.conf. I don't see the problem with this configuration, but perhaps I'm just being a newbie -- hoping someone on the list can point out the issue.
Thanks in advance,
Davis
# COMPUTE NODES
##############################
# 2 CPUs, 1 Core/CPU, 1 ThreadsPerCore
##############################
NodeName=ORL-STORE NodeAddr=192.168.206.39 Procs=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=500 TmpDisk=200 State=UNKNOWN
##############################
# 2 CPUs, 1 Core/CPU, 2 ThreadsPerCore
##############################
NodeName=ORLGAS[1-2] NodeAddr=192.168.206.[32-33] Procs=2 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=500 TmpDisk=200 State=UNKNOWN
##############################
# 2 CPUs, 2 Cores/CPU
##############################
NodeName=ORL-GASLDGEN[1-4],ORL-APP[01,2] NodeAddr=192.168.206.[34-37,43-44] Procs=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN
##############################
# 4 CPUs, 1 Core/CPU
##############################
NodeName=ORL-APP[3,5-6] NodeAddr=192.168.206.[45,47-48] Procs=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN
##############################
# 4 CPUs, 4 Cores/CPU
##############################
NodeName=ORL-JMTRLDGEN1-38,ORL-JMTR[2-4],ORL-APP4 NodeAddr=192.168.206.[38,40-42,46] Procs=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN
PartitionName=orl Nodes=ORL-GASLDGEN[1-4],ORLGAS[1-2],ORL-JMTRLDGEN1-38,ORL-STORE,ORL-JMTR[2-4],ORL-APP[01,2-6] Default=YES MaxTime=INFINITE State=UP
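For the four nodes that keep going down, the scontrol output above reports Sockets=4 CoresPerSocket=1 ThreadsPerCore=1 (Sockets=2 for ORL-STORE). One thing worth trying -- an untested sketch, with the geometry taken from that scontrol output rather than from slurmd -C run on the nodes themselves -- is spelling out the socket/core/thread layout explicitly instead of relying on Procs alone, e.g.:

NodeName=ORL-APP[3,5-6] NodeAddr=192.168.206.[45,47-48] Sockets=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN

Once the configuration matches what the nodes actually report, they can be returned to service with something like:

$ scontrol update NodeName=ORL-APP[3,5-6],ORL-STORE State=RESUME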