Try starting slurmd manually with the -C option to print the hardware counts detected on each compute node:
$ ./slurmd -C
NodeName=tux4 Procs=6 Sockets=1 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=8000 TmpDisk=930837

You can also try starting slurmctld with more verbose logging:
$ ./slurmctld -vvv

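Once you have the slurmd -C output, the useful check is whether the detected counts are at least what slurm.conf advertises. A minimal sketch of that comparison (the sample line below is copied from your scontrol output; on a real node you would capture it live with something like `detected=$(/usr/sbin/slurmd -C | head -1)` — the slurmd path is an assumption for your install):

```shell
#!/bin/sh
# Hypothetical sketch: compare the CPU count slurmd detects on a node
# against the Procs= value configured for it in slurm.conf.
# The sample line is taken from the ORL-APP3 output in this thread;
# capture it live on the node in a real check.
detected='NodeName=ORL-APP3 Procs=4 Sockets=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=461'
configured_procs=4   # Procs= from the ORL-APP[3,5-6] line in slurm.conf

# Split the key=value pairs onto separate lines and pull out Procs=.
detected_procs=$(echo "$detected" | tr ' ' '\n' | sed -n 's/^Procs=//p')

if [ "$detected_procs" -lt "$configured_procs" ]; then
    echo "Low CPU count: detected $detected_procs, configured $configured_procs"
else
    echo "CPU count OK: detected $detected_procs, configured $configured_procs"
fi
```

The same pattern works for RealMemory and TmpDisk; a node is marked DOWN when any detected value falls below what the controller was told to expect.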

Quoting Davis Ford <davisf...@gmail.com>:

Hi, I'm trying to set up a heterogeneous cluster, but I'm having some
trouble keeping some of the nodes up.  All nodes have slurm 2.3.2
installed, and they all use the same slurm.conf file (symlinked from an NFS share).

Here's sinfo =>

[root@ORL-GASLDGEN1 ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
orl*         up   infinite      1  comp* ORLGAS1
orl*         up   infinite      3  idle* ORL-JMTR[2-4]
orl*         up   infinite      4  down* ORL-APP[3,5-6],ORL-STORE
orl*         up   infinite      9   idle ORL-APP[2,4,01],ORL-GASLDGEN[1-4],ORL-JMTRLDGEN1-38,ORLGAS2

Four nodes are down, so I get their info:

[root@ORL-GASLDGEN1 ~]# scontrol show node ORL-APP[3,5-6],ORL-STORE
NodeName=ORL-APP3 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.45 NodeHostName=ORL-APP3
   OS=Linux RealMemory=461 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:22:58 SlurmdStartTime=2012-01-17T17:37:56
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:37:00]

NodeName=ORL-APP5 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.47 NodeHostName=ORL-APP5
   OS=Linux RealMemory=629 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:23:06 SlurmdStartTime=2012-01-17T17:42:55
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:38:27]

NodeName=ORL-APP6 Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=4 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.48 NodeHostName=ORL-APP6
   OS=Linux RealMemory=629 Sockets=4
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:23:06 SlurmdStartTime=2012-01-17T17:43:22
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:39:30]

NodeName=ORL-STORE Arch=i686 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=2 Features=(null)
   Gres=(null)
   NodeAddr=192.168.206.39 NodeHostName=ORL-STORE
   OS=Linux RealMemory=750 Sockets=2
   State=DOWN* ThreadsPerCore=1 TmpDisk=3852 Weight=1
   BootTime=2011-10-31T20:22:40 SlurmdStartTime=2012-01-17T17:34:36
   Reason=Low socket*core*thread count [slurm@2012-01-17T17:34:36]

They all report 'Low socket*core*thread count'.  I have checked and
double-checked the configuration of these machines.  Here's the bottom of
the slurm.conf.  I don't see the problem with this configuration, but
perhaps I'm just being a newbie -- hoping someone on the list might be able
to point out the issue?

Thanks in advance,
Davis

# COMPUTE NODES
##############################
# 2 CPUs, 1 Core/CPU, 1 ThreadsPerCore
##############################
NodeName=ORL-STORE NodeAddr=192.168.206.39 Procs=2 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=500 TmpDisk=200 State=UNKNOWN

##############################
# 2 CPUs, 1 Core/CPU, 2 ThreadsPerCore
##############################
NodeName=ORLGAS[1-2] NodeAddr=192.168.206.[32-33] Procs=2 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=500 TmpDisk=200 State=UNKNOWN

##############################
# 2 CPUs, 2 Cores/CPU
##############################
NodeName=ORL-GASLDGEN[1-4],ORL-APP[01,2] NodeAddr=192.168.206.[34-37,43-44] Procs=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN

##############################
# 4 CPUs, 1 Core/CPU
##############################
NodeName=ORL-APP[3,5-6] NodeAddr=192.168.206.[45,47-48] Procs=4 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN

##############################
# 4 CPUs, 4 Cores/CPU
##############################
NodeName=ORL-JMTRLDGEN1-38,ORL-JMTR[2-4],ORL-APP4 NodeAddr=192.168.206.[38,40-42,46] Procs=4 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=300 TmpDisk=200 State=UNKNOWN

PartitionName=orl Nodes=ORL-GASLDGEN[1-4],ORLGAS[1-2],ORL-JMTRLDGEN1-38,ORL-STORE,ORL-JMTR[2-4],ORL-APP[01,2-6] Default=YES MaxTime=INFINITE State=UP
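One detail worth double-checking against the quoted config (an observation, not a confirmed diagnosis): when Sockets= is omitted from a node definition, Slurm derives it as Procs / (CoresPerSocket * ThreadsPerCore).  Working that through for a few of the definitions above:

```
ORL-STORE:       Procs=2 / (1 core * 1 thread)  -> Sockets=2  (scontrol shows Sockets=2)
ORL-APP[3,5-6]:  Procs=4 / (1 core * 1 thread)  -> Sockets=4  (scontrol shows Sockets=4)
ORLGAS[1-2]:     Procs=2 / (1 core * 2 threads) -> Sockets=1
```

If the derived socket*core*thread product on any node comes out larger than what slurmd -C detects on the real hardware, that node will be set DOWN with exactly the reason shown above.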



