Wait, I'm sorry, I must be missing something, please bear with me!
By the way, your discussion of groups 1 and 2 below is wrong. Group 2 doesn't
say that NUMA node == socket, and it doesn't report 8 sockets of 8 cores each.
It reports 4 sockets, each containing 2 NUMA nodes, each of which contains 8
cores.
You have found what we also found (in other areas of Open MPI as well): Slurm
has some interesting behaviors.
If it was easy, anyone could do it
Ken
==
Kenneth A. Lloyd, Jr.
CEO - Director, Systems Science
Watt Systems Technologies Inc.
From: hwloc-users
On 28/05/2014 14:57, Craig Kapfer wrote:
Hmm ... the slurm config defines that all nodes have 4 sockets with 16 cores
per socket (which corresponds to the hardware--all nodes are the same). Slurm
node config is as follows:
NodeName=n[001-008] RealMemory=258452 Sockets=4 CoresPerSocket=16
ThreadsPerCore=1 State=UNKNOWN
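As a quick sanity check on the numbers in this thread, here is a minimal
Python sketch that multiplies out the node line above. The helper names
parse_node_line and configured_cpus are made up for illustration and are not
part of Slurm or hwloc; the sketch only splits the line on whitespace and "="
and does not try to handle the full slurm.conf syntax.

```python
def parse_node_line(line):
    """Split a slurm.conf-style node line into a dict of key/value strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not KEY=VALUE
            fields[key] = value
    return fields


def configured_cpus(fields):
    """Sockets * CoresPerSocket * ThreadsPerCore; missing factors count as 1
    (an assumption made for this sketch)."""
    return (int(fields.get("Sockets", 1))
            * int(fields.get("CoresPerSocket", 1))
            * int(fields.get("ThreadsPerCore", 1)))


# Node line as quoted in this thread, joined onto one logical line.
node_line = ("NodeName=n[001-008] RealMemory=258452 Sockets=4 "
             "CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN")
fields = parse_node_line(node_line)
print(configured_cpus(fields))
```

This gives 64, the same total as the 4 sockets x 2 NUMA nodes x 8 cores
layout mentioned elsewhere in the thread, so the CPU count alone cannot tell
the two layouts apart.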
On 28/05/2014 14:13, Craig Kapfer wrote:
Interesting, quite right, thank you very much. Yes these are AMD 6300 series.
Same kernel but these boxes seem to have different BIOS versions, direct from
the factory, delivered in the same physical enclosure even! Some are AMI 3.5
and some are 3.0.
So slurm is then incorrectly parsing
Aside from the BIOS config, are you sure that you have the exact same BIOS
*version* in each node? (You can check in /sys/class/dmi/id/bios_*.) Same
Linux kernel too?
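For comparing nodes, something like the following Python sketch could dump
those files on each machine. The function name read_bios_info is made up for
illustration; the sketch assumes a Linux system that exposes
/sys/class/dmi/id and simply skips any file it cannot read.

```python
import glob
import os


def read_bios_info(dmi_dir="/sys/class/dmi/id"):
    """Return {filename: contents} for every readable bios_* file in dmi_dir."""
    info = {}
    for path in sorted(glob.glob(os.path.join(dmi_dir, "bios_*"))):
        try:
            with open(path) as handle:
                info[os.path.basename(path)] = handle.read().strip()
        except OSError:
            # Unreadable entries (permissions, odd file types) are skipped.
            continue
    return info


if __name__ == "__main__":
    for name, value in read_bios_info().items():
        print(f"{name}: {value}")
```

Running it on every node and diffing the output should show quickly whether
the BIOS versions really are identical.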
Also, recently we've seen somebody fix such problems by unplugging and
replugging some CPUs on the motherboard. Seems crazy but it
We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers and some
of them are reporting the following error from slurm, which I gather gets its
info from hwloc:
May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware:
CPUs=64:64(hw) Boards=1:1(hw)
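To see which field slurmd is actually complaining about, a small Python
sketch like the one below can pick out the pairs that differ. It assumes each
field has the form Name=configured:hardware(hw), which is inferred only from
the excerpt above; the function name mismatched_fields is made up for
illustration.

```python
import re

# Assumed field shape, inferred from the log excerpt: Name=<config>:<hw>(hw)
PAIR = re.compile(r"(\w+)=(\d+):(\d+)\(hw\)")


def mismatched_fields(message):
    """Return {field: (configured, hardware)} for pairs whose values differ."""
    diffs = {}
    for name, configured, hardware in PAIR.findall(message):
        if configured != hardware:
            diffs[name] = (int(configured), int(hardware))
    return diffs


# Only the part of the message quoted in this thread.
sample = ("Node configuration differs from hardware: "
          "CPUs=64:64(hw) Boards=1:1(hw)")
print(mismatched_fields(sample))
```

On the excerpt quoted here this finds no mismatch, since both CPUs and Boards
agree, so the differing value would have to be in a part of the message that
is not shown above.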