Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Jirka Hladky
>
> Yes, that is why hwloc provides both, and hwloc-calc can be used to
> convert between them for instance.


Exactly! I use hwloc-calc quite heavily to connect both worlds.

On Fri, Sep 6, 2019 at 5:21 PM Samuel Thibault wrote:

> Jirka Hladky wrote on Fri, 06 Sep 2019 16:52:30 +0200:
> > The trouble is that other Linux tools (like ps) are using the physical
> > numbering.
>
> Yes, that is why hwloc provides both, and hwloc-calc can be used to
> convert between them for instance.
>
> Samuel
> ___
> hwloc-devel mailing list
> hwloc-devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-devel
>


-- 
-Jirka

Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Samuel Thibault
Jirka Hladky wrote on Fri, 06 Sep 2019 16:52:30 +0200:
> The trouble is that other Linux tools (like ps) are using the physical
> numbering.

Yes, that is why hwloc provides both, and hwloc-calc can be used to
convert between them for instance.
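
To show what such a conversion table looks like, here is a minimal Python
sketch. It assumes the Linux sysfs layout used elsewhere in this thread
(cpuN/topology/physical_package_id and core_id) and orders PUs by
(package ID, core ID, OS index). That ordering is an assumption made for
illustration only; hwloc computes its logical indexes from its own topology
tree, so hwloc-calc remains the authoritative tool. The name pu_table is
hypothetical.

```python
import os
import re


def read_int(path):
    """Read a single integer from a sysfs-style text file."""
    with open(path) as f:
        return int(f.read().strip())


def pu_table(sysfs_root="/sys/devices/system/cpu"):
    """Return a list of (logical_index, os_index, package_id, core_id).

    The logical index here is simply the position after sorting by
    (package_id, core_id, os_index). This is an assumption of the sketch,
    not hwloc's actual algorithm.
    """
    if not os.path.isdir(sysfs_root):
        return []
    entries = []
    for name in os.listdir(sysfs_root):
        m = re.fullmatch(r"cpu(\d+)", name)
        if not m:
            continue  # skip cpufreq, cpuidle, online, ...
        topo = os.path.join(sysfs_root, name, "topology")
        try:
            pkg = read_int(os.path.join(topo, "physical_package_id"))
            core = read_int(os.path.join(topo, "core_id"))
        except (OSError, ValueError):
            continue  # e.g. an offline CPU without topology files
        entries.append((pkg, core, int(m.group(1))))
    entries.sort()
    return [(logical, os_idx, pkg, core)
            for logical, (pkg, core, os_idx) in enumerate(entries)]


if __name__ == "__main__":
    for logical, os_idx, pkg, core in pu_table():
        print(f"logical {logical}: OS index {os_idx} "
              f"(package {pkg}, core {core})")
```

The OS index column is the number that tools such as ps report, so a table
like this lets you look up the compact index that corresponds to it.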

Samuel


Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Jeff Squyres (jsquyres) via hwloc-devel
On Sep 6, 2019, at 10:52 AM, Jirka Hladky <jhla...@redhat.com> wrote:

>> You should avoid physical numbering at any cost.
>
> The trouble is that other Linux tools (like ps) are using the physical
> numbering. I will need to think about how to work around this.

Use hwloc for everything!  ;-)

(I'm only half-kidding, actually -- FWIW, I use hwloc for everything these 
days.  The logical ordering alone makes it worthwhile.  The simplicity of 
hwloc-bind's socket:X,core:Y is even more lovely.  ...etc.)

--
Jeff Squyres
jsquy...@cisco.com


Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Jirka Hladky
>
> You should avoid physical numbering at any cost.


The trouble is that other Linux tools (like ps) are using the physical
numbering. I will need to think about how to work around this.

On Fri, Sep 6, 2019 at 4:46 PM Guillaume Mercier <
guillaume.merc...@u-bordeaux.fr> wrote:

>
> Hi,
>
> You should avoid physical numbering at any cost.
>
> Guillaume
>
> On 9/6/19 4:38 PM, Jirka Hladky wrote:
> > Thanks for the feedback! I had never seen anything like that, so I
> > assumed it was a bug :-)
> >
> > I was already thinking about using the logical numbering - it's probably
> > the best solution.
> >
> > Thank you very much!
> > Jirka
> >
> > On Fri, Sep 6, 2019 at 4:13 PM Samuel Thibault wrote:
> >
> > Brice Goglin wrote on Fri, 06 Sep 2019 16:07:13 +0200:
> > > physical_package_id values don't have to be between 0 and N-1,
> >
> > Which is the very reason for the logical IDs that hwloc provides :)
> >
> > Samuel
> >
> >
> >
> > --
> > -Jirka
> >
> >
>


-- 
-Jirka

Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Jirka Hladky
Thanks for the feedback! I had never seen anything like that, so I
assumed it was a bug :-)

I was already thinking about using the logical numbering - it's probably
the best solution.

Thank you very much!
Jirka

On Fri, Sep 6, 2019 at 4:13 PM Samuel Thibault wrote:

> Brice Goglin wrote on Fri, 06 Sep 2019 16:07:13 +0200:
> > physical_package_id values don't have to be between 0 and N-1,
>
> Which is the very reason for the logical IDs that hwloc provides :)
>
> Samuel
>


-- 
-Jirka

Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Samuel Thibault
Brice Goglin wrote on Fri, 06 Sep 2019 16:07:13 +0200:
> physical_package_id values don't have to be between 0 and N-1,

Which is the very reason for the logical IDs that hwloc provides :)

Samuel


Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Jeff Squyres (jsquyres) via hwloc-devel
FWIW / in addition to what Brice said: this is why hwloc also has "logical" 
ordering (in addition to the "physical" ordering).


On Sep 6, 2019, at 10:07 AM, Brice Goglin <brice.gog...@inria.fr> wrote:


Hello Jirka

I don't think there's a bug here.

physical_package_id values don't have to be between 0 and N-1; they just have
to be different so that packages, and cores in different packages, can be
distinguished. Other values are uncommon on x86 but quite common on POWER at
least.

core_id values are even worse. Fortunately, they are basically not used at
all. They are often the same in both sockets. They are often discontiguous
inside sockets (maybe because CPU vendors disable specific cores in the
middle of the CPU when your CPU doesn't have the max number of cores).
On a dual-socket 20-core Xeon (Cascade Lake), both sockets have these
core_ids: 0,4,1,3,2,12,8,11,9,10,16,20,17,19,18,28,24,27,25,26 (5-7,
13-15 and 21-23 are missing).

PUs and NUMA nodes often have contiguous OS indexes, but not necessarily
in order either.

FWIW, I get the same values as yours on a Gigabyte platform with 2x ThunderX2 
running RHEL7 4.14 kernel.

Brice



On 06/09/2019 at 15:29, Jiri Hladky wrote:
Hi all!

We are seeing strange CPU topology/numbering on a dual-socket ARM server with 
2×ThunderX2 CN9975 CPU [0].

Package IDs:
36 and 3180
$ cd /sys/devices/system/cpu
$ cat cpu0/topology/physical_package_id
36
Expected values: 0 and 1

Core IDs on the second socket:
256-283
$ cat cpu112/topology/core_id
256
$ cat cpu223/topology/core_id
283

Expected values for the second socket:
28 - 55

(On the first socket, the core numbering is OK - 0-27)

I assume this is a Linux kernel bug. Have you seen anything like this in the
past? What might be the root cause? A Linux kernel bug or perhaps a BIOS issue?

We see it on 5.3.0-0.rc7 and 4.18 kernels. I'm attaching lstopo and 
gather-topology output. I would appreciate any feedback on that.

Thank you!
Jirka


[0]
https://en.wikichip.org/wiki/cavium/thunderx2






--
Jeff Squyres
jsquy...@cisco.com


Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Brice Goglin
Hello Jirka

I don't think there's a bug here.

physical_package_id values don't have to be between 0 and N-1; they just
have to be different so that packages, and cores in different packages,
can be distinguished. Other values are uncommon on x86 but quite common
on POWER at least.
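
As a small illustration of why arbitrary package IDs are harmless, the
Python sketch below maps whatever physical IDs the kernel reports to
compact indexes 0..N-1. Sorting by physical ID is an assumption of this
sketch, not how hwloc assigns its logical indexes, and the function name
is hypothetical.

```python
def logical_package_index(physical_ids):
    """Map each distinct physical package ID to an index in 0..N-1.

    Ordering by ascending physical ID is an assumption of this sketch;
    the only property relied upon is that the IDs are distinct.
    """
    distinct = sorted(set(physical_ids))
    return {phys: logical for logical, phys in enumerate(distinct)}


# physical_package_id values reported in this thread for the 2x ThunderX2 box
mapping = logical_package_index([36, 3180])
print(mapping)
```

With the two values from the original report, 36 becomes package 0 and
3180 becomes package 1.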

core_id values are even worse. Fortunately, they are basically not used
at all. They are often the same in both sockets. They are often
discontiguous inside sockets (maybe because CPU vendors disable specific
cores in the middle of the CPU when your CPU doesn't have the max number
of cores). On a dual-socket 20-core Xeon (Cascade Lake), both sockets have
these core_ids: 0,4,1,3,2,12,8,11,9,10,16,20,17,19,18,28,24,27,25,26 (5-7,
13-15 and 21-23 are missing).
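
The gaps in that list can be checked with a few lines of Python. This is
only a sketch built on the core_id list quoted above; the helper names
missing_ids and to_ranges are hypothetical.

```python
# core_id values quoted above for one socket of the 20-core Cascade Lake Xeon
CORE_IDS = [0, 4, 1, 3, 2, 12, 8, 11, 9, 10,
            16, 20, 17, 19, 18, 28, 24, 27, 25, 26]


def missing_ids(ids):
    """Return the IDs absent from 0..max(ids), in ascending order."""
    present = set(ids)
    return [i for i in range(max(present) + 1) if i not in present]


def to_ranges(values):
    """Collapse integers into 'a-b' strings for consecutive runs."""
    ranges = []
    for v in sorted(values):
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1][1] = v  # extend the current run
        else:
            ranges.append([v, v])  # start a new run
    return [f"{a}-{b}" if a != b else str(a) for a, b in ranges]


gaps = to_ranges(missing_ids(CORE_IDS))
print(len(CORE_IDS), "cores, highest core_id", max(CORE_IDS))
print("missing:", ", ".join(gaps))
```

Twenty cores span core_ids 0 to 28, so any code that treats core_id as an
array index would be wrong on this machine.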

PUs and NUMA nodes often have contiguous OS indexes, but not necessarily
in order either.

FWIW, I get the same values as yours on a Gigabyte platform with 2x
ThunderX2 running RHEL7 4.14 kernel.

Brice



On 06/09/2019 at 15:29, Jiri Hladky wrote:
> Hi all! 
>
> We are seeing strange CPU topology/numbering on a dual-socket ARM
> server with 2×ThunderX2 CN9975 CPU [0].
>
> Package IDs:
> 36 and 3180
> $ cd /sys/devices/system/cpu
> $ cat cpu0/topology/physical_package_id
> 36
> Expected values: 0 and 1
>
> Core IDs on the second socket:
> 256-283
> $ cat cpu112/topology/core_id
> 256
> $ cat cpu223/topology/core_id
> 283
>
> Expected values for the second socket:
> 28 - 55
>
> (On the first socket, the core numbering is OK - 0-27)
>
> I assume this is a Linux kernel bug. Have you seen anything like this in
> the past? What might be the root cause? A Linux kernel bug or perhaps a
> BIOS issue?
>
> We see it on 5.3.0-0.rc7 and 4.18 kernels. I'm attaching lstopo and
> gather-topology output. I would appreciate any feedback on that. 
>
> Thank you!
> Jirka
>
>
> [0]
> https://en.wikichip.org/wiki/cavium/thunderx2
>
>